2024-05-01 15:13:50

by Tejun Heo

Subject: [PATCHSET v6] sched: Implement BPF extensible scheduler class

Updates and Changes
-------------------

This is v6 of sched_ext (SCX) patchset.

During the past five months, both the development and adoption of sched_ext
have been progressing briskly. Here are some highlights around adoption:

- Valve has been working with Igalia to implement a sched_ext scheduler for
Steam Deck. The development is still in its early stages but they are
already happy with the results (more consistent FPS) and are planning to
enable the scheduler on Steam Deck.

https://github.com/sched-ext/scx/tree/main/scheds/rust/scx_lavd
https://ossna2024.sched.com/event/1aBOT/optimizing-scheduler-for-linux-gaming-changwoo-min-igalia

- Ubuntu is considering including sched_ext in the upcoming 24.10 release.
Andrea Righi of Canonical has been actively working on a userspace
scheduling framework since the end of last year.

https://github.com/sched-ext/scx/tree/main/scheds/rust/scx_rustland
https://discourse.ubuntu.com/t/introducing-kernel-6-8-for-the-24-04-noble-numbat-release/41958

- We (Meta) are now deploying a sched_ext scheduler on one large production
workload (web), conducting a wide-scale verification benchmark on another
(ads), and preparing production deployment on yet another workload (ML
training). These are all using scx_layered, which is useful for quick
prototyping and experiments. In the process, we've identified several
common strategies which are useful across multiple workloads (e.g.
soft-affinitizing related threads) and are in the process of implementing
something more generic and polished.

https://github.com/sched-ext/scx/tree/main/scheds/rust/scx_layered

- Because Google's ghOSt framework (userspace scheduling framework with BPF
hooks for optimization) is already available in the Google fleet, that's
what Google is currently experimenting with. They are seeing promising
results in a couple of important workloads (search and cloud hosting) and
are working to move on to deployment. The gap between ghOSt and sched_ext is
not wide at this point and Google is working to port ghOSt schedulers on
top of sched_ext.

- ChromeOS is looking into scx_layered with focus on reducing latency (as a
replacement to RT) with a prototype port of sched_ext on ChromeOS.

- Oculus is facing a number of scheduling related challenges and looking
into sched_ext. They have an Android port that they're experimenting with,
although any actual deployment would have to wait until a newer platform
kernel can be rolled out, which will take quite a while.

Although sched_ext is still out of tree, we're seeing wide interest and
adoption across multiple organizations and different use cases. Plus, our
first-hand experiences at Meta and the reports from other users definitively
confirm the hypothesized merits of sched_ext - among others, a lowered barrier
to entry coupled with rapid and safe experiments, leading to insights and
performance gains in an easily deployable form.

For example, scx_rusty and scx_layered are showing substantial benefits from
better work conservation within L3 domains, soft affinity (flexibly grouping
related threads together) and application-specific prioritization for
request-driven server workloads. scx_lavd is demonstrating the benefits of a
new set of heuristics, based on each task's runtime and how frequently it
wakes others and is woken up, for gaming and possibly other interactive
workloads.

We're seeing constantly increasing interest both within Meta and from the
wider community. The benefits are inherent and clear enough that I don't see
a reason for the trend to change. However, being out of tree does add a lot
of overhead in terms of decision making and logistics for everyone involved.

Given that there already is substantial adoption which continues to grow and
sched_ext doesn't affect the built-in schedulers or the rest of kernel in an
invasive manner, I believe it's reasonable to consider sched_ext for
inclusion. David Vernet and I would be happy to run the tree, respond to bug
reports, and coordinate with the scheduler core or any other kernel subsystem
that sched_ext may interact with.


If you're interested in the high-level arguments for and against, please
refer to the following discussion in the v4 posting:

http://lkml.kernel.org/r/[email protected]


If you're interested in getting your hands dirty, the following repository
contains example and practical schedulers along with documentation on how to
get started:

https://github.com/sched-ext/scx

The kernel and scheduler packages are available for Ubuntu, CachyOS and Arch
(through CachyOS repo). Fedora packaging is in the works.

There are also a Slack workspace and a weekly office hour:

https://schedextworkspace.slack.com

Office Hour: Mondays at 16:00 UTC (8:00 AM PST, 17:00 CET, 1:00 AM KST).
Please see the #office-hours channel for the zoom invite.


The following are the significant changes from v5
(http://lkml.kernel.org/r/[email protected]). For a more
detailed list of changes, please refer to the patch descriptions.

- scx_pair, scx_userland, scx_next and the rust schedulers have been removed
from the kernel tree and are now hosted at https://github.com/sched-ext/scx
along with all other schedulers.

- SCX_OPS_DISABLING state is replaced with the new bypass mechanism which
allows temporarily putting the system into simple FIFO scheduling mode. In
addition to the shutdown path, this is used to isolate the BPF scheduler
across power management events.

- ops.prep_enable() is replaced with ops.init_task() and
ops.enable/disable() are now called whenever the task enters and leaves
sched_ext instead of when the task becomes schedulable on sched_ext and
stops being so. A new operation - ops.exit_task() - is called when the
task stops being schedulable on sched_ext.

- scx_bpf_dispatch() can now be called from ops.select_cpu() too. This
removes the need to communicate a local dispatch decision made by
ops.select_cpu() to ops.enqueue() via per-task storage (see the sketch after
this list).

- SCX_TASK_ENQ_LOCAL, which told the BPF scheduler that scx_select_cpu_dfl()
wants the task to be dispatched to the local DSQ, was removed. Instead, the
built-in scx_select_cpu_dfl() now dispatches directly if it finds a suitable
idle CPU. If such behavior is not desired, schedulers can implement
ops.select_cpu() using scx_bpf_select_cpu_dfl(), which returns the verdict in
a bool out param instead of dispatching.

- Dispatch decisions made in ops.dispatch() may now be cancelled with a new
scx_bpf_dispatch_cancel() kfunc.

- A new SCX_KICK_IDLE flag is available for use with scx_bpf_kick_cpu() to
only send a resched IPI if the target CPU is idle.

- exit_code added to scx_exit_info. This is used to indicate different exit
conditions on non-error exits and enables e.g. handling CPU hotplugs by
restarting the scheduler.

- Debug dump added. When the BPF scheduler gets aborted, the states of all
runqueues and runnable tasks are captured and sent to the scheduler binary
to aid debugging. See https://github.com/sched-ext/scx/issues/234 for an
example of the debug dump being used to root cause a bug in scx_lavd.

- The BPF scheduler can now iterate DSQs and consume specific tasks.

- CPU frequency scaling support added through cpuperf kfunc interface.

- The current state of sched_ext can now be monitored through files under
/sys/sched_ext instead of /sys/kernel/debug/sched/ext. This is to enable
monitoring on kernels which don't enable debugfs. A drgn script
tools/sched_ext/scx_show_state.py is added for additional visibility.

- tools/sched_ext/include/scx/compat[.bpf].h and other facilities to allow
schedulers to be loaded on older kernels are added. The current tentative
target is maintaining backward compatibility for at least one major kernel
release where reasonable.

- Code reorganized so that only the parts necessary to integrate with the
rest of the kernel are in the header files.
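
To illustrate the ops.select_cpu() related items above, here is a hedged
sketch of an ops.select_cpu() implementation in the style of this patchset's
example schedulers. The kfunc and macro names are taken from tools/sched_ext
and may still change; the "sketch_" prefix is purely illustrative:

#include <scx/common.bpf.h>

s32 BPF_STRUCT_OPS(sketch_select_cpu, struct task_struct *p,
		   s32 prev_cpu, u64 wake_flags)
{
	bool found_idle = false;
	s32 cpu;

	/* built-in idle CPU selection; reports its verdict via found_idle */
	cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &found_idle);
	if (found_idle)
		/* dispatch here so the decision needn't be relayed to .enqueue() */
		scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);

	return cpu;
}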


Overview
--------

This patch set proposes a new scheduler class called ‘ext_sched_class’, or
sched_ext, which allows scheduling policies to be implemented as BPF programs.

More details will be provided on the overall architecture of sched_ext
throughout the various patches in this set, as well as in the “How” section
below. We realize that this patch set is a significant proposal, so we will be
going into depth in the following “Motivation” section to explain why we think
it’s justified. That section is laid out as follows, touching on three main
axes where we believe that sched_ext provides significant value:

1. Ease of experimentation and exploration: Enabling rapid iteration of new
scheduling policies.

2. Customization: Building application-specific schedulers which implement
policies that are not applicable to general-purpose schedulers.

3. Rapid scheduler deployments: Non-disruptive swap outs of scheduling
policies in production environments.

After the motivation section, we’ll provide a more detailed (but still
high-level) overview of how sched_ext works.


Motivation
----------

1. Ease of experimentation and exploration

*Why is exploration important?*

Scheduling is a challenging problem space. Small changes in scheduling
behavior can have a significant impact on various components of a system, with
the corresponding effects varying widely across different platforms,
architectures, and workloads.

While complexities have always existed in scheduling, they have increased
dramatically over the past 10-15 years. In the mid-late 2000s, cores were
typically homogeneous and further apart from each other, with the criteria for
scheduling being roughly the same across the entire die.

Systems in the modern age are by comparison much more complex. Modern CPU
designs, where the total power budget of all CPU cores often far exceeds the
power budget of the socket, with dynamic frequency scaling, and with or
without chiplets, have significantly expanded the scheduling problem space.
Cache hierarchies have become less uniform, with Core Complex (CCX) designs
such as recent AMD processors having multiple shared L3 caches within a single
socket. Such topologies resemble NUMA sans persistent NUMA node stickiness.

Use-cases have become increasingly complex and diverse as well. Applications
such as mobile and VR have strict latency requirements to avoid missing
deadlines that impact user experience. Stacking workloads in servers is
constantly pushing the demands on the scheduler in terms of workload isolation
and resource distribution.

Experimentation and exploration are important for any non-trivial problem
space. However, given the recent hardware and software developments, we
believe that experimentation and exploration are not just important, but
_critical_ in the scheduling problem space.

Indeed, other approaches in industry are already being explored. AMD has
proposed an experimental patch set [0] which enables userspace to provide
hints to the scheduler via “Userspace Hinting”. The approach adds a prctl()
API which allows callers to set a numerical “hint” value on a struct
task_struct. This hint is then optionally read by the scheduler to adjust the
cost calculus for various scheduling decisions.

[0]: https://lore.kernel.org/lkml/[email protected]/

Huawei have also expressed interest [1] in enabling some form of programmable
scheduling. While we’re unaware of any patch sets which have been sent to the
upstream list for this proposal, it similarly illustrates the need for more
flexibility in the scheduler.

[1]: https://lore.kernel.org/bpf/[email protected]/

Additionally, Google has developed ghOSt [2] with the goal of enabling custom,
userspace driven scheduling policies. Prior presentations at LPC [3] have
discussed ghOSt and how BPF can be used to accelerate scheduling.

[2]: https://dl.acm.org/doi/pdf/10.1145/3477132.3483542
[3]: https://lpc.events/event/16/contributions/1365/

*Why can’t we just explore directly with CFS?*

Experimenting with CFS directly or implementing a new sched_class from scratch
is of course possible, but is often difficult and time consuming. Newcomers to
the scheduler often require years to understand the codebase and become
productive contributors. Even for seasoned kernel engineers, experimenting
with and upstreaming features can take a very long time. The iteration process
itself is also time consuming, as testing scheduler changes on real hardware
requires reinstalling the kernel and rebooting the host.

Core scheduling is an example of a feature that took a significant amount of
time and effort to integrate into the kernel. Part of the difficulty with core
scheduling was the inherent mismatch in abstraction between the desire to
perform core-wide scheduling, and the per-cpu design of the kernel scheduler.
This caused issues, for example in ensuring proper fairness between the
independent runqueues of SMT siblings.

The high barrier to entry for working on the scheduler is an impediment to
academia as well. Master’s/PhD candidates who are interested in improving the
scheduler will spend years ramping-up, only to complete their degrees just as
they’re finally ready to make significant changes. A lower entrance barrier
would allow researchers to more quickly ramp up, test out hypotheses, and
iterate on novel ideas. Research methodology is also severely hampered by the
high barrier of entry to make modifications; for example, the Shenango [4] and
Shinjuku scheduling policies used sched affinity to replicate the desired
policy semantics, due to the difficulty of incorporating these policies into
the kernel directly.

[4]: https://www.usenix.org/system/files/nsdi19-ousterhout.pdf

The iterative process itself also imposes a significant cost to working on the
scheduler. Testing changes requires developers to recompile and reinstall the
kernel, reboot their machines, rewarm their workloads, and then finally rerun
their benchmarks. Though some of this overhead could potentially be mitigated
by enabling schedulers to be implemented as kernel modules, a machine crash or
subtle system state corruption is always only one innocuous mistake away.
These problems are exacerbated when testing production workloads in a
datacenter environment as well, where multiple hosts may be involved in an
experiment, requiring a significantly longer ramp-up time. Warming up memcache
instances in the Meta production environment takes hours, for example.

*How does sched_ext help with exploration?*

sched_ext attempts to address all of the problems described above. In this
section, we’ll describe the benefits to experimentation and exploration that
are afforded by sched_ext, provide real-world examples of those benefits, and
discuss some of the trade-offs and considerations in our design choices.

One of our main goals was to lower the barrier to entry for experimenting
with the scheduler. sched_ext provides ergonomic callbacks and helpers to
ease common operations such as managing idle CPUs, scheduling tasks on
arbitrary CPUs, handling preemptions from other scheduling classes, and
more. While sched_ext does require some ramp-up, the complexity is
self-contained, and the learning curve gradual. Developers can ramp up by
first implementing simple policies such as global weighted vtime scheduling
in only tens of lines of code, and then continue to learn the APIs and
building blocks available with sched_ext as they build more featureful and
complex schedulers.
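
As a hedged illustration of how small such a policy can be, here is a sketch
of a complete, if trivial, scheduler in the spirit of the included scx_simple
example - a single global FIFO rather than weighted vtime, to keep it short.
Callback, kfunc and section names follow this patchset's tools/sched_ext
conventions and may change before merging:

#include <scx/common.bpf.h>

char _license[] SEC("license") = "GPL";

void BPF_STRUCT_OPS(minimal_enqueue, struct task_struct *p, u64 enq_flags)
{
	/* every runnable task goes onto the shared global FIFO */
	scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
}

SEC(".struct_ops.link")
struct sched_ext_ops minimal_ops = {
	.enqueue	= (void *)minimal_enqueue,
	.name		= "minimal_fifo",
};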

Another critical advantage provided by sched_ext is the use of BPF. BPF
provides strong safety guarantees by statically analyzing programs at load
time to ensure that they cannot corrupt or crash the system. sched_ext
guarantees system integrity no matter what BPF scheduler is loaded, and
provides mechanisms to safely disable the current BPF scheduler and migrate
tasks back to a trusted scheduler. For example, we also implement in-kernel
safety mechanisms to guarantee that a misbehaving scheduler cannot
indefinitely starve tasks. BPF also enables sched_ext to significantly improve
iteration speed for running experiments. Loading and unloading a BPF scheduler
is simply a matter of running and terminating a sched_ext binary.
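
For instance, the userspace side of such a scheduler can be little more than
the following hedged sketch, which loads a hypothetical minimal.bpf.c (such as
the one sketched above) through a libbpf skeleton and keeps it attached until
the process exits. The skeleton and map names are assumptions for
illustration; the actual tools use additional compatibility helpers:

#include <unistd.h>
#include <bpf/libbpf.h>
#include "minimal.bpf.skel.h"

int main(void)
{
	struct minimal_bpf *skel;
	struct bpf_link *link;

	skel = minimal_bpf__open_and_load();
	if (!skel)
		return 1;

	/* attaching the struct_ops map is what enables the BPF scheduler */
	link = bpf_map__attach_struct_ops(skel->maps.minimal_ops);
	if (!link) {
		minimal_bpf__destroy(skel);
		return 1;
	}

	pause();			/* scheduler stays loaded until we exit */

	bpf_link__destroy(link);	/* tasks fall back to the default class */
	minimal_bpf__destroy(skel);
	return 0;
}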

BPF also provides programs with a rich set of APIs, such as maps, kfuncs,
and BPF helpers. In addition to providing useful building blocks to programs
that run entirely in kernel space (such as many of our example schedulers),
these APIs also allow programs to leverage user space in making scheduling
decisions. Specifically, the Atropos sample scheduler has a relatively
simple weighted vtime or FIFO scheduling layer in BPF, paired with a load
balancing component in userspace written in Rust. As described in more
detail below, we also built a more general user-space scheduling framework
called “rhone” by leveraging various BPF features.

On the other hand, BPF does have shortcomings, as can be plainly seen from
the complexity in some of the example schedulers. scx_pair.bpf.c illustrates
this point well. To start, it requires a good amount of code to emulate
cgroup-local-storage. In the kernel proper, this would simply be a matter of
adding another pointer to the struct cgroup, but in BPF, it requires a
complex juggling of data amongst multiple different maps, a good amount of
boilerplate code, and some unwieldy bpf_loop()‘s and atomics. The code is
also littered with explicit and often unnecessary sanity checks to appease
the verifier.

That being said, BPF is being rapidly improved. For example, Yonghong Song
recently upstreamed a patch set [5] to add a cgroup local storage map type,
allowing scx_pair.bpf.c to be simplified. There are plans to address other
issues as well, such as providing statically-verified locking, and avoiding
the need for unnecessary sanity checks. Addressing these shortcomings is a
high priority for BPF, and as progress continues to be made, we expect most
deficiencies to be addressed in the not-too-distant future.

[5]: https://lore.kernel.org/bpf/[email protected]/

Yet another exploration advantage of sched_ext is that it helps widen the
scope of experiments. For example, sched_ext makes it easy to defer CPU assignment
until a task starts executing, allowing schedulers to share scheduling queues
at any granularity (hyper-twin, CCX and so on). Additionally, higher level
frameworks can be built on top to further widen the scope. For example, the
aforementioned “rhone” [6] library allows implementing scheduling policies in
user-space by encapsulating the complexity around communicating scheduling
decisions with the kernel. This allows taking advantage of a richer
programming environment in user-space, enabling experimenting with, for
instance, more complex mathematical models.

[6]: https://github.com/Decave/rhone

sched_ext also allows developers to leverage machine learning. At Meta, we
experimented with using machine learning to predict whether a running task
would soon yield its CPU. These predictions can be used to aid the scheduler
in deciding whether to keep a runnable task on its current CPU rather than
migrating it to an idle CPU, with the hope of avoiding unnecessary cache
misses. Using a tiny neural net model with only one hidden layer of size 16,
and a decaying count of 64 syscalls as a feature, we were able to achieve a
15% throughput improvement on an Nginx benchmark, with an 87% inference
accuracy.

2. Customization

This section discusses how sched_ext can enable users to run workloads on
application-specific schedulers.

*Why deploy custom schedulers rather than improving CFS?*

Implementing application-specific schedulers and improving CFS are not
conflicting goals. Scheduling features explored with sched_ext which yield
beneficial results, and which are sufficiently generalizable, can and should
be integrated into CFS. However, CFS is fundamentally designed to be a general
purpose scheduler, and thus is not conducive to being extended with some
highly targeted application or hardware specific changes.

Targeted, bespoke scheduling has many potential use cases. For example, VM
scheduling can make certain optimizations that are infeasible in CFS due to
the constrained problem space (scheduling a static number of long-running
VCPUs versus an arbitrary number of threads). Additionally, certain
applications might want to make targeted policy decisions based on hints
directly from the application (for example, a service that knows the different
deadlines of incoming RPCs).

Google has also experimented with some promising, novel scheduling policies.
One example is “central” scheduling, wherein a single CPU makes all
scheduling decisions for the entire system. This allows most cores on the
system to be fully dedicated to running workloads, and can have significant
performance improvements for certain use cases. For example, central
scheduling with VCPUs can avoid expensive vmexits and cache flushes, by
instead delegating the responsibility of preemption checks from the tick to
a single CPU. See scx_central.bpf.c for a simple example of a central
scheduling policy built in sched_ext.

Some workloads also have non-generalizable constraints which enable
optimizations in a scheduling policy which would otherwise not be feasible.
For example, VM workloads at Google typically have a low overcommit ratio
compared to the number of physical CPUs. This allows the scheduler to support
bounded tail latencies, as well as longer blocks of uninterrupted time.

Yet another interesting use case is the scx_flatcg scheduler, which is in
0029-sched_ext-Add-a-cgroup-scheduler-which-uses-flattene.patch and provides
a flattened hierarchical vtree for cgroups. This scheduler does not account
for thundering herd problems among cgroups, and therefore may not be suitable
for inclusion in CFS. However, in a simple benchmark using wrk [7] on apache
serving a CGI script calculating the sha1sum of a small file, it outperformed
CFS by ~3% with the CPU controller disabled and by ~10% with two apache
instances competing with a 2:1 weight ratio nested four levels deep.

[7] https://github.com/wg/wrk

Certain industries require specific scheduling behaviors that do not apply
broadly. For example, ARINC 653 defines scheduling behavior that is widely
used by avionic software, and some out-of-tree implementations
(https://ieeexplore.ieee.org/document/7005306) have been built. While the
upstream community may decide to merge one such implementation in the future,
it would also be entirely reasonable to not do so given the narrowness of
use-case, and non-generalizable, strict requirements. Such cases can be well
served by sched_ext in all stages of the software development lifecycle --
development, testing, deployment and maintenance.

There are also classes of policy exploration, such as machine learning, or
responding in real-time to application hints, that are significantly harder
(and not necessarily appropriate) to integrate within the kernel itself.

*Won’t this increase fragmentation?*

We acknowledge that to some degree, sched_ext does run the risk of
increasing the fragmentation of scheduler implementations. As a result of
exploration, however, we believe that enabling the larger ecosystem to
innovate will ultimately accelerate the overall development and performance
of Linux.

BPF programs are required to be GPLv2, which is enforced by the verifier on
program loads. With regards to API stability, just as with other semi-internal
interfaces such as BPF kfuncs, we won’t be providing any API stability
guarantees to BPF schedulers. While we intend to make an effort to provide
compatibility when possible, we will not provide any explicit, strong
guarantees as the kernel typically does with e.g. UAPI headers. For users who
decide to keep their schedulers out-of-tree, the licensing and maintenance
overheads will be fundamentally the same as for carrying out-of-tree patches.

With regards to the schedulers included in this patch set, and any other
schedulers we implement in the future, both Meta and Google will open-source
all of the schedulers we implement which have any relevance to the broader
upstream community. We expect that some of these, such as the simple example
schedulers and scx_rusty scheduler, will be upstreamed as part of the kernel
tree. Distros will be able to package and release these schedulers with the
kernel, allowing users to utilize these schedulers out-of-the-box without
requiring any additional work or dependencies such as clang or building the
scheduler programs themselves. Other schedulers and scheduling frameworks
such as rhone may be open-sourced through separate per-project repos.

3. Rapid scheduler deployments

Rolling out kernel upgrades is a slow and iterative process. At a large scale
it can take months to roll a new kernel out to a fleet of servers. While this
latency is expected and inevitable for normal kernel upgrades, it can become
highly problematic when kernel changes are required to fix bugs. Livepatch [8]
is available to quickly roll out critical security fixes to large fleets, but
the scope of changes that can be applied with livepatching is fairly limited,
and would likely not be usable for patching scheduling policies. With
sched_ext, new scheduling policies can be rapidly rolled out to production
environments.

[8]: https://www.kernel.org/doc/html/latest/livepatch/livepatch.html

As an example, one of the variants of the L1 Terminal Fault (L1TF) [9]
vulnerability allows a VCPU running a VM to read arbitrary host kernel
memory for pages in L1 data cache. The solution was to implement core
scheduling, which ensures that tasks running as hypertwins have the same
“cookie”.

[9]: https://www.intel.com/content/www/us/en/architecture-and-technology/l1tf.html

While core scheduling works well, it took a long time to finalize and land
upstream. This long rollout period was painful, and required organizations to
make difficult choices amongst a bad set of options. Some companies such as
Google chose to implement and use their own custom L1TF-safe scheduler, others
chose to run without hyper-threading enabled, and yet others left
hyper-threading enabled and crossed their fingers.

Once core scheduling was upstream, organizations had to upgrade the kernels on
their entire fleets. As downtime is not an option for many, these upgrades had
to be gradually rolled out, which can take a very long time for large fleets.

An example of a sched_ext scheduler that illustrates core scheduling
semantics is scx_pair.bpf.c, which co-schedules pairs of tasks from the same
cgroup, and is resilient to L1TF vulnerabilities. While this example
scheduler is certainly not suitable for production in its current form, a
similar scheduler that is more performant and featureful could be written
and deployed if necessary.

Rapid scheduling deployments can similarly be useful to quickly roll-out new
scheduling features without requiring kernel upgrades. At Google, for example,
it was observed that some low-priority workloads were causing degraded
performance for higher-priority workloads due to consuming a disproportionate
share of memory bandwidth. While a temporary mitigation was to use sched
affinity to limit the footprint of this low-priority workload to a small
subset of CPUs, a preferable solution would be to implement a more featureful
task-priority mechanism which automatically throttles lower-priority tasks
which are causing memory contention for the rest of the system. Implementing
this in CFS and rolling it out to the fleet could take a very long time.

sched_ext would directly address these gaps. If another hardware bug or
resource contention issue comes in that requires scheduler support to
mitigate, sched_ext can be used to experiment with and test different
policies. Once a scheduler is available, it can quickly be rolled out to as
many hosts as necessary, and function as a stop-gap solution until a
longer-term mitigation is upstreamed.


How
---

sched_ext is a new sched_class which allows scheduling policies to be
implemented in BPF programs.

sched_ext leverages BPF’s struct_ops feature to define a structure which
exports function callbacks and flags to BPF programs that wish to implement
scheduling policies. The struct_ops structure exported by sched_ext is struct
sched_ext_ops, and is conceptually similar to struct sched_class. The role of
sched_ext is to map the complex sched_class callbacks to the simpler and more
ergonomic struct sched_ext_ops callbacks.

Unlike some other BPF program types which have ABI requirements due to
exporting UAPIs, struct_ops has no ABI requirements whatsoever. This provides
us with the flexibility to change the APIs provided to schedulers as
necessary. BPF struct_ops is also already being used successfully in other
subsystems, such as in support of TCP congestion control.

The only struct_ops field that is required to be specified by a scheduler is
the ‘name’ field. Otherwise, sched_ext will provide sane default behavior,
such as automatically choosing an idle CPU on the task wakeup path if
select_cpu() is missing.
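
As a hedged, minimal illustration, a loadable scheduler that relies entirely
on those defaults would look roughly as follows (section and type names as
used by this patchset's tooling):

#include <scx/common.bpf.h>

char _license[] SEC("license") = "GPL";

SEC(".struct_ops.link")
struct sched_ext_ops trivial_ops = {
	/* no callbacks set: global FIFO plus default idle CPU selection */
	.name = "trivial",
};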

*Dispatch queues*

To bridge the workflow imbalance between the scheduler core and sched_ext_ops
callbacks, sched_ext uses simple FIFOs called dispatch queues (dsq's). By
default, there is one global dsq (SCX_DSQ_GLOBAL), and one local per-CPU dsq
(SCX_DSQ_LOCAL). SCX_DSQ_GLOBAL is provided for convenience and need not be
used by a scheduler that doesn't require it. As described in more detail
below, SCX_DSQ_LOCAL is the per-CPU FIFO that sched_ext pulls from when
putting the next task on the CPU. The BPF scheduler can manage an arbitrary
number of dsq's using scx_bpf_create_dsq() and scx_bpf_destroy_dsq().
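
For example, a scheduler might set up one custom dsq when it is loaded and
tear it down on exit, roughly as in the hedged sketch below. dsq's are
typically created from the sleepable ops.init() callback; the dsq id is an
arbitrary illustrative constant:

#include <scx/common.bpf.h>

#define SKETCH_DSQ_ID 0		/* arbitrary user-defined dsq id */

s32 BPF_STRUCT_OPS_SLEEPABLE(sketch_init)
{
	/* -1: no NUMA node preference for the dsq */
	return scx_bpf_create_dsq(SKETCH_DSQ_ID, -1);
}

void BPF_STRUCT_OPS(sketch_exit, struct scx_exit_info *ei)
{
	scx_bpf_destroy_dsq(SKETCH_DSQ_ID);
}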

*Scheduling cycle*

The following briefly shows a typical workflow for how a waking task is
scheduled and executed.

1. When a task is waking up, .select_cpu() is the first operation invoked.
This serves two purposes: it allows the scheduler to optimize task
placement by specifying a CPU where it expects the task to eventually be
scheduled, and it ensures that the selected CPU is woken up if it's
idle.

2. Once the target CPU is selected, .enqueue() is invoked. It can make one of
the following decisions:

- Immediately dispatch the task to either the global dsq (SCX_DSQ_GLOBAL)
or the current CPU’s local dsq (SCX_DSQ_LOCAL).

- Immediately dispatch the task to a user-created dispatch queue.

- Queue the task on the BPF side, e.g. in an rbtree map for a vruntime
scheduler, with the intention of dispatching it at a later time from
.dispatch().

3. When a CPU is ready to schedule, it first looks at its local dsq. If empty,
it invokes .consume() which should make one or more scx_bpf_consume() calls
to consume tasks from dsq's. If a scx_bpf_consume() call succeeds, the CPU
has the next task to run and .consume() can return. If .consume() is not
defined, sched_ext will by default consume from only the built-in
SCX_DSQ_GLOBAL dsq.

4. If there's still no task to run, .dispatch() is invoked which should make
one or more scx_bpf_dispatch() calls to dispatch tasks from the BPF
scheduler to one of the dsq's. If more than one task has been dispatched,
go back to the previous consumption step.
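
Putting steps 2 and 4 together, here is a hedged sketch in the spirit of the
included scx_qmap example: .enqueue() stashes runnable tasks in a BPF queue
map and .dispatch() later pulls them onto the local dsq. Map, callback and
constant names are illustrative:

#include <scx/common.bpf.h>

/* FIFO of runnable pids kept on the BPF side (step 2, third option) */
struct {
	__uint(type, BPF_MAP_TYPE_QUEUE);
	__uint(max_entries, 4096);
	__type(value, s32);
} sketch_pids SEC(".maps");

void BPF_STRUCT_OPS(sketch_enqueue, struct task_struct *p, u64 enq_flags)
{
	s32 pid = p->pid;

	/* fall back to the global FIFO if our queue is full */
	if (bpf_map_push_elem(&sketch_pids, &pid, 0))
		scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
}

void BPF_STRUCT_OPS(sketch_dispatch, s32 cpu, struct task_struct *prev)
{
	struct task_struct *p;
	s32 pid;

	if (bpf_map_pop_elem(&sketch_pids, &pid))
		return;

	/* the task may have exited since it was queued */
	p = bpf_task_from_pid(pid);
	if (!p)
		return;

	scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
	bpf_task_release(p);
}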

*Verifying callback behavior*

sched_ext always verifies that any value returned from a callback is valid,
and will issue an error and unload the scheduler if it is not, for example if
.select_cpu() returns an invalid CPU or if scx_bpf_dispatch() is invoked with
invalid enqueue flags. Furthermore, if a task remains
runnable for too long without being scheduled, sched_ext will detect it and
error-out the scheduler.


Closing Thoughts
----------------

Both Meta and Google have experimented quite a lot with schedulers in the
last several years. Google has benchmarked various workloads using user-space
scheduling and has achieved performance wins by trading off generality for
application-specific needs. At Meta, we are actively
experimenting with multiple production workloads and seeing significant
performance gains, and are in the process of deploying sched_ext schedulers
on production workloads at scale. We expect to leverage it extensively to
run various experiments and develop customized schedulers for a number of
critical workloads.

In closing, both Meta and Google believe that sched_ext will significantly
evolve how the broader community explores the scheduling problem space,
while also enabling targeted policies for custom applications. We’ll be able
to experiment more easily and quickly, explore uncharted areas, and deploy
emergency scheduler changes when necessary. The same applies to anyone who
wants to work on the scheduler, including academia and specialized
industries. sched_ext will push forward the state of the art when it comes
to scheduling and performance in Linux.


Written By
----------

David Vernet <[email protected]>
Josh Don <[email protected]>
Tejun Heo <[email protected]>
Barret Rhoden <[email protected]>


Supported By
------------

Paul Turner <[email protected]>
Neel Natu <[email protected]>
Patrick Bellasi <[email protected]>
Hao Luo <[email protected]>
Dimitrios Skarlatos <[email protected]>


Patchset
--------

This patchset is on top of bpf/for-next as of 2024-04-29:

07801a24e2f1 ("bpf, docs: Clarify PC use in instruction-set.rst")

and contains the following patches:

NOTE: The doc added by 0038 contains a high-level overview and might be a
good place to start.

0001-cgroup-Implement-cgroup_show_cftypes.patch
0002-sched-Restructure-sched_class-order-sanity-checks-in.patch
0003-sched-Allow-sched_cgroup_fork-to-fail-and-introduce-.patch
0004-sched-Add-sched_class-reweight_task.patch
0005-sched-Add-sched_class-switching_to-and-expose-check_.patch
0006-sched-Factor-out-cgroup-weight-conversion-functions.patch
0007-sched-Expose-css_tg-and-__setscheduler_prio.patch
0008-sched-Enumerate-CPU-cgroup-file-types.patch
0009-sched-Add-reason-to-sched_class-rq_-on-off-line.patch
0010-sched-Factor-out-update_other_load_avgs-from-__updat.patch
0011-cpufreq_schedutil-Refactor-sugov_cpu_is_busy.patch
0012-sched-Add-normal_policy.patch
0013-sched_ext-Add-boilerplate-for-extensible-scheduler-c.patch
0014-sched_ext-Implement-BPF-extensible-scheduler-class.patch
0015-sched_ext-Add-scx_simple-and-scx_example_qmap-exampl.patch
0016-sched_ext-Add-sysrq-S-which-disables-the-BPF-schedul.patch
0017-sched_ext-Implement-runnable-task-stall-watchdog.patch
0018-sched_ext-Allow-BPF-schedulers-to-disallow-specific-.patch
0019-sched_ext-Print-sched_ext-info-when-dumping-stack.patch
0020-sched_ext-Print-debug-dump-after-an-error-exit.patch
0021-tools-sched_ext-Add-scx_show_state.py.patch
0022-sched_ext-Implement-scx_bpf_kick_cpu-and-task-preemp.patch
0023-sched_ext-Add-a-central-scheduler-which-makes-all-sc.patch
0024-sched_ext-Make-watchdog-handle-ops.dispatch-looping-.patch
0025-sched_ext-Add-task-state-tracking-operations.patch
0026-sched_ext-Implement-tickless-support.patch
0027-sched_ext-Track-tasks-that-are-subjects-of-the-in-fl.patch
0028-sched_ext-Add-cgroup-support.patch
0029-sched_ext-Add-a-cgroup-scheduler-which-uses-flattene.patch
0030-sched_ext-Implement-SCX_KICK_WAIT.patch
0031-sched_ext-Implement-sched_ext_ops.cpu_acquire-releas.patch
0032-sched_ext-Implement-sched_ext_ops.cpu_online-offline.patch
0033-sched_ext-Bypass-BPF-scheduler-while-PM-events-are-i.patch
0034-sched_ext-Implement-core-sched-support.patch
0035-sched_ext-Add-vtime-ordered-priority-queue-to-dispat.patch
0036-sched_ext-Implement-DSQ-iterator.patch
0037-sched_ext-Add-cpuperf-support.patch
0038-sched_ext-Documentation-scheduler-Document-extensibl.patch
0039-sched_ext-Add-selftests.patch

0001 : Cgroup prep.

0002-0012: Scheduler prep.

0013-0015: sched_ext core implementation and a couple of example BPF schedulers.

0016-0021: Utility features including safety mechanisms, switch-all and
printing sched_ext state when dumping backtraces.

0022-0027: Kicking and preempting other CPUs, task state transition tracking
and tickless support. Demonstrated with an example central
scheduler which makes all scheduling decisions on one CPU.

0028-0030: cgroup support and the ability to wait for other CPUs after
kicking them.

0031-0033: Add CPU preemption and hotplug and power-management support.

0034 : Add core-sched support.

0035-0036: Add DSQ rbtree and iterator support.

0037 : Add cpuperf (frequency scaling) support.

0038 : Add documentation.

0039 : Add selftests.

The patchset is also available in the following git branch:

git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git sched_ext-v6

diffstat follows.

Documentation/scheduler/index.rst | 1
Documentation/scheduler/sched-ext.rst | 307 +
MAINTAINERS | 13
Makefile | 8
drivers/tty/sysrq.c | 1
include/asm-generic/vmlinux.lds.h | 1
include/linux/cgroup-defs.h | 8
include/linux/cgroup.h | 5
include/linux/sched.h | 5
include/linux/sched/ext.h | 210
include/linux/sched/task.h | 3
include/uapi/linux/sched.h | 1
init/Kconfig | 5
init/init_task.c | 12
kernel/Kconfig.preempt | 24
kernel/cgroup/cgroup.c | 97
kernel/fork.c | 17
kernel/sched/build_policy.c | 8
kernel/sched/core.c | 324 +
kernel/sched/cpufreq_schedutil.c | 50
kernel/sched/deadline.c | 4
kernel/sched/debug.c | 3
kernel/sched/ext.c | 6641 +++++++++++++++++++++++++++++++
kernel/sched/ext.h | 139
kernel/sched/fair.c | 25
kernel/sched/idle.c | 2
kernel/sched/rt.c | 4
kernel/sched/sched.h | 123
kernel/sched/topology.c | 4
lib/dump_stack.c | 1
tools/Makefile | 10
tools/sched_ext/.gitignore | 2
tools/sched_ext/Makefile | 246 +
tools/sched_ext/README.md | 270 +
tools/sched_ext/include/bpf-compat/gnu/stubs.h | 11
tools/sched_ext/include/scx/common.bpf.h | 301 +
tools/sched_ext/include/scx/common.h | 71
tools/sched_ext/include/scx/compat.bpf.h | 110
tools/sched_ext/include/scx/compat.h | 197
tools/sched_ext/include/scx/user_exit_info.h | 111
tools/sched_ext/scx_central.bpf.c | 361 +
tools/sched_ext/scx_central.c | 135
tools/sched_ext/scx_flatcg.bpf.c | 939 ++++
tools/sched_ext/scx_flatcg.c | 233 +
tools/sched_ext/scx_flatcg.h | 51
tools/sched_ext/scx_qmap.bpf.c | 673 +++
tools/sched_ext/scx_qmap.c | 150
tools/sched_ext/scx_show_state.py | 39
tools/sched_ext/scx_simple.bpf.c | 149
tools/sched_ext/scx_simple.c | 107
tools/testing/selftests/sched_ext/.gitignore | 6
tools/testing/selftests/sched_ext/Makefile | 216 +
tools/testing/selftests/sched_ext/config | 9
tools/testing/selftests/sched_ext/ddsp_bogus_dsq_fail.bpf.c | 42
tools/testing/selftests/sched_ext/ddsp_bogus_dsq_fail.c | 57
tools/testing/selftests/sched_ext/ddsp_vtimelocal_fail.bpf.c | 39
tools/testing/selftests/sched_ext/ddsp_vtimelocal_fail.c | 56
tools/testing/selftests/sched_ext/enq_last_no_enq_fails.bpf.c | 21
tools/testing/selftests/sched_ext/enq_last_no_enq_fails.c | 60
tools/testing/selftests/sched_ext/enq_select_cpu_fails.bpf.c | 43
tools/testing/selftests/sched_ext/enq_select_cpu_fails.c | 61
tools/testing/selftests/sched_ext/exit.bpf.c | 84
tools/testing/selftests/sched_ext/exit.c | 55
tools/testing/selftests/sched_ext/exit_test.h | 20
tools/testing/selftests/sched_ext/hotplug.bpf.c | 55
tools/testing/selftests/sched_ext/hotplug.c | 168
tools/testing/selftests/sched_ext/hotplug_test.h | 15
tools/testing/selftests/sched_ext/init_enable_count.bpf.c | 53
tools/testing/selftests/sched_ext/init_enable_count.c | 166
tools/testing/selftests/sched_ext/maximal.bpf.c | 164
tools/testing/selftests/sched_ext/maximal.c | 51
tools/testing/selftests/sched_ext/maybe_null.bpf.c | 26
tools/testing/selftests/sched_ext/maybe_null.c | 40
tools/testing/selftests/sched_ext/maybe_null_fail.bpf.c | 25
tools/testing/selftests/sched_ext/minimal.bpf.c | 21
tools/testing/selftests/sched_ext/minimal.c | 58
tools/testing/selftests/sched_ext/prog_run.bpf.c | 32
tools/testing/selftests/sched_ext/prog_run.c | 78
tools/testing/selftests/sched_ext/reload_loop.c | 75
tools/testing/selftests/sched_ext/runner.c | 201
tools/testing/selftests/sched_ext/scx_test.h | 131
tools/testing/selftests/sched_ext/select_cpu_dfl.bpf.c | 40
tools/testing/selftests/sched_ext/select_cpu_dfl.c | 72
tools/testing/selftests/sched_ext/select_cpu_dfl_nodispatch.bpf.c | 89
tools/testing/selftests/sched_ext/select_cpu_dfl_nodispatch.c | 72
tools/testing/selftests/sched_ext/select_cpu_dispatch.bpf.c | 41
tools/testing/selftests/sched_ext/select_cpu_dispatch.c | 70
tools/testing/selftests/sched_ext/select_cpu_dispatch_bad_dsq.bpf.c | 37
tools/testing/selftests/sched_ext/select_cpu_dispatch_bad_dsq.c | 56
tools/testing/selftests/sched_ext/select_cpu_dispatch_dbl_dsp.bpf.c | 38
tools/testing/selftests/sched_ext/select_cpu_dispatch_dbl_dsp.c | 56
tools/testing/selftests/sched_ext/select_cpu_vtime.bpf.c | 92
tools/testing/selftests/sched_ext/select_cpu_vtime.c | 59
tools/testing/selftests/sched_ext/test_example.c | 49
tools/testing/selftests/sched_ext/util.c | 71
tools/testing/selftests/sched_ext/util.h | 13
96 files changed, 15056 insertions(+), 139 deletions(-)


Patchset History
----------------

v4 (http://lkml.kernel.org/r/[email protected]) -> v5:

- Updated to rebase on top of the current bpf/for-next (2023-11-06).
'0002-0010: Scheduler prep' were simply rebased on top of the new EEVDF
scheduler, which demonstrates a clean-cut API boundary between sched_ext and
the sched core.

- To accommodate 32bit configs, fields which use atomic ops and
store_release/load_acquire are switched from 64bits to longs.

- To help triaging, if sched_ext is enabled, backtrace dumps now show the
currently active scheduler along with some debug information.

- Fixes for bugs including p->scx.flags corruption due to unsynchronized
SCX_TASK_DSQ_ON_PRIQ changes, and overly permissive BTF struct and scx_bpf
kfunc access checks.

- Other misc changes including renaming "type" to "kind" in scx_exit_info to
ease usage from rust and other languages in which "type" is a reserved
keyword.

- scx_atropos is renamed to scx_rusty and received significant updates to
improve scalability. Load metrics are now tracked in BPF and accessed only
as necessary from userspace.

- Misc example scheduler improvements including the usage of resizable BPF
.bss array, the introduction of SCX_BUG[_ON](), and timer CPU pinning in
scx_central.

- Improve Makefile and documentation for example schedulers.

v3 (https://lkml.kernel.org/r/[email protected]) -> v4:

- There aren't any significant changes to the sched_ext API even though we
kept experimenting heavily with a couple of BPF scheduler implementations,
indicating that the core API has reached a level of maturity.

- 0002-sched-Encapsulate-task-attribute-change-sequence-int.patch, which
implemented a custom guard scope for scheduler attribute changes, was dropped
as upstream is moving towards a more generic implementation.

- Build fixes with different CONFIG combinations.

- Core code cleanups and improvements including how the idle CPU is selected
and disabling ttwu_queue for tasks on SCX to avoid confusing BPF schedulers
expecting a ->select_cpu() call. See
0012-sched_ext-Implement-BPF-extensible-scheduler-class.patch for more
details.

- "_example" dropped from the example schedulers as the distinction between
the example-only and practically-useful isn't black-and-white. Instead,
each scheduler has detailed comments and there's also a README file.

- scx_central, scx_pair and scx_flatcg are moved into their own patches as
suggested by Josh Don.

- scx_atropos received substantial updates including fixes for bugs that
could cause temporary stalls and improvements in load balancing and wakeup
target CPU selection. For details, see
0034-sched_ext-Add-a-rust-userspace-hybrid-example-schedu.patch.

v2 (http://lkml.kernel.org/r/[email protected]) -> v3:

- ops.set_weight() added to allow BPF schedulers to track weight changes
without polling p->scx.weight.

- scx_bpf_task_cgroup() kfunc added to allow BPF scheduler to reliably
determine the current cpu cgroup under rq lock protection. This required
improving the kf_mask SCX operation verification mechanism and adding
0023-sched_ext-Track-tasks-that-are-subjects-of-the-in-fl.patch.

- Updated to use the latest BPF improvements including KF_RCU and the inline
iterator.

- scx_example_flatcg added to 0024-sched_ext-Add-cgroup-support.patch. It
uses the new BPF RB tree support to implement flattened cgroup hierarchy.

- A DSQ now also contains an rbtree so that it can be used to implement
vtime based scheduling among tasks sharing a DSQ conveniently and
efficiently. For more details, see
0029-sched_ext-Add-vtime-ordered-priority-queue-to-dispat.patch. All
eligible example schedulers are updated to default to weighted vtime
scheduling.

- atropos scheduler's userspace code is substantially restructured and
rewritten. The binary is renamed to scx_atropos and can auto-configure the
domains according to the cache topology.

- Various other example scheduler updates including scx_example_dummy being
renamed to scx_example_simple, the example schedulers defaulting to
enabling switch_all, and clarifying the performance expectations of each
example scheduler.

- A bunch of fixes and improvements. Please refer to each patch for details.

v1 (http://lkml.kernel.org/r/[email protected]) -> v2:

- Rebased on top of bpf/for-next - a5f6b9d577eb ("Merge branch 'Enable
struct_ops programs to be sleepable'"). There were several missing
features including generic cpumask helpers and sleepable struct_ops
operation support that v1 was working around. The rebase gets rid of all
SCX specific temporary helpers.

- Some kfunc helpers are context-sensitive and can only be called from
specific operations. v1 didn't restrict kfunc accesses, allowing them to be
misused, which could lead to crashes and other malfunctions. v2 makes more
kfuncs safe to be called from anywhere and implements per-task mask based
runtime access control for the rest. The longer-term plan is to make the
BPF verifier enforce these restrictions. Combined with the above, sans
mistakes and bugs, it shouldn't be possible to crash the machine through
SCX and its helpers.

- Core-sched support. While v1 implemented the pick_task operation, there
were multiple missing pieces for working core-sched support. v2 adds
0027-sched_ext-Implement-core-sched-support.patch. SCX by default
implements global FIFO ordering and allows the BPF schedulers to implement
custom ordering via scx_ops.core_sched_before(). scx_example_qmap is
updated so that the five queues' relative priorities are correctly
reflected when core-sched is enabled.

- Dropped balance_scx_on_up() which was called from put_prev_task_balance().
UP support is now contained in SCX proper.

- 0002-sched-Encapsulate-task-attribute-change-sequence-int.patch adds
SCHED_CHANGE_BLOCK() which encapsulates the preparation and restoration
sequences used for task attribute changes. For SCX, this replaces
sched_deq_and_put_task() and sched_enq_and_set_task() from v1.

- 0011-sched-Add-reason-to-sched_move_task.patch dropped from v1. SCX now
distinguishes cgroup and autogroup tg's using task_group_is_autogroup().

- Other misc changes including fixes for bugs that Julia Lawall noticed and
patch descriptions updates with more details on how the introduced changes
are going to be used.

- MAINTAINERS entries added.

The following are discussion points which were raised but didn't result in
code changes in this iteration.

- There were discussions around exposing __setscheduler_prio() and, in v2,
SCHED_CHANGE_BLOCK() in kernel/sched/sched.h. Switching scheduler
implementations is innate for SCX. At the very least, it needs to be able
to turn on and off the BPF scheduler which requires something equivalent
to SCHED_CHANGE_BLOCK(). The use of __setscheduler_prio() depends on the
behavior we want to present to userspace. The current one of using CFS as
the fallback when BPF scheduler is not available seems more friendly and
less error-prone to other options.

- Another discussion point was around for_each_active_class() and friends
which skip over CFS or SCX when it's known that the sched_class must be
empty. I left it as-is for now as it seems to be cleaner and more robust
than trying to plug each operation, which may add unnecessary overhead.

Thanks.

--
tejun


2024-05-01 15:14:07

by Tejun Heo

Subject: [PATCH 03/39] sched: Allow sched_cgroup_fork() to fail and introduce sched_cancel_fork()

A new BPF extensible sched_class will need more control over the forking
process. It wants to be able to fail from sched_cgroup_fork() after the new
task's sched_task_group is initialized so that the loaded BPF program can
prepare the task once its cgroup association is established and reject the
fork if e.g. an allocation fails.

Allow sched_cgroup_fork() to fail by making it return int instead of void
and adding sched_cancel_fork() to undo sched_fork() in the error path.

sched_cgroup_fork() doesn't fail yet and this patch shouldn't cause any
behavior changes.

v2: Patch description updated to detail the expected use.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
include/linux/sched/task.h | 3 ++-
kernel/fork.c | 15 ++++++++++-----
kernel/sched/core.c | 8 +++++++-
3 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index d362aacf9f89..4df2f9055587 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -63,7 +63,8 @@ extern asmlinkage void schedule_tail(struct task_struct *prev);
extern void init_idle(struct task_struct *idle, int cpu);

extern int sched_fork(unsigned long clone_flags, struct task_struct *p);
-extern void sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs);
+extern int sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs);
+extern void sched_cancel_fork(struct task_struct *p);
extern void sched_post_fork(struct task_struct *p);
extern void sched_dead(struct task_struct *p);

diff --git a/kernel/fork.c b/kernel/fork.c
index 39a5046c2f0b..02f12033db9c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2362,7 +2362,7 @@ __latent_entropy struct task_struct *copy_process(

retval = perf_event_init_task(p, clone_flags);
if (retval)
- goto bad_fork_cleanup_policy;
+ goto bad_fork_sched_cancel_fork;
retval = audit_alloc(p);
if (retval)
goto bad_fork_cleanup_perf;
@@ -2495,7 +2495,9 @@ __latent_entropy struct task_struct *copy_process(
* cgroup specific, it unconditionally needs to place the task on a
* runqueue.
*/
- sched_cgroup_fork(p, args);
+ retval = sched_cgroup_fork(p, args);
+ if (retval)
+ goto bad_fork_cancel_cgroup;

/*
* From this point on we must avoid any synchronous user-space
@@ -2541,13 +2543,13 @@ __latent_entropy struct task_struct *copy_process(
/* Don't start children in a dying pid namespace */
if (unlikely(!(ns_of_pid(pid)->pid_allocated & PIDNS_ADDING))) {
retval = -ENOMEM;
- goto bad_fork_cancel_cgroup;
+ goto bad_fork_core_free;
}

/* Let kill terminate clone/fork in the middle */
if (fatal_signal_pending(current)) {
retval = -EINTR;
- goto bad_fork_cancel_cgroup;
+ goto bad_fork_core_free;
}

/* No more failure paths after this point. */
@@ -2621,10 +2623,11 @@ __latent_entropy struct task_struct *copy_process(

return p;

-bad_fork_cancel_cgroup:
+bad_fork_core_free:
sched_core_free(p);
spin_unlock(&current->sighand->siglock);
write_unlock_irq(&tasklist_lock);
+bad_fork_cancel_cgroup:
cgroup_cancel_fork(p, args);
bad_fork_put_pidfd:
if (clone_flags & CLONE_PIDFD) {
@@ -2663,6 +2666,8 @@ __latent_entropy struct task_struct *copy_process(
audit_free(p);
bad_fork_cleanup_perf:
perf_event_free_task(p);
+bad_fork_sched_cancel_fork:
+ sched_cancel_fork(p);
bad_fork_cleanup_policy:
lockdep_free_task(p);
#ifdef CONFIG_NUMA
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c166c506244f..b12b1b7405fd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4817,7 +4817,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
return 0;
}

-void sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs)
+int sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs)
{
unsigned long flags;

@@ -4844,6 +4844,12 @@ void sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs)
if (p->sched_class->task_fork)
p->sched_class->task_fork(p);
raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+
+ return 0;
+}
+
+void sched_cancel_fork(struct task_struct *p)
+{
}

void sched_post_fork(struct task_struct *p)
--
2.44.0


2024-05-01 15:14:16

by Tejun Heo

Subject: [PATCH 04/39] sched: Add sched_class->reweight_task()

Currently, during a task weight change, sched core directly calls
reweight_task() defined in fair.c if @p is on CFS. Let's make it a proper
sched_class operation instead. CFS's reweight_task() is renamed to
reweight_task_fair() and now called through sched_class.

While it turns a direct call into an indirect one, set_load_weight() isn't
called from a hot path and this change shouldn't cause any noticeable
difference. This will be used to implement reweight_task for a new BPF
extensible sched_class so that it can keep its cached task weight
up-to-date.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
kernel/sched/core.c | 4 ++--
kernel/sched/fair.c | 3 ++-
kernel/sched/sched.h | 4 ++--
3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b12b1b7405fd..4b9cb2228b04 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1342,8 +1342,8 @@ static void set_load_weight(struct task_struct *p, bool update_load)
* SCHED_OTHER tasks have to update their load when changing their
* weight
*/
- if (update_load && p->sched_class == &fair_sched_class) {
- reweight_task(p, prio);
+ if (update_load && p->sched_class->reweight_task) {
+ p->sched_class->reweight_task(task_rq(p), p, prio);
} else {
load->weight = scale_load(sched_prio_to_weight[prio]);
load->inv_weight = sched_prio_to_wmult[prio];
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 03be0d1330a6..5d7cffee1a4e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3835,7 +3835,7 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
}
}

-void reweight_task(struct task_struct *p, int prio)
+static void reweight_task_fair(struct rq *rq, struct task_struct *p, int prio)
{
struct sched_entity *se = &p->se;
struct cfs_rq *cfs_rq = cfs_rq_of(se);
@@ -13135,6 +13135,7 @@ DEFINE_SCHED_CLASS(fair) = {
.task_tick = task_tick_fair,
.task_fork = task_fork_fair,

+ .reweight_task = reweight_task_fair,
.prio_changed = prio_changed_fair,
.switched_from = switched_from_fair,
.switched_to = switched_to_fair,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d2242679239e..8e23f19e8096 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2303,6 +2303,8 @@ struct sched_class {
*/
void (*switched_from)(struct rq *this_rq, struct task_struct *task);
void (*switched_to) (struct rq *this_rq, struct task_struct *task);
+ void (*reweight_task)(struct rq *this_rq, struct task_struct *task,
+ int newprio);
void (*prio_changed) (struct rq *this_rq, struct task_struct *task,
int oldprio);

@@ -2460,8 +2462,6 @@ extern void init_sched_dl_class(void);
extern void init_sched_rt_class(void);
extern void init_sched_fair_class(void);

-extern void reweight_task(struct task_struct *p, int prio);
-
extern void resched_curr(struct rq *rq);
extern void resched_cpu(int cpu);

--
2.44.0


2024-05-01 15:14:31

by Tejun Heo

Subject: [PATCH 05/39] sched: Add sched_class->switching_to() and expose check_class_changing/changed()

When a task switches to a new sched_class, the prev and new classes are
notified through ->switched_from() and ->switched_to(), respectively, after
the switching is done.

A new BPF extensible sched_class will have callbacks that allow the BPF
scheduler to keep track of relevant task states (like priority and cpumask).
Those callbacks aren't called while a task is on a different sched_class.
When a task comes back, we want to tell the BPF progs the up-to-date state
before the task gets enqueued, so we need a hook which is called before the
switching is committed.

This patch adds ->switching_to() which is called during sched_class switch
through check_class_changing() before the task is restored. Also, this patch
exposes check_class_changing/changed() in kernel/sched/sched.h. They will be
used by the new BPF extensible sched_class to implement implicit sched_class
switching which is used e.g. when falling back to CFS when the BPF scheduler
fails or unloads.

This is a prep patch and doesn't cause any behavior changes. The new
operation and exposed functions aren't used yet.

v2: Improve patch description w/ details on planned use.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
kernel/sched/core.c | 19 ++++++++++++++++---
kernel/sched/sched.h | 7 +++++++
2 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4b9cb2228b04..311efc00da63 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2214,6 +2214,17 @@ inline int task_curr(const struct task_struct *p)
return cpu_curr(task_cpu(p)) == p;
}

+/*
+ * ->switching_to() is called with the pi_lock and rq_lock held and must not
+ * mess with locking.
+ */
+void check_class_changing(struct rq *rq, struct task_struct *p,
+ const struct sched_class *prev_class)
+{
+ if (prev_class != p->sched_class && p->sched_class->switching_to)
+ p->sched_class->switching_to(rq, p);
+}
+
/*
* switched_from, switched_to and prio_changed must _NOT_ drop rq->lock,
* use the balance_callback list if you want balancing.
@@ -2221,9 +2232,9 @@ inline int task_curr(const struct task_struct *p)
* this means any call to check_class_changed() must be followed by a call to
* balance_callback().
*/
-static inline void check_class_changed(struct rq *rq, struct task_struct *p,
- const struct sched_class *prev_class,
- int oldprio)
+void check_class_changed(struct rq *rq, struct task_struct *p,
+ const struct sched_class *prev_class,
+ int oldprio)
{
if (prev_class != p->sched_class) {
if (prev_class->switched_from)
@@ -7253,6 +7264,7 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
}

__setscheduler_prio(p, prio);
+ check_class_changing(rq, p, prev_class);

if (queued)
enqueue_task(rq, p, queue_flag);
@@ -7898,6 +7910,7 @@ static int __sched_setscheduler(struct task_struct *p,
__setscheduler_prio(p, newprio);
}
__setscheduler_uclamp(p, attr);
+ check_class_changing(rq, p, prev_class);

if (queued) {
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8e23f19e8096..99e292368d11 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2301,6 +2301,7 @@ struct sched_class {
* cannot assume the switched_from/switched_to pair is serialized by
* rq->lock. They are however serialized by p->pi_lock.
*/
+ void (*switching_to) (struct rq *this_rq, struct task_struct *task);
void (*switched_from)(struct rq *this_rq, struct task_struct *task);
void (*switched_to) (struct rq *this_rq, struct task_struct *task);
void (*reweight_task)(struct rq *this_rq, struct task_struct *task,
@@ -2540,6 +2541,12 @@ static inline void sub_nr_running(struct rq *rq, unsigned count)
extern void activate_task(struct rq *rq, struct task_struct *p, int flags);
extern void deactivate_task(struct rq *rq, struct task_struct *p, int flags);

+extern void check_class_changing(struct rq *rq, struct task_struct *p,
+ const struct sched_class *prev_class);
+extern void check_class_changed(struct rq *rq, struct task_struct *p,
+ const struct sched_class *prev_class,
+ int oldprio);
+
extern void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags);

#ifdef CONFIG_PREEMPT_RT
--
2.44.0


2024-05-01 15:15:04

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 07/39] sched: Expose css_tg() and __setscheduler_prio()

These will be used by a new BPF extensible sched_class.

css_tg() will be used in the init and exit paths to visit all task_groups by
walking cgroups.

__setscheduler_prio() is used to pick the sched_class matching the current
prio of the task. For the new BPF extensible sched_class, the mapping from
the task configuration to sched_class isn't static and depends on a few
factors - e.g. whether the BPF progs implementing the scheduler are loaded
and in a serviceable state. That mapping logic will be added to
__setscheduler_prio().

When the BPF scheduler progs get loaded and unloaded, the mapping changes
and the new sched_class will walk the tasks applying the new mapping using
__setscheduler_prio().
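
As a rough sketch of the direction (hypothetical - the actual logic lands
with the sched_ext patches later in the series, and task_should_scx() is an
assumed helper name):

    void __setscheduler_prio(struct task_struct *p, int prio)
    {
            if (dl_prio(prio))
                    p->sched_class = &dl_sched_class;
            else if (rt_prio(prio))
                    p->sched_class = &rt_sched_class;
            else if (task_should_scx(p))        /* assumed helper */
                    p->sched_class = &ext_sched_class;
            else
                    p->sched_class = &fair_sched_class;
    }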

v3: Dropped SCHED_CHANGE_BLOCK() as upstream is adding more generic cleanup
mechanism.

v2: Expose SCHED_CHANGE_BLOCK() too and update the description.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
Reported-by: kernel test robot <[email protected]>
---
kernel/sched/core.c | 7 +------
kernel/sched/sched.h | 7 +++++++
2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9b60df944263..987209c0e672 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7098,7 +7098,7 @@ int default_wake_function(wait_queue_entry_t *curr, unsigned mode, int wake_flag
}
EXPORT_SYMBOL(default_wake_function);

-static void __setscheduler_prio(struct task_struct *p, int prio)
+void __setscheduler_prio(struct task_struct *p, int prio)
{
if (dl_prio(prio))
p->sched_class = &dl_sched_class;
@@ -10542,11 +10542,6 @@ void sched_move_task(struct task_struct *tsk)
}
}

-static inline struct task_group *css_tg(struct cgroup_subsys_state *css)
-{
- return css ? container_of(css, struct task_group, css) : NULL;
-}
-
static struct cgroup_subsys_state *
cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
{
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 24b3d120700b..7e0de4cb5a52 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -477,6 +477,11 @@ static inline int walk_tg_tree(tg_visitor down, tg_visitor up, void *data)
return walk_tg_tree_from(&root_task_group, down, up, data);
}

+static inline struct task_group *css_tg(struct cgroup_subsys_state *css)
+{
+ return css ? container_of(css, struct task_group, css) : NULL;
+}
+
extern int tg_nop(struct task_group *tg, void *data);

#ifdef CONFIG_FAIR_GROUP_SCHED
@@ -2481,6 +2486,8 @@ extern void init_sched_dl_class(void);
extern void init_sched_rt_class(void);
extern void init_sched_fair_class(void);

+extern void __setscheduler_prio(struct task_struct *p, int prio);
+
extern void resched_curr(struct rq *rq);
extern void resched_cpu(int cpu);

--
2.44.0


2024-05-01 15:15:15

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 06/39] sched: Factor out cgroup weight conversion functions

Factor out sched_weight_from/to_cgroup() which convert between scheduler
shares and cgroup weight. No functional change. The factored out functions
will be used by a new BPF extensible sched_class so that the weights can be
exposed to the BPF programs in a way which is consistent with cgroup weights
and
easier to interpret.

The weight conversions will be used regardless of cgroup usage. It's just
borrowing the cgroup weight range as it's more intuitive.
CGROUP_WEIGHT_MIN/DFL/MAX constants are moved outside CONFIG_CGROUPS so that
the conversion helpers can always be defined.
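
For example, plugging boundary values into the helpers added below (1024 is
the scheduler's default weight):

    sched_weight_from_cgroup(100)   == 1024     /* CGROUP_WEIGHT_DFL */
    sched_weight_to_cgroup(1024)    == 100
    sched_weight_from_cgroup(1)     == 10       /* CGROUP_WEIGHT_MIN */
    sched_weight_from_cgroup(10000) == 102400   /* CGROUP_WEIGHT_MAX */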

v2: The helpers are now defined regardless of CONFIG_CGROUPS.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
include/linux/cgroup.h | 4 ++--
kernel/sched/core.c | 28 +++++++++++++---------------
kernel/sched/sched.h | 18 ++++++++++++++++++
3 files changed, 33 insertions(+), 17 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 509e2e8a1d35..32679fcff0a7 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -29,8 +29,6 @@

struct kernel_clone_args;

-#ifdef CONFIG_CGROUPS
-
/*
* All weight knobs on the default hierarchy should use the following min,
* default and max values. The default value is the logarithmic center of
@@ -40,6 +38,8 @@ struct kernel_clone_args;
#define CGROUP_WEIGHT_DFL 100
#define CGROUP_WEIGHT_MAX 10000

+#ifdef CONFIG_CGROUPS
+
enum {
CSS_TASK_ITER_PROCS = (1U << 0), /* walk only threadgroup leaders */
CSS_TASK_ITER_THREADED = (1U << 1), /* walk all threaded css_sets in the domain */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 311efc00da63..9b60df944263 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11313,29 +11313,27 @@ static int cpu_local_stat_show(struct seq_file *sf,
}

#ifdef CONFIG_FAIR_GROUP_SCHED
+
+static unsigned long tg_weight(struct task_group *tg)
+{
+ return scale_load_down(tg->shares);
+}
+
static u64 cpu_weight_read_u64(struct cgroup_subsys_state *css,
struct cftype *cft)
{
- struct task_group *tg = css_tg(css);
- u64 weight = scale_load_down(tg->shares);
-
- return DIV_ROUND_CLOSEST_ULL(weight * CGROUP_WEIGHT_DFL, 1024);
+ return sched_weight_to_cgroup(tg_weight(css_tg(css)));
}

static int cpu_weight_write_u64(struct cgroup_subsys_state *css,
- struct cftype *cft, u64 weight)
+ struct cftype *cft, u64 cgrp_weight)
{
- /*
- * cgroup weight knobs should use the common MIN, DFL and MAX
- * values which are 1, 100 and 10000 respectively. While it loses
- * a bit of range on both ends, it maps pretty well onto the shares
- * value used by scheduler and the round-trip conversions preserve
- * the original value over the entire range.
- */
- if (weight < CGROUP_WEIGHT_MIN || weight > CGROUP_WEIGHT_MAX)
+ unsigned long weight;
+
+ if (cgrp_weight < CGROUP_WEIGHT_MIN || cgrp_weight > CGROUP_WEIGHT_MAX)
return -ERANGE;

- weight = DIV_ROUND_CLOSEST_ULL(weight * 1024, CGROUP_WEIGHT_DFL);
+ weight = sched_weight_from_cgroup(cgrp_weight);

return sched_group_set_shares(css_tg(css), scale_load(weight));
}
@@ -11343,7 +11341,7 @@ static int cpu_weight_write_u64(struct cgroup_subsys_state *css,
static s64 cpu_weight_nice_read_s64(struct cgroup_subsys_state *css,
struct cftype *cft)
{
- unsigned long weight = scale_load_down(css_tg(css)->shares);
+ unsigned long weight = tg_weight(css_tg(css));
int last_delta = INT_MAX;
int prio, delta;

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 99e292368d11..24b3d120700b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -221,6 +221,24 @@ static inline void update_avg(u64 *avg, u64 sample)
#define shr_bound(val, shift) \
(val >> min_t(typeof(shift), shift, BITS_PER_TYPE(typeof(val)) - 1))

+/*
+ * cgroup weight knobs should use the common MIN, DFL and MAX values which are
+ * 1, 100 and 10000 respectively. While it loses a bit of range on both ends, it
+ * maps pretty well onto the shares value used by scheduler and the round-trip
+ * conversions preserve the original value over the entire range.
+ */
+static inline unsigned long sched_weight_from_cgroup(unsigned long cgrp_weight)
+{
+ return DIV_ROUND_CLOSEST_ULL(cgrp_weight * 1024, CGROUP_WEIGHT_DFL);
+}
+
+static inline unsigned long sched_weight_to_cgroup(unsigned long weight)
+{
+ return clamp_t(unsigned long,
+ DIV_ROUND_CLOSEST_ULL(weight * CGROUP_WEIGHT_DFL, 1024),
+ CGROUP_WEIGHT_MIN, CGROUP_WEIGHT_MAX);
+}
+
/*
* !! For sched_setattr_nocheck() (kernel) only !!
*
--
2.44.0


2024-05-01 15:15:21

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 08/39] sched: Enumerate CPU cgroup file types

Rename cpu[_legacy]_files to cpu[_legacy]_cftypes for clarity and add
cpu_cftype_id which enumerates every cgroup2 interface file type. This
doesn't make any functional difference now. The enums will be used to access
specific cftypes by a new BPF extensible sched_class to selectively show and
hide CPU controller interface files depending on the capability of the
currently loaded BPF scheduler progs.
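
For example, the loader could then do something along these lines
(hypothetical sketch - both cpu_cftype_set_visible() and the ops field are
assumptions, not part of this patch):

    /* hide cpu.weight if the loaded progs don't implement weights */
    if (!scx_ops.cgroup_set_weight)
            cpu_cftype_set_visible(CPU_CFTYPE_WEIGHT, false);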

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
kernel/sched/core.c | 22 +++++++++++-----------
kernel/sched/sched.h | 21 +++++++++++++++++++++
2 files changed, 32 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 987209c0e672..e48af9fbbd71 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11196,7 +11196,7 @@ static int cpu_idle_write_s64(struct cgroup_subsys_state *css,
}
#endif

-static struct cftype cpu_legacy_files[] = {
+static struct cftype cpu_legacy_cftypes[] = {
#ifdef CONFIG_FAIR_GROUP_SCHED
{
.name = "shares",
@@ -11425,21 +11425,21 @@ static ssize_t cpu_max_write(struct kernfs_open_file *of,
}
#endif

-static struct cftype cpu_files[] = {
+struct cftype cpu_cftypes[CPU_CFTYPE_CNT + 1] = {
#ifdef CONFIG_FAIR_GROUP_SCHED
- {
+ [CPU_CFTYPE_WEIGHT] = {
.name = "weight",
.flags = CFTYPE_NOT_ON_ROOT,
.read_u64 = cpu_weight_read_u64,
.write_u64 = cpu_weight_write_u64,
},
- {
+ [CPU_CFTYPE_WEIGHT_NICE] = {
.name = "weight.nice",
.flags = CFTYPE_NOT_ON_ROOT,
.read_s64 = cpu_weight_nice_read_s64,
.write_s64 = cpu_weight_nice_write_s64,
},
- {
+ [CPU_CFTYPE_IDLE] = {
.name = "idle",
.flags = CFTYPE_NOT_ON_ROOT,
.read_s64 = cpu_idle_read_s64,
@@ -11447,13 +11447,13 @@ static struct cftype cpu_files[] = {
},
#endif
#ifdef CONFIG_CFS_BANDWIDTH
- {
+ [CPU_CFTYPE_MAX] = {
.name = "max",
.flags = CFTYPE_NOT_ON_ROOT,
.seq_show = cpu_max_show,
.write = cpu_max_write,
},
- {
+ [CPU_CFTYPE_MAX_BURST] = {
.name = "max.burst",
.flags = CFTYPE_NOT_ON_ROOT,
.read_u64 = cpu_cfs_burst_read_u64,
@@ -11461,13 +11461,13 @@ static struct cftype cpu_files[] = {
},
#endif
#ifdef CONFIG_UCLAMP_TASK_GROUP
- {
+ [CPU_CFTYPE_UCLAMP_MIN] = {
.name = "uclamp.min",
.flags = CFTYPE_NOT_ON_ROOT,
.seq_show = cpu_uclamp_min_show,
.write = cpu_uclamp_min_write,
},
- {
+ [CPU_CFTYPE_UCLAMP_MAX] = {
.name = "uclamp.max",
.flags = CFTYPE_NOT_ON_ROOT,
.seq_show = cpu_uclamp_max_show,
@@ -11488,8 +11488,8 @@ struct cgroup_subsys cpu_cgrp_subsys = {
.can_attach = cpu_cgroup_can_attach,
#endif
.attach = cpu_cgroup_attach,
- .legacy_cftypes = cpu_legacy_files,
- .dfl_cftypes = cpu_files,
+ .legacy_cftypes = cpu_legacy_cftypes,
+ .dfl_cftypes = cpu_cftypes,
.early_init = true,
.threaded = true,
};
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7e0de4cb5a52..0b6a34ba2457 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3505,4 +3505,25 @@ static inline void init_sched_mm_cid(struct task_struct *t) { }
extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);

+#ifdef CONFIG_CGROUP_SCHED
+enum cpu_cftype_id {
+#ifdef CONFIG_FAIR_GROUP_SCHED
+ CPU_CFTYPE_WEIGHT,
+ CPU_CFTYPE_WEIGHT_NICE,
+ CPU_CFTYPE_IDLE,
+#endif
+#ifdef CONFIG_CFS_BANDWIDTH
+ CPU_CFTYPE_MAX,
+ CPU_CFTYPE_MAX_BURST,
+#endif
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ CPU_CFTYPE_UCLAMP_MIN,
+ CPU_CFTYPE_UCLAMP_MAX,
+#endif
+ CPU_CFTYPE_CNT,
+};
+
+extern struct cftype cpu_cftypes[CPU_CFTYPE_CNT + 1];
+#endif /* CONFIG_CGROUP_SCHED */
+
#endif /* _KERNEL_SCHED_SCHED_H */
--
2.44.0


2024-05-01 15:15:53

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 10/39] sched: Factor out update_other_load_avgs() from __update_blocked_others()

RT, DL, thermal and irq load and utilization metrics need to be decayed and
updated periodically and before consumption to keep the numbers reasonable.
This is currently done from __update_blocked_others() as a part of the fair
class load balance path. Let's factor it out to update_other_load_avgs().
Pure refactor. No functional changes.

This will be used by the new BPF extensible scheduling class to ensure that
the above metrics are properly maintained.
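
For example, a later patch in this series calls it from the sched_ext tick
path to keep the non-fair metrics fresh while a BPF scheduler is running:

    void scx_tick(struct rq *rq)
    {
            ...
            update_other_load_avgs(rq);
    }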

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
---
kernel/sched/core.c | 19 +++++++++++++++++++
kernel/sched/fair.c | 16 +++-------------
kernel/sched/sched.h | 3 +++
3 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 90b505fbb488..7542a39f1fde 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7486,6 +7486,25 @@ int sched_core_idle_cpu(int cpu)
#endif

#ifdef CONFIG_SMP
+/*
+ * Load avg and utilization metrics need to be updated periodically and before
+ * consumption. This function updates the metrics for all subsystems except for
+ * the fair class. @rq must be locked and have its clock updated.
+ */
+bool update_other_load_avgs(struct rq *rq)
+{
+ u64 now = rq_clock_pelt(rq);
+ const struct sched_class *curr_class = rq->curr->sched_class;
+ unsigned long thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
+
+ lockdep_assert_rq_held(rq);
+
+ return update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) |
+ update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) |
+ update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure) |
+ update_irq_load_avg(rq, 0);
+}
+
/*
* This function computes an effective utilization for the given CPU, to be
* used for frequency selection given the linear relation: f = u * f_max.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8032256d3972..51301ae13725 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9283,28 +9283,18 @@ static inline void update_blocked_load_status(struct rq *rq, bool has_blocked) {

static bool __update_blocked_others(struct rq *rq, bool *done)
{
- const struct sched_class *curr_class;
- u64 now = rq_clock_pelt(rq);
- unsigned long thermal_pressure;
- bool decayed;
+ bool updated;

/*
* update_load_avg() can call cpufreq_update_util(). Make sure that RT,
* DL and IRQ signals have been updated before updating CFS.
*/
- curr_class = rq->curr->sched_class;
-
- thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
-
- decayed = update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) |
- update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) |
- update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure) |
- update_irq_load_avg(rq, 0);
+ updated = update_other_load_avgs(rq);

if (others_have_blocked(rq))
*done = false;

- return decayed;
+ return updated;
}

#ifdef CONFIG_FAIR_GROUP_SCHED
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bcc8056acadb..ccf2fff0e2ae 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3042,6 +3042,7 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
#endif

#ifdef CONFIG_SMP
+bool update_other_load_avgs(struct rq *rq);
unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
unsigned long *min,
unsigned long *max);
@@ -3084,6 +3085,8 @@ static inline unsigned long cpu_util_rt(struct rq *rq)
{
return READ_ONCE(rq->avg_rt.util_avg);
}
+#else
+static inline bool update_other_load_avgs(struct rq *rq) { return false; }
#endif

#ifdef CONFIG_UCLAMP_TASK
--
2.44.0


2024-05-01 15:16:04

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 11/39] cpufreq_schedutil: Refactor sugov_cpu_is_busy()

sugov_cpu_is_busy() is used to avoid decreasing the performance level while
the CPU is busy, and is called by sugov_update_single_freq() and
sugov_update_single_perf(). Both callers repeat the same pattern: first test
for the uclamp exception, then for CPU busyness. Let's refactor so that the
tests aren't repeated.

The new helper is named sugov_hold_freq() and tests both the uclamp
exception and CPU busyness. No functional changes. This will make adding
more exception conditions easier.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
---
kernel/sched/cpufreq_schedutil.c | 38 +++++++++++++++-----------------
1 file changed, 18 insertions(+), 20 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index eece6244f9d2..972b7dd65af2 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -325,16 +325,27 @@ static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
}

#ifdef CONFIG_NO_HZ_COMMON
-static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
+static bool sugov_hold_freq(struct sugov_cpu *sg_cpu)
{
- unsigned long idle_calls = tick_nohz_get_idle_calls_cpu(sg_cpu->cpu);
- bool ret = idle_calls == sg_cpu->saved_idle_calls;
+ unsigned long idle_calls;
+ bool ret;
+
+ /* if capped by uclamp_max, always update to be in compliance */
+ if (uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)))
+ return false;
+
+ /*
+ * Maintain the frequency if the CPU has not been idle recently, as
+ * reduction is likely to be premature.
+ */
+ idle_calls = tick_nohz_get_idle_calls_cpu(sg_cpu->cpu);
+ ret = idle_calls == sg_cpu->saved_idle_calls;

sg_cpu->saved_idle_calls = idle_calls;
return ret;
}
#else
-static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
+static inline bool sugov_hold_freq(struct sugov_cpu *sg_cpu) { return false; }
#endif /* CONFIG_NO_HZ_COMMON */

/*
@@ -382,14 +393,8 @@ static void sugov_update_single_freq(struct update_util_data *hook, u64 time,
return;

next_f = get_next_freq(sg_policy, sg_cpu->util, max_cap);
- /*
- * Do not reduce the frequency if the CPU has not been idle
- * recently, as the reduction is likely to be premature then.
- *
- * Except when the rq is capped by uclamp_max.
- */
- if (!uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)) &&
- sugov_cpu_is_busy(sg_cpu) && next_f < sg_policy->next_freq &&
+
+ if (sugov_hold_freq(sg_cpu) && next_f < sg_policy->next_freq &&
!sg_policy->need_freq_update) {
next_f = sg_policy->next_freq;

@@ -436,14 +441,7 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
if (!sugov_update_single_common(sg_cpu, time, max_cap, flags))
return;

- /*
- * Do not reduce the target performance level if the CPU has not been
- * idle recently, as the reduction is likely to be premature then.
- *
- * Except when the rq is capped by uclamp_max.
- */
- if (!uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)) &&
- sugov_cpu_is_busy(sg_cpu) && sg_cpu->util < prev_util)
+ if (sugov_hold_freq(sg_cpu) && sg_cpu->util < prev_util)
sg_cpu->util = prev_util;

cpufreq_driver_adjust_perf(sg_cpu->cpu, sg_cpu->bw_min,
--
2.44.0


2024-05-01 15:16:06

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 09/39] sched: Add @reason to sched_class->rq_{on|off}line()

->rq_{on|off}line are called either during CPU hotplug or cpuset partition
updates. A planned BPF extensible sched_class wants to tell the BPF
scheduler progs about CPU hotplug events in a way that's synchronized with
rq state changes.

As the BPF scheduler progs aren't necessarily affected by cpuset partition
updates, we need a way to distinguish the two types of events. Let's add an
argument to tell them apart.
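
For instance, the planned BPF class can then filter on the reason
(hypothetical sketch - the callback body is an assumption):

    static void rq_online_scx(struct rq *rq, enum rq_onoff_reason reason)
    {
            if (reason == RQ_ONOFF_HOTPLUG)
                    ...;    /* notify the BPF scheduler of the hotplug event */
    }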

v2: Patch description updated to detail the expected use.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
kernel/sched/core.c | 12 ++++++------
kernel/sched/deadline.c | 4 ++--
kernel/sched/fair.c | 4 ++--
kernel/sched/rt.c | 4 ++--
kernel/sched/sched.h | 13 +++++++++----
kernel/sched/topology.c | 4 ++--
6 files changed, 23 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e48af9fbbd71..90b505fbb488 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9593,7 +9593,7 @@ static inline void balance_hotplug_wait(void)

#endif /* CONFIG_HOTPLUG_CPU */

-void set_rq_online(struct rq *rq)
+void set_rq_online(struct rq *rq, enum rq_onoff_reason reason)
{
if (!rq->online) {
const struct sched_class *class;
@@ -9603,12 +9603,12 @@ void set_rq_online(struct rq *rq)

for_each_class(class) {
if (class->rq_online)
- class->rq_online(rq);
+ class->rq_online(rq, reason);
}
}
}

-void set_rq_offline(struct rq *rq)
+void set_rq_offline(struct rq *rq, enum rq_onoff_reason reason)
{
if (rq->online) {
const struct sched_class *class;
@@ -9616,7 +9616,7 @@ void set_rq_offline(struct rq *rq)
update_rq_clock(rq);
for_each_class(class) {
if (class->rq_offline)
- class->rq_offline(rq);
+ class->rq_offline(rq, reason);
}

cpumask_clear_cpu(rq->cpu, rq->rd->online);
@@ -9712,7 +9712,7 @@ int sched_cpu_activate(unsigned int cpu)
rq_lock_irqsave(rq, &rf);
if (rq->rd) {
BUG_ON(!cpumask_test_cpu(cpu, rq->rd->span));
- set_rq_online(rq);
+ set_rq_online(rq, RQ_ONOFF_HOTPLUG);
}
rq_unlock_irqrestore(rq, &rf);

@@ -9756,7 +9756,7 @@ int sched_cpu_deactivate(unsigned int cpu)
rq_lock_irqsave(rq, &rf);
if (rq->rd) {
BUG_ON(!cpumask_test_cpu(cpu, rq->rd->span));
- set_rq_offline(rq);
+ set_rq_offline(rq, RQ_ONOFF_HOTPLUG);
}
rq_unlock_irqrestore(rq, &rf);

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index a04a436af8cc..010d1dc5f918 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2607,7 +2607,7 @@ static void set_cpus_allowed_dl(struct task_struct *p,
}

/* Assumes rq->lock is held */
-static void rq_online_dl(struct rq *rq)
+static void rq_online_dl(struct rq *rq, enum rq_onoff_reason reason)
{
if (rq->dl.overloaded)
dl_set_overload(rq);
@@ -2618,7 +2618,7 @@ static void rq_online_dl(struct rq *rq)
}

/* Assumes rq->lock is held */
-static void rq_offline_dl(struct rq *rq)
+static void rq_offline_dl(struct rq *rq, enum rq_onoff_reason reason)
{
if (rq->dl.overloaded)
dl_clear_overload(rq);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5d7cffee1a4e..8032256d3972 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12446,14 +12446,14 @@ void trigger_load_balance(struct rq *rq)
nohz_balancer_kick(rq);
}

-static void rq_online_fair(struct rq *rq)
+static void rq_online_fair(struct rq *rq, enum rq_onoff_reason reason)
{
update_sysctl();

update_runtime_enabled(rq);
}

-static void rq_offline_fair(struct rq *rq)
+static void rq_offline_fair(struct rq *rq, enum rq_onoff_reason reason)
{
update_sysctl();

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 3261b067b67e..8620474d117d 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2426,7 +2426,7 @@ static void task_woken_rt(struct rq *rq, struct task_struct *p)
}

/* Assumes rq->lock is held */
-static void rq_online_rt(struct rq *rq)
+static void rq_online_rt(struct rq *rq, enum rq_onoff_reason reason)
{
if (rq->rt.overloaded)
rt_set_overload(rq);
@@ -2437,7 +2437,7 @@ static void rq_online_rt(struct rq *rq)
}

/* Assumes rq->lock is held */
-static void rq_offline_rt(struct rq *rq)
+static void rq_offline_rt(struct rq *rq, enum rq_onoff_reason reason)
{
if (rq->rt.overloaded)
rt_clear_overload(rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0b6a34ba2457..bcc8056acadb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2271,6 +2271,11 @@ extern const u32 sched_prio_to_wmult[40];

#define RETRY_TASK ((void *)-1UL)

+enum rq_onoff_reason {
+ RQ_ONOFF_HOTPLUG, /* CPU is going on/offline */
+ RQ_ONOFF_TOPOLOGY, /* sched domain topology update */
+};
+
struct affinity_context {
const struct cpumask *new_mask;
struct cpumask *user_mask;
@@ -2309,8 +2314,8 @@ struct sched_class {

void (*set_cpus_allowed)(struct task_struct *p, struct affinity_context *ctx);

- void (*rq_online)(struct rq *rq);
- void (*rq_offline)(struct rq *rq);
+ void (*rq_online)(struct rq *rq, enum rq_onoff_reason reason);
+ void (*rq_offline)(struct rq *rq, enum rq_onoff_reason reason);

struct rq *(*find_lock_rq)(struct task_struct *p, struct rq *rq);
#endif
@@ -2853,8 +2858,8 @@ static inline void double_rq_unlock(struct rq *rq1, struct rq *rq2)
raw_spin_rq_unlock(rq1);
}

-extern void set_rq_online (struct rq *rq);
-extern void set_rq_offline(struct rq *rq);
+extern void set_rq_online (struct rq *rq, enum rq_onoff_reason reason);
+extern void set_rq_offline(struct rq *rq, enum rq_onoff_reason reason);
extern bool sched_smp_initialized;

#else /* CONFIG_SMP */
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 99ea5986038c..12501543c56d 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -497,7 +497,7 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
old_rd = rq->rd;

if (cpumask_test_cpu(rq->cpu, old_rd->online))
- set_rq_offline(rq);
+ set_rq_offline(rq, RQ_ONOFF_TOPOLOGY);

cpumask_clear_cpu(rq->cpu, old_rd->span);

@@ -515,7 +515,7 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)

cpumask_set_cpu(rq->cpu, rd->span);
if (cpumask_test_cpu(rq->cpu, cpu_active_mask))
- set_rq_online(rq);
+ set_rq_online(rq, RQ_ONOFF_TOPOLOGY);

rq_unlock_irqrestore(rq, &rf);

--
2.44.0


2024-05-01 15:16:18

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 12/39] sched: Add normal_policy()

A new BPF extensible sched_class will need to dynamically change how a task
picks its sched_class. For example, if the loaded BPF scheduler progs fail,
the tasks will be forced back on CFS even if the task's policy is set to the
new sched_class. To support such a mapping, add normal_policy() which wraps
testing for %SCHED_NORMAL. This doesn't cause any behavior changes.
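
A hypothetical sketch of where this is headed (the actual change arrives
with the sched_ext patches; treating %SCHED_EXT as normal whenever the task
is running under CFS is an assumption here):

    static inline int normal_policy(int policy)
    {
    #ifdef CONFIG_SCHED_CLASS_EXT
            if (policy == SCHED_EXT)
                    return true;    /* behaves as SCHED_NORMAL on CFS fallback */
    #endif
            return policy == SCHED_NORMAL;
    }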

v2: Update the description with more details on the expected use.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
kernel/sched/fair.c | 2 +-
kernel/sched/sched.h | 8 +++++++-
2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 51301ae13725..8a9b2e95d06b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8322,7 +8322,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
* Batch and idle tasks do not preempt non-idle tasks (their preemption
* is driven by the tick):
*/
- if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
+ if (unlikely(!normal_policy(p->policy)) || !sched_feat(WAKEUP_PREEMPTION))
return;

find_matching_se(&se, &pse);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ccf2fff0e2ae..11b4345d2638 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -171,9 +171,15 @@ static inline int idle_policy(int policy)
{
return policy == SCHED_IDLE;
}
+
+static inline int normal_policy(int policy)
+{
+ return policy == SCHED_NORMAL;
+}
+
static inline int fair_policy(int policy)
{
- return policy == SCHED_NORMAL || policy == SCHED_BATCH;
+ return normal_policy(policy) || policy == SCHED_BATCH;
}

static inline int rt_policy(int policy)
--
2.44.0


2024-05-01 15:17:03

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 13/39] sched_ext: Add boilerplate for extensible scheduler class

This adds dummy implementations of sched_ext interfaces which interact with
the scheduler core and hook them in the correct places. As they're all
dummies, this doesn't cause any behavior changes. This is split out to help
reviewing.

v2: balance_scx_on_up() dropped. This will be handled in sched_ext proper.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
include/linux/sched/ext.h | 12 ++++++++++++
kernel/fork.c | 2 ++
kernel/sched/core.c | 32 ++++++++++++++++++++++++--------
kernel/sched/cpufreq_schedutil.c | 4 +++-
kernel/sched/ext.h | 25 +++++++++++++++++++++++++
kernel/sched/idle.c | 2 ++
kernel/sched/sched.h | 2 ++
7 files changed, 70 insertions(+), 9 deletions(-)
create mode 100644 include/linux/sched/ext.h
create mode 100644 kernel/sched/ext.h

diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
new file mode 100644
index 000000000000..a05dfcf533b0
--- /dev/null
+++ b/include/linux/sched/ext.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_SCHED_EXT_H
+#define _LINUX_SCHED_EXT_H
+
+#ifdef CONFIG_SCHED_CLASS_EXT
+#error "NOT IMPLEMENTED YET"
+#else /* !CONFIG_SCHED_CLASS_EXT */
+
+static inline void sched_ext_free(struct task_struct *p) {}
+
+#endif /* CONFIG_SCHED_CLASS_EXT */
+#endif /* _LINUX_SCHED_EXT_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 02f12033db9c..6238b1ae3306 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -23,6 +23,7 @@
#include <linux/sched/task.h>
#include <linux/sched/task_stack.h>
#include <linux/sched/cputime.h>
+#include <linux/sched/ext.h>
#include <linux/seq_file.h>
#include <linux/rtmutex.h>
#include <linux/init.h>
@@ -970,6 +971,7 @@ void __put_task_struct(struct task_struct *tsk)
WARN_ON(refcount_read(&tsk->usage));
WARN_ON(tsk == current);

+ sched_ext_free(tsk);
io_uring_free(tsk);
cgroup_free(tsk);
task_numa_free(tsk, true);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7542a39f1fde..0231905ec827 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4767,6 +4767,8 @@ late_initcall(sched_core_sysctl_init);
*/
int sched_fork(unsigned long clone_flags, struct task_struct *p)
{
+ int ret;
+
__sched_fork(clone_flags, p);
/*
* We mark the process as NEW here. This guarantees that
@@ -4803,12 +4805,16 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
p->sched_reset_on_fork = 0;
}

- if (dl_prio(p->prio))
- return -EAGAIN;
- else if (rt_prio(p->prio))
+ scx_pre_fork(p);
+
+ if (dl_prio(p->prio)) {
+ ret = -EAGAIN;
+ goto out_cancel;
+ } else if (rt_prio(p->prio)) {
p->sched_class = &rt_sched_class;
- else
+ } else {
p->sched_class = &fair_sched_class;
+ }

init_entity_runnable_average(&p->se);

@@ -4826,6 +4832,10 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
RB_CLEAR_NODE(&p->pushable_dl_tasks);
#endif
return 0;
+
+out_cancel:
+ scx_cancel_fork(p);
+ return ret;
}

int sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs)
@@ -4856,16 +4866,18 @@ int sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs)
p->sched_class->task_fork(p);
raw_spin_unlock_irqrestore(&p->pi_lock, flags);

- return 0;
+ return scx_fork(p);
}

void sched_cancel_fork(struct task_struct *p)
{
+ scx_cancel_fork(p);
}

void sched_post_fork(struct task_struct *p)
{
uclamp_post_fork(p);
+ scx_post_fork(p);
}

unsigned long to_ratio(u64 period, u64 runtime)
@@ -6017,7 +6029,7 @@ static void put_prev_task_balance(struct rq *rq, struct task_struct *prev,
* We can terminate the balance pass as soon as we know there is
* a runnable task of @class priority or higher.
*/
- for_class_range(class, prev->sched_class, &idle_sched_class) {
+ for_balance_class_range(class, prev->sched_class, &idle_sched_class) {
if (class->balance(rq, prev, rf))
break;
}
@@ -6035,6 +6047,9 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
const struct sched_class *class;
struct task_struct *p;

+ if (scx_enabled())
+ goto restart;
+
/*
* Optimization: we know that if all tasks are in the fair class we can
* call that function directly, but only if the @prev task wasn't of a
@@ -6075,7 +6090,7 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
if (prev->dl_server)
prev->dl_server = NULL;

- for_each_class(class) {
+ for_each_active_class(class) {
p = class->pick_next_task(rq);
if (p)
return p;
@@ -6108,7 +6123,7 @@ static inline struct task_struct *pick_task(struct rq *rq)
const struct sched_class *class;
struct task_struct *p;

- for_each_class(class) {
+ for_each_active_class(class) {
p = class->pick_task(rq);
if (p)
return p;
@@ -10135,6 +10150,7 @@ void __init sched_init(void)
balance_push_set(smp_processor_id(), false);
#endif
init_sched_fair_class();
+ init_sched_ext_class();

psi_init();

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 972b7dd65af2..0827864c35ff 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -197,7 +197,9 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,

static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
{
- unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
+ unsigned long min, max;
+ unsigned long util = cpu_util_cfs_boost(sg_cpu->cpu) +
+ scx_cpuperf_target(sg_cpu->cpu);

util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
util = max(util, boost);
diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h
new file mode 100644
index 000000000000..42bd1e47b304
--- /dev/null
+++ b/kernel/sched/ext.h
@@ -0,0 +1,25 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifdef CONFIG_SCHED_CLASS_EXT
+#error "NOT IMPLEMENTED YET"
+#else /* CONFIG_SCHED_CLASS_EXT */
+
+#define scx_enabled() false
+
+static inline void scx_pre_fork(struct task_struct *p) {}
+static inline int scx_fork(struct task_struct *p) { return 0; }
+static inline void scx_post_fork(struct task_struct *p) {}
+static inline void scx_cancel_fork(struct task_struct *p) {}
+static inline void init_sched_ext_class(void) {}
+static inline u32 scx_cpuperf_target(s32 cpu) { return 0; }
+
+#define for_each_active_class for_each_class
+#define for_balance_class_range for_class_range
+
+#endif /* CONFIG_SCHED_CLASS_EXT */
+
+#if defined(CONFIG_SCHED_CLASS_EXT) && defined(CONFIG_SMP)
+#error "NOT IMPLEMENTED YET"
+#else
+static inline void scx_update_idle(struct rq *rq, bool idle) {}
+#endif
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 6135fbe83d68..3b6540cc436a 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -458,11 +458,13 @@ static void wakeup_preempt_idle(struct rq *rq, struct task_struct *p, int flags)

static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
{
+ scx_update_idle(rq, false);
}

static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool first)
{
update_idle_core(rq);
+ scx_update_idle(rq, true);
schedstat_inc(rq->sched_goidle);
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 11b4345d2638..fbd1e9ea8b18 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3540,4 +3540,6 @@ enum cpu_cftype_id {
extern struct cftype cpu_cftypes[CPU_CFTYPE_CNT + 1];
#endif /* CONFIG_CGROUP_SCHED */

+#include "ext.h"
+
#endif /* _KERNEL_SCHED_SCHED_H */
--
2.44.0


2024-05-01 15:17:09

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 16/39] sched_ext: Add sysrq-S which disables the BPF scheduler

This enables the admin to abort the BPF scheduler and revert to CFS anytime.
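
For example, from a shell with a BPF scheduler loaded:

    # echo S > /proc/sysrq-trigger
    # dmesg | tail -1
    sysrq: Disable sched_ext and revert all tasks to CFS

(the log line mirrors the .action_msg registered below; the exact prefix may
differ by configuration)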

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
drivers/tty/sysrq.c | 1 +
kernel/sched/build_policy.c | 1 +
kernel/sched/ext.c | 20 ++++++++++++++++++++
3 files changed, 22 insertions(+)

diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index 02217e3c916b..1ce3535cba6d 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -520,6 +520,7 @@ static const struct sysrq_key_op *sysrq_key_table[62] = {
NULL, /* P */
NULL, /* Q */
NULL, /* R */
+ /* S: May be registered by sched_ext for resetting */
NULL, /* S */
NULL, /* T */
NULL, /* U */
diff --git a/kernel/sched/build_policy.c b/kernel/sched/build_policy.c
index 2a2f10367ceb..e0e73b44afe9 100644
--- a/kernel/sched/build_policy.c
+++ b/kernel/sched/build_policy.c
@@ -31,6 +31,7 @@
#include <linux/suspend.h>
#include <linux/tsacct_kern.h>
#include <linux/vtime.h>
+#include <linux/sysrq.h>
#include <linux/percpu-rwsem.h>

#include <uapi/linux/sched/types.h>
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index d4f52209111f..e017b79aa1e7 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -20,6 +20,7 @@ enum scx_exit_kind {
SCX_EXIT_UNREG = 64, /* user-space initiated unregistration */
SCX_EXIT_UNREG_BPF, /* BPF-initiated unregistration */
SCX_EXIT_UNREG_KERN, /* kernel-initiated unregistration */
+ SCX_EXIT_SYSRQ, /* requested by 'S' sysrq */

SCX_EXIT_ERROR = 1024, /* runtime error, error msg contains details */
SCX_EXIT_ERROR_BPF, /* ERROR but triggered through scx_bpf_error() */
@@ -2767,6 +2768,8 @@ static const char *scx_exit_reason(enum scx_exit_kind kind)
return "Scheduler unregistered from BPF";
case SCX_EXIT_UNREG_KERN:
return "Scheduler unregistered from the main kernel";
+ case SCX_EXIT_SYSRQ:
+ return "disabled by sysrq-S";
case SCX_EXIT_ERROR:
return "runtime error";
case SCX_EXIT_ERROR_BPF:
@@ -3506,6 +3509,21 @@ static struct bpf_struct_ops bpf_sched_ext_ops = {
* System integration and init.
*/

+static void sysrq_handle_sched_ext_reset(u8 key)
+{
+ if (scx_ops_helper)
+ scx_ops_disable(SCX_EXIT_SYSRQ);
+ else
+ pr_info("sched_ext: BPF scheduler not yet used\n");
+}
+
+static const struct sysrq_key_op sysrq_sched_ext_reset_op = {
+ .handler = sysrq_handle_sched_ext_reset,
+ .help_msg = "reset-sched-ext(S)",
+ .action_msg = "Disable sched_ext and revert all tasks to CFS",
+ .enable_mask = SYSRQ_ENABLE_RTNICE,
+};
+
void __init init_sched_ext_class(void)
{
s32 cpu, v;
@@ -3529,6 +3547,8 @@ void __init init_sched_ext_class(void)
init_dsq(&rq->scx.local_dsq, SCX_DSQ_LOCAL);
INIT_LIST_HEAD(&rq->scx.runnable_list);
}
+
+ register_sysrq_key('S', &sysrq_sched_ext_reset_op);
}


--
2.44.0


2024-05-01 15:17:25

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 17/39] sched_ext: Implement runnable task stall watchdog

From: David Vernet <[email protected]>

The most common and critical way that a BPF scheduler can misbehave is by
failing to run runnable tasks for too long. This patch implements a
watchdog.

* All tasks record when they become runnable.

* A watchdog work periodically scans all runnable tasks. If any task has
stayed runnable for too long, the BPF scheduler is aborted.

* scheduler_tick() monitors whether the watchdog itself is stuck. If so, the
BPF scheduler is aborted.

Because the watchdog only scans tasks which are currently runnable, and does
so infrequently, the overhead should be negligible.
scx_qmap is updated so that it can be told to stall user and/or
kernel tasks.
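
A scheduler opts into a shorter timeout by setting ops.timeout_ms - e.g.
scx_qmap below uses:

    .timeout_ms = 5000U,

Leaving it zero keeps the maximum allowed timeout of 30 seconds.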

A detected task stall looks like the following:

sched_ext: BPF scheduler "qmap" errored, disabling
sched_ext: runnable task stall (dbus-daemon[953] failed to run for 6.478s)
scx_check_timeout_workfn+0x10e/0x1b0
process_one_work+0x287/0x560
worker_thread+0x234/0x420
kthread+0xe9/0x100
ret_from_fork+0x1f/0x30

A detected watchdog stall:

sched_ext: BPF scheduler "qmap" errored, disabling
sched_ext: runnable task stall (watchdog failed to check in for 5.001s)
scheduler_tick+0x2eb/0x340
update_process_times+0x7a/0x90
tick_sched_timer+0xd8/0x130
__hrtimer_run_queues+0x178/0x3b0
hrtimer_interrupt+0xfc/0x390
__sysvec_apic_timer_interrupt+0xb7/0x2b0
sysvec_apic_timer_interrupt+0x90/0xb0
asm_sysvec_apic_timer_interrupt+0x1b/0x20
default_idle+0x14/0x20
arch_cpu_idle+0xf/0x20
default_idle_call+0x50/0x90
do_idle+0xe8/0x240
cpu_startup_entry+0x1d/0x20
kernel_init+0x0/0x190
start_kernel+0x0/0x392
start_kernel+0x324/0x392
x86_64_start_reservations+0x2a/0x2c
x86_64_start_kernel+0x104/0x109
secondary_startup_64_no_verify+0xce/0xdb

Note that this patch exposes scx_ops_error[_type]() in kernel/sched/ext.h to
inline scx_notify_sched_tick().

v4: - While disabling, cancel_delayed_work_sync(&scx_watchdog_work) was
being called before forward progress was guaranteed and thus could
lead to system lockup. Relocated.

- While enabling, it was comparing msecs against jiffies without
conversion leading to spurious load failures on lower HZ kernels.
Fixed.

- runnable list management is now used by core bypass logic and moved to
the patch implementing sched_ext core.

v3: - bpf_scx_init_member() was incorrectly comparing ops->timeout_ms
against SCX_WATCHDOG_MAX_TIMEOUT which is in jiffies without
conversion leading to spurious load failures in lower HZ kernels.
Fixed.

v2: - Julia Lawall noticed that the watchdog code was mixing msecs and
jiffies. Fix by using jiffies for everything.

Signed-off-by: David Vernet <[email protected]>
Reviewed-by: Tejun Heo <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
Cc: Julia Lawall <[email protected]>
---
include/linux/sched/ext.h | 1 +
init/init_task.c | 1 +
kernel/sched/core.c | 1 +
kernel/sched/ext.c | 130 ++++++++++++++++++++++++++++++++-
kernel/sched/ext.h | 2 +
tools/sched_ext/scx_qmap.bpf.c | 12 +++
tools/sched_ext/scx_qmap.c | 12 ++-
7 files changed, 153 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index 65b1740a9a07..6a5a10092fb3 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -122,6 +122,7 @@ struct sched_ext_entity {
atomic_long_t ops_state;

struct list_head runnable_node; /* rq->scx.runnable_list */
+ unsigned long runnable_at;

u64 ddsp_dsq_id;
u64 ddsp_enq_flags;
diff --git a/init/init_task.c b/init/init_task.c
index ef349ebe52f8..b85635b7eed0 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -105,6 +105,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
.sticky_cpu = -1,
.holding_cpu = -1,
.runnable_node = LIST_HEAD_INIT(init_task.scx.runnable_node),
+ .runnable_at = INITIAL_JIFFIES,
.ddsp_dsq_id = SCX_DSQ_INVALID,
.slice = SCX_SLICE_DFL,
},
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ef8fa67c58de..5b921725e6b2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5733,6 +5733,7 @@ void scheduler_tick(void)
calc_global_load_tick(rq);
sched_core_tick(rq);
task_tick_mm_cid(rq, curr);
+ scx_tick(rq);

rq_unlock(rq, &rf);

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index e017b79aa1e7..63f47a9e1262 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -8,6 +8,7 @@

enum scx_consts {
SCX_DSP_DFL_MAX_BATCH = 32,
+ SCX_WATCHDOG_MAX_TIMEOUT = 30 * HZ,

SCX_EXIT_BT_LEN = 64,
SCX_EXIT_MSG_LEN = 1024,
@@ -24,6 +25,7 @@ enum scx_exit_kind {

SCX_EXIT_ERROR = 1024, /* runtime error, error msg contains details */
SCX_EXIT_ERROR_BPF, /* ERROR but triggered through scx_bpf_error() */
+ SCX_EXIT_ERROR_STALL, /* watchdog detected stalled runnable tasks */
};

/*
@@ -319,6 +321,15 @@ struct sched_ext_ops {
*/
u64 flags;

+ /**
+ * timeout_ms - The maximum amount of time, in milliseconds, that a
+ * runnable task should be able to wait before being scheduled. The
+ * maximum timeout may not exceed the default timeout of 30 seconds.
+ *
+ * Defaults to the maximum allowed timeout value of 30 seconds.
+ */
+ u32 timeout_ms;
+
/**
* name - BPF scheduler's name
*
@@ -472,6 +483,23 @@ struct static_key_false scx_has_op[SCX_OPI_END] =
static atomic_t scx_exit_kind = ATOMIC_INIT(SCX_EXIT_DONE);
static struct scx_exit_info *scx_exit_info;

+/*
+ * The maximum amount of time in jiffies that a task may be runnable without
+ * being scheduled on a CPU. If this timeout is exceeded, it will trigger
+ * scx_ops_error().
+ */
+static unsigned long scx_watchdog_timeout;
+
+/*
+ * The last time the delayed work was run. This delayed work relies on
+ * ksoftirqd being able to run to service timer interrupts, so it's possible
+ * that this work itself could get wedged. To account for this, we check that
+ * it's not stalled in the timer tick, and trigger an error if it is.
+ */
+static unsigned long scx_watchdog_timestamp = INITIAL_JIFFIES;
+
+static struct delayed_work scx_watchdog_work;
+
/* idle tracking */
#ifdef CONFIG_SMP
#ifdef CONFIG_CPUMASK_OFFSTACK
@@ -1158,6 +1186,11 @@ static void set_task_runnable(struct rq *rq, struct task_struct *p)
{
lockdep_assert_rq_held(rq);

+ if (p->scx.flags & SCX_TASK_RESET_RUNNABLE_AT) {
+ p->scx.runnable_at = jiffies;
+ p->scx.flags &= ~SCX_TASK_RESET_RUNNABLE_AT;
+ }
+
/*
* list_add_tail() must be used. scx_ops_bypass() depends on tasks being
+ * appended to the runnable_list.
@@ -1165,9 +1198,11 @@ static void set_task_runnable(struct rq *rq, struct task_struct *p)
list_add_tail(&p->scx.runnable_node, &rq->scx.runnable_list);
}

-static void clr_task_runnable(struct task_struct *p)
+static void clr_task_runnable(struct task_struct *p, bool reset_runnable_at)
{
list_del_init(&p->scx.runnable_node);
+ if (reset_runnable_at)
+ p->scx.flags |= SCX_TASK_RESET_RUNNABLE_AT;
}

static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags)
@@ -1205,7 +1240,8 @@ static void ops_dequeue(struct task_struct *p, u64 deq_flags)
{
unsigned long opss;

- clr_task_runnable(p);
+ /* dequeue is always temporary, don't reset runnable_at */
+ clr_task_runnable(p, false);

/* acquire ensures that we see the preceding updates on QUEUED */
opss = atomic_long_read_acquire(&p->scx.ops_state);
@@ -1818,7 +1854,7 @@ static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first)

p->se.exec_start = rq_clock_task(rq);

- clr_task_runnable(p);
+ clr_task_runnable(p, true);
}

static void put_prev_task_scx(struct rq *rq, struct task_struct *p)
@@ -2168,9 +2204,71 @@ static void reset_idle_masks(void) {}

#endif /* CONFIG_SMP */

-static void task_tick_scx(struct rq *rq, struct task_struct *curr, int queued)
+static bool check_rq_for_timeouts(struct rq *rq)
+{
+ struct task_struct *p;
+ struct rq_flags rf;
+ bool timed_out = false;
+
+ rq_lock_irqsave(rq, &rf);
+ list_for_each_entry(p, &rq->scx.runnable_list, scx.runnable_node) {
+ unsigned long last_runnable = p->scx.runnable_at;
+
+ if (unlikely(time_after(jiffies,
+ last_runnable + scx_watchdog_timeout))) {
+ u32 dur_ms = jiffies_to_msecs(jiffies - last_runnable);
+
+ scx_ops_error_kind(SCX_EXIT_ERROR_STALL,
+ "%s[%d] failed to run for %u.%03us",
+ p->comm, p->pid,
+ dur_ms / 1000, dur_ms % 1000);
+ timed_out = true;
+ break;
+ }
+ }
+ rq_unlock_irqrestore(rq, &rf);
+
+ return timed_out;
+}
+
+static void scx_watchdog_workfn(struct work_struct *work)
+{
+ int cpu;
+
+ WRITE_ONCE(scx_watchdog_timestamp, jiffies);
+
+ for_each_online_cpu(cpu) {
+ if (unlikely(check_rq_for_timeouts(cpu_rq(cpu))))
+ break;
+
+ cond_resched();
+ }
+ queue_delayed_work(system_unbound_wq, to_delayed_work(work),
+ scx_watchdog_timeout / 2);
+}
+
+void scx_tick(struct rq *rq)
{
+ unsigned long last_check;
+
+ if (!scx_enabled())
+ return;
+
+ last_check = READ_ONCE(scx_watchdog_timestamp);
+ if (unlikely(time_after(jiffies,
+ last_check + READ_ONCE(scx_watchdog_timeout)))) {
+ u32 dur_ms = jiffies_to_msecs(jiffies - last_check);
+
+ scx_ops_error_kind(SCX_EXIT_ERROR_STALL,
+ "watchdog failed to check in for %u.%03us",
+ dur_ms / 1000, dur_ms % 1000);
+ }
+
update_other_load_avgs(rq);
+}
+
+static void task_tick_scx(struct rq *rq, struct task_struct *curr, int queued)
+{
update_curr_scx(rq);

/*
@@ -2240,6 +2338,7 @@ static int scx_ops_init_task(struct task_struct *p, struct task_group *tg, bool

scx_set_task_state(p, SCX_TASK_INIT);

+ p->scx.flags |= SCX_TASK_RESET_RUNNABLE_AT;
return 0;
}

@@ -2318,6 +2417,7 @@ void init_scx_entity(struct sched_ext_entity *scx)
scx->sticky_cpu = -1;
scx->holding_cpu = -1;
INIT_LIST_HEAD(&scx->runnable_node);
+ scx->runnable_at = jiffies;
scx->ddsp_dsq_id = SCX_DSQ_INVALID;
scx->slice = SCX_SLICE_DFL;
}
@@ -2774,6 +2874,8 @@ static const char *scx_exit_reason(enum scx_exit_kind kind)
return "runtime error";
case SCX_EXIT_ERROR_BPF:
return "scx_bpf_error";
+ case SCX_EXIT_ERROR_STALL:
+ return "runnable task stall";
default:
return "<UNKNOWN>";
}
@@ -2883,6 +2985,8 @@ static void scx_ops_disable_workfn(struct kthread_work *work)
if (scx_ops.exit)
SCX_CALL_OP(SCX_KF_UNLOCKED, exit, ei);

+ cancel_delayed_work_sync(&scx_watchdog_work);
+
/*
* Delete the kobject from the hierarchy eagerly in addition to just
* dropping a reference. Otherwise, if the object is deleted
@@ -3005,6 +3109,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops)
{
struct scx_task_iter sti;
struct task_struct *p;
+ unsigned long timeout;
int i, ret;

mutex_lock(&scx_ops_enable_mutex);
@@ -3081,6 +3186,16 @@ static int scx_ops_enable(struct sched_ext_ops *ops)
goto err_disable;
}

+ if (ops->timeout_ms)
+ timeout = msecs_to_jiffies(ops->timeout_ms);
+ else
+ timeout = SCX_WATCHDOG_MAX_TIMEOUT;
+
+ WRITE_ONCE(scx_watchdog_timeout, timeout);
+ WRITE_ONCE(scx_watchdog_timestamp, jiffies);
+ queue_delayed_work(system_unbound_wq, &scx_watchdog_work,
+ scx_watchdog_timeout / 2);
+
/*
* Lock out forks before opening the floodgate so that they don't wander
* into the operations prematurely.
@@ -3393,6 +3508,12 @@ static int bpf_scx_init_member(const struct btf_type *t,
if (ret == 0)
return -EINVAL;
return 1;
+ case offsetof(struct sched_ext_ops, timeout_ms):
+ if (msecs_to_jiffies(*(u32 *)(udata + moff)) >
+ SCX_WATCHDOG_MAX_TIMEOUT)
+ return -E2BIG;
+ ops->timeout_ms = *(u32 *)(udata + moff);
+ return 1;
}

return 0;
@@ -3549,6 +3670,7 @@ void __init init_sched_ext_class(void)
}

register_sysrq_key('S', &sysrq_sched_ext_reset_op);
+ INIT_DELAYED_WORK(&scx_watchdog_work, scx_watchdog_workfn);
}


diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h
index 328abf5445b1..0a8717b306ba 100644
--- a/kernel/sched/ext.h
+++ b/kernel/sched/ext.h
@@ -29,6 +29,7 @@ static inline bool task_on_scx(const struct task_struct *p)
return scx_enabled() && p->sched_class == &ext_sched_class;
}

+void scx_tick(struct rq *rq);
void init_scx_entity(struct sched_ext_entity *scx);
void scx_pre_fork(struct task_struct *p);
int scx_fork(struct task_struct *p);
@@ -75,6 +76,7 @@ static inline const struct sched_class *next_active_class(const struct sched_cla
#define scx_enabled() false
#define scx_switched_all() false

+static inline void scx_tick(struct rq *rq) {}
static inline void scx_pre_fork(struct task_struct *p) {}
static inline int scx_fork(struct task_struct *p) { return 0; }
static inline void scx_post_fork(struct task_struct *p) {}
diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
index ad0328bb3c6a..927c4cd8b218 100644
--- a/tools/sched_ext/scx_qmap.bpf.c
+++ b/tools/sched_ext/scx_qmap.bpf.c
@@ -29,6 +29,8 @@ enum consts {
char _license[] SEC("license") = "GPL";

const volatile u64 slice_ns = SCX_SLICE_DFL;
+const volatile u32 stall_user_nth;
+const volatile u32 stall_kernel_nth;
const volatile u32 dsp_batch;
const volatile bool switch_partial;

@@ -130,11 +132,20 @@ static int weight_to_idx(u32 weight)

void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
{
+ static u32 user_cnt, kernel_cnt;
struct task_ctx *tctx;
u32 pid = p->pid;
int idx = weight_to_idx(p->scx.weight);
void *ring;

+ if (p->flags & PF_KTHREAD) {
+ if (stall_kernel_nth && !(++kernel_cnt % stall_kernel_nth))
+ return;
+ } else {
+ if (stall_user_nth && !(++user_cnt % stall_user_nth))
+ return;
+ }
+
if (test_error_cnt && !--test_error_cnt)
scx_bpf_error("test triggering error");

@@ -265,4 +276,5 @@ SCX_OPS_DEFINE(qmap_ops,
.init_task = (void *)qmap_init_task,
.init = (void *)qmap_init,
.exit = (void *)qmap_exit,
+ .timeout_ms = 5000U,
.name = "qmap");
diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c
index 1c0455c64f64..bce3f826cd6f 100644
--- a/tools/sched_ext/scx_qmap.c
+++ b/tools/sched_ext/scx_qmap.c
@@ -19,10 +19,12 @@ const char help_fmt[] =
"\n"
"See the top-level comment in .bpf.c for more details.\n"
"\n"
-"Usage: %s [-s SLICE_US] [-e COUNT] [-b COUNT] [-p] [-v]\n"
+"Usage: %s [-s SLICE_US] [-e COUNT] [-t COUNT] [-T COUNT] [-b COUNT] [-p] [-v]\n"
"\n"
" -s SLICE_US Override slice duration\n"
" -e COUNT Trigger scx_bpf_error() after COUNT enqueues\n"
+" -t COUNT Stall every COUNT'th user thread\n"
+" -T COUNT Stall every COUNT'th kernel thread\n"
" -b COUNT Dispatch upto COUNT tasks together\n"
" -p Switch only tasks on SCHED_EXT policy intead of all\n"
" -v Print libbpf debug messages\n"
@@ -55,7 +57,7 @@ int main(int argc, char **argv)

skel = SCX_OPS_OPEN(qmap_ops, scx_qmap);

- while ((opt = getopt(argc, argv, "s:e:b:pvh")) != -1) {
+ while ((opt = getopt(argc, argv, "s:e:t:T:b:pvh")) != -1) {
switch (opt) {
case 's':
skel->rodata->slice_ns = strtoull(optarg, NULL, 0) * 1000;
@@ -63,6 +65,12 @@ int main(int argc, char **argv)
case 'e':
skel->bss->test_error_cnt = strtoul(optarg, NULL, 0);
break;
+ case 't':
+ skel->rodata->stall_user_nth = strtoul(optarg, NULL, 0);
+ break;
+ case 'T':
+ skel->rodata->stall_kernel_nth = strtoul(optarg, NULL, 0);
+ break;
case 'b':
skel->rodata->dsp_batch = strtoul(optarg, NULL, 0);
break;
--
2.44.0


2024-05-01 15:17:33

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 18/39] sched_ext: Allow BPF schedulers to disallow specific tasks from joining SCHED_EXT

BPF schedulers might not want to schedule certain tasks - e.g. kernel
threads. This patch adds p->scx.disallow which can be set by BPF schedulers
in such cases. The field can be changed anytime and setting it in
ops.init_task() guarantees that the task can never be scheduled by
sched_ext.
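
For example, a scheduler which doesn't want to service kernel threads could
do the following (hypothetical sketch - the callback body is an assumption):

    s32 BPF_STRUCT_OPS(qmap_init_task, struct task_struct *p,
                       struct scx_init_task_args *args)
    {
            if (p->flags & PF_KTHREAD)
                    p->scx.disallow = true;
            return 0;
    }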

scx_qmap is updated with the -d option to disallow a specific PID:

# echo $$
1092
# grep -E '(policy)|(ext\.enabled)' /proc/self/sched
policy : 0
ext.enabled : 0
# ./set-scx 1092
# grep -E '(policy)|(ext\.enabled)' /proc/self/sched
policy : 7
ext.enabled : 0

Run "scx_qmap -d 1092" in another terminal.

# cat /sys/kernel/sched_ext/nr_rejected
1
# grep -E '(policy)|(ext\.enabled)' /proc/self/sched
policy : 0
ext.enabled : 1
# ./set-scx 1092
setparam failed for 1092 (Permission denied)

- v3: Update description to reflect /sys/kernel/sched_ext interface change.

- v2: Use atomic_long_t instead of atomic64_t for scx_kick_cpus_pnt_seqs to
accommodate 32bit archs.

Signed-off-by: Tejun Heo <[email protected]>
Suggested-by: Barret Rhoden <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
include/linux/sched/ext.h | 12 ++++++++
kernel/sched/core.c | 4 +++
kernel/sched/ext.c | 50 ++++++++++++++++++++++++++++++++++
kernel/sched/ext.h | 2 ++
tools/sched_ext/scx_qmap.bpf.c | 4 +++
tools/sched_ext/scx_qmap.c | 11 ++++++--
6 files changed, 81 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index 6a5a10092fb3..2608a8f548db 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -137,6 +137,18 @@ struct sched_ext_entity {
*/
u64 slice;

+ /*
+ * If set, reject future sched_setscheduler(2) calls updating the policy
+ * to %SCHED_EXT with -%EACCES.
+ *
+ * If set from ops.init_task() and the task's policy is already
+ * %SCHED_EXT, which can happen while the BPF scheduler is being loaded
+ * or by inheriting the parent's policy during fork, the task's policy is
+ * rejected and forcefully reverted to %SCHED_NORMAL. The number of
+ * such events is reported through /sys/kernel/sched_ext/nr_rejected.
+ */
+ bool disallow; /* reject switching into SCX */
+
/* cold fields */
/* must be the last field, see init_scx_entity() */
struct list_head tasks_node;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5b921725e6b2..aae9c1297622 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7866,6 +7866,10 @@ static int __sched_setscheduler(struct task_struct *p,
goto unlock;
}

+ retval = scx_check_setscheduler(p, policy);
+ if (retval)
+ goto unlock;
+
/*
* If not changing anything there's no need to proceed further,
* but store a possible modification of reset_on_fork.
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 63f47a9e1262..8d2ff81e8dd4 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -483,6 +483,8 @@ struct static_key_false scx_has_op[SCX_OPI_END] =
static atomic_t scx_exit_kind = ATOMIC_INIT(SCX_EXIT_DONE);
static struct scx_exit_info *scx_exit_info;

+static atomic_long_t scx_nr_rejected = ATOMIC_LONG_INIT(0);
+
/*
* The maximum amount of time in jiffies that a task may be runnable without
* being scheduled on a CPU. If this timeout is exceeded, it will trigger
@@ -2324,6 +2326,8 @@ static int scx_ops_init_task(struct task_struct *p, struct task_group *tg, bool
{
int ret;

+ p->scx.disallow = false;
+
if (SCX_HAS_OP(init_task)) {
struct scx_init_task_args args = {
.fork = fork,
@@ -2338,6 +2342,27 @@ static int scx_ops_init_task(struct task_struct *p, struct task_group *tg, bool

scx_set_task_state(p, SCX_TASK_INIT);

+ if (p->scx.disallow) {
+ struct rq *rq;
+ struct rq_flags rf;
+
+ rq = task_rq_lock(p, &rf);
+
+ /*
+ * We're either in fork or load path and @p->policy will be
+ * applied right after. Reverting @p->policy here and rejecting
+ * %SCHED_EXT transitions from scx_check_setscheduler()
+ * guarantees that if ops.init_task() sets @p->disallow, @p can
+ * never be in SCX.
+ */
+ if (p->policy == SCHED_EXT) {
+ p->policy = SCHED_NORMAL;
+ atomic_long_inc(&scx_nr_rejected);
+ }
+
+ task_rq_unlock(rq, p, &rf);
+ }
+
p->scx.flags |= SCX_TASK_RESET_RUNNABLE_AT;
return 0;
}
@@ -2541,6 +2566,18 @@ static void switched_from_scx(struct rq *rq, struct task_struct *p)
static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p,int wake_flags) {}
static void switched_to_scx(struct rq *rq, struct task_struct *p) {}

+int scx_check_setscheduler(struct task_struct *p, int policy)
+{
+ lockdep_assert_rq_held(task_rq(p));
+
+ /* if disallow, reject transitioning into SCX */
+ if (scx_enabled() && READ_ONCE(p->scx.disallow) &&
+ p->policy != policy && policy == SCHED_EXT)
+ return -EACCES;
+
+ return 0;
+}
+
/*
* Omitted operations:
*
@@ -2695,9 +2732,17 @@ static ssize_t scx_attr_switch_all_show(struct kobject *kobj,
}
SCX_ATTR(switch_all);

+static ssize_t scx_attr_nr_rejected_show(struct kobject *kobj,
+ struct kobj_attribute *ka, char *buf)
+{
+ return sysfs_emit(buf, "%ld\n", atomic_long_read(&scx_nr_rejected));
+}
+SCX_ATTR(nr_rejected);
+
static struct attribute *scx_global_attrs[] = {
&scx_attr_state.attr,
&scx_attr_switch_all.attr,
+ &scx_attr_nr_rejected.attr,
NULL,
};

@@ -3157,6 +3202,8 @@ static int scx_ops_enable(struct sched_ext_ops *ops)
atomic_set(&scx_exit_kind, SCX_EXIT_NONE);
scx_warned_zero_slice = false;

+ atomic_long_set(&scx_nr_rejected, 0);
+
/*
* Keep CPUs stable during enable so that the BPF scheduler can track
* online CPUs by watching ->on/offline_cpu() after ->init().
@@ -3456,6 +3503,9 @@ static int bpf_scx_btf_struct_access(struct bpf_verifier_log *log,
if (off >= offsetof(struct task_struct, scx.slice) &&
off + size <= offsetofend(struct task_struct, scx.slice))
return SCALAR_VALUE;
+ if (off >= offsetof(struct task_struct, scx.disallow) &&
+ off + size <= offsetofend(struct task_struct, scx.disallow))
+ return SCALAR_VALUE;
}

return -EACCES;
diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h
index 0a8717b306ba..2ea6c19d2462 100644
--- a/kernel/sched/ext.h
+++ b/kernel/sched/ext.h
@@ -35,6 +35,7 @@ void scx_pre_fork(struct task_struct *p);
int scx_fork(struct task_struct *p);
void scx_post_fork(struct task_struct *p);
void scx_cancel_fork(struct task_struct *p);
+int scx_check_setscheduler(struct task_struct *p, int policy);
bool task_should_scx(struct task_struct *p);
void init_sched_ext_class(void);

@@ -81,6 +82,7 @@ static inline void scx_pre_fork(struct task_struct *p) {}
static inline int scx_fork(struct task_struct *p) { return 0; }
static inline void scx_post_fork(struct task_struct *p) {}
static inline void scx_cancel_fork(struct task_struct *p) {}
+static inline int scx_check_setscheduler(struct task_struct *p, int policy) { return 0; }
static inline bool task_on_scx(const struct task_struct *p) { return false; }
static inline void init_sched_ext_class(void) {}
static inline u32 scx_cpuperf_target(s32 cpu) { return 0; }
diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
index 927c4cd8b218..e18f25017a0a 100644
--- a/tools/sched_ext/scx_qmap.bpf.c
+++ b/tools/sched_ext/scx_qmap.bpf.c
@@ -32,6 +32,7 @@ const volatile u64 slice_ns = SCX_SLICE_DFL;
const volatile u32 stall_user_nth;
const volatile u32 stall_kernel_nth;
const volatile u32 dsp_batch;
+const volatile s32 disallow_tgid;
const volatile bool switch_partial;

u32 test_error_cnt;
@@ -244,6 +245,9 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
s32 BPF_STRUCT_OPS(qmap_init_task, struct task_struct *p,
struct scx_init_task_args *args)
{
+ if (p->tgid == disallow_tgid)
+ p->scx.disallow = true;
+
/*
* @p is new. Let's ensure that its task_ctx is available. We can sleep
* in this function and the following will automatically use GFP_KERNEL.
diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c
index bce3f826cd6f..d2b98ef3ead2 100644
--- a/tools/sched_ext/scx_qmap.c
+++ b/tools/sched_ext/scx_qmap.c
@@ -19,13 +19,15 @@ const char help_fmt[] =
"\n"
"See the top-level comment in .bpf.c for more details.\n"
"\n"
-"Usage: %s [-s SLICE_US] [-e COUNT] [-t COUNT] [-T COUNT] [-b COUNT] [-p] [-v]\n"
+"Usage: %s [-s SLICE_US] [-e COUNT] [-t COUNT] [-T COUNT] [-b COUNT]\n"
+" [-d PID] [-p] [-v]\n"
"\n"
" -s SLICE_US Override slice duration\n"
" -e COUNT Trigger scx_bpf_error() after COUNT enqueues\n"
" -t COUNT Stall every COUNT'th user thread\n"
" -T COUNT Stall every COUNT'th kernel thread\n"
" -b COUNT Dispatch upto COUNT tasks together\n"
+" -d PID Disallow a process from switching into SCHED_EXT (-1 for self)\n"
" -p Switch only tasks on SCHED_EXT policy intead of all\n"
" -v Print libbpf debug messages\n"
" -h Display this help and exit\n";
@@ -57,7 +59,7 @@ int main(int argc, char **argv)

skel = SCX_OPS_OPEN(qmap_ops, scx_qmap);

- while ((opt = getopt(argc, argv, "s:e:t:T:b:pvh")) != -1) {
+ while ((opt = getopt(argc, argv, "s:e:t:T:b:d:pvh")) != -1) {
switch (opt) {
case 's':
skel->rodata->slice_ns = strtoull(optarg, NULL, 0) * 1000;
@@ -74,6 +76,11 @@ int main(int argc, char **argv)
case 'b':
skel->rodata->dsp_batch = strtoul(optarg, NULL, 0);
break;
+ case 'd':
+ skel->rodata->disallow_tgid = strtol(optarg, NULL, 0);
+ if (skel->rodata->disallow_tgid < 0)
+ skel->rodata->disallow_tgid = getpid();
+ break;
case 'p':
skel->rodata->switch_partial = true;
skel->struct_ops.qmap_ops->flags |= __COMPAT_SCX_OPS_SWITCH_PARTIAL;
--
2.44.0


2024-05-01 15:17:44

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 19/39] sched_ext: Print sched_ext info when dumping stack

From: David Vernet <[email protected]>

It would be useful to see which sched_ext scheduler is loaded and what
state it is in when we're dumping a task's stack. This patch therefore adds
a new print_scx_info() function that's called in the same contexts as
print_worker_info() and print_stop_info(). An example dump follows.

BUG: kernel NULL pointer dereference, address: 0000000000000999
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 0 P4D 0
Oops: 0002 [#1] PREEMPT SMP
CPU: 13 PID: 2047 Comm: insmod Tainted: G O 6.6.0-work-10323-gb58d4cae8e99-dirty #34
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS unknown 2/2/2022
Sched_ext: qmap (enabled+all), task: runnable_at=-17ms
RIP: 0010:init_module+0x9/0x1000 [test_module]
...
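
print_scx_info() is declared in include/linux/sched/ext.h with a no-op stub
when CONFIG_SCHED_CLASS_EXT is disabled, so other in-kernel dump paths could
use it the same way. A hypothetical sketch (not part of this patch):

  /* made-up debug helper mirroring the sched_show_task() hookup below */
  #include <linux/sched.h>
  #include <linux/sched/ext.h>

  static void example_dump_task(struct task_struct *p)
  {
          printk(KERN_INFO "dumping %s[%d]\n", p->comm, p->pid);
          /* prints nothing when no sched_ext scheduler is enabled */
          print_scx_info(KERN_INFO, p);
  }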

v3: - scx_ops_enable_state_str[] definition moved to an earlier patch as
it's now used by core implementation.

- Convert jiffy delta to msecs using jiffies_to_msecs() instead of
multiplying by (HZ / MSEC_PER_SEC). The conversion is implemented in
jiffies_delta_msecs().

v2: - We are now using scx_ops_enable_state_str[] outside
CONFIG_SCHED_DEBUG. Move it outside of CONFIG_SCHED_DEBUG and to the
top. This was reported by Changwoo and Andrea.

Signed-off-by: David Vernet <[email protected]>
Reported-by: Changwoo Min <[email protected]>
Reported-by: Andrea Righi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/sched/ext.h | 2 ++
kernel/sched/core.c | 1 +
kernel/sched/ext.c | 53 +++++++++++++++++++++++++++++++++++++++
lib/dump_stack.c | 1 +
4 files changed, 57 insertions(+)

diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index 2608a8f548db..123d6dffdf26 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -155,10 +155,12 @@ struct sched_ext_entity {
};

void sched_ext_free(struct task_struct *p);
+void print_scx_info(const char *log_lvl, struct task_struct *p);

#else /* !CONFIG_SCHED_CLASS_EXT */

static inline void sched_ext_free(struct task_struct *p) {}
+static inline void print_scx_info(const char *log_lvl, struct task_struct *p) {}

#endif /* CONFIG_SCHED_CLASS_EXT */
#endif /* _LINUX_SCHED_EXT_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index aae9c1297622..42fe654bf946 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9272,6 +9272,7 @@ void sched_show_task(struct task_struct *p)

print_worker_info(KERN_INFO, p);
print_stop_info(KERN_INFO, p);
+ print_scx_info(KERN_INFO, p);
show_stack(p, NULL, KERN_INFO);
put_task_stack(p);
}
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 8d2ff81e8dd4..ff080b5f0330 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -577,6 +577,14 @@ static __printf(3, 4) void scx_ops_exit_kind(enum scx_exit_kind kind,

#define SCX_HAS_OP(op) static_branch_likely(&scx_has_op[SCX_OP_IDX(op)])

+static long jiffies_delta_msecs(unsigned long at, unsigned long now)
+{
+ if (time_after(at, now))
+ return jiffies_to_msecs(at - now);
+ else
+ return -(long)jiffies_to_msecs(now - at);
+}
+
/* if the highest set bit is N, return a mask with bits [N+1, 31] set */
static u32 higher_bits(u32 flags)
{
@@ -3695,6 +3703,51 @@ static const struct sysrq_key_op sysrq_sched_ext_reset_op = {
.enable_mask = SYSRQ_ENABLE_RTNICE,
};

+/**
+ * print_scx_info - print out sched_ext scheduler state
+ * @log_lvl: the log level to use when printing
+ * @p: target task
+ *
+ * If a sched_ext scheduler is enabled, print the name and state of the
+ * scheduler. If @p is on sched_ext, print further information about the task.
+ *
+ * This function can be safely called on any task as long as the task_struct
+ * itself is accessible. While safe, this function isn't synchronized and may
+ * print out mixed-up or garbled values of limited length.
+ */
+void print_scx_info(const char *log_lvl, struct task_struct *p)
+{
+ enum scx_ops_enable_state state = scx_ops_enable_state();
+ const char *all = READ_ONCE(scx_switching_all) ? "+all" : "";
+ char runnable_at_buf[22] = "?";
+ struct sched_class *class;
+ unsigned long runnable_at;
+
+ if (state == SCX_OPS_DISABLED)
+ return;
+
+ /*
+ * Carefully check if the task was running on sched_ext, and then
+ * carefully copy the time it's been runnable, and its state.
+ */
+ if (copy_from_kernel_nofault(&class, &p->sched_class, sizeof(class)) ||
+ class != &ext_sched_class) {
+ printk("%sSched_ext: %s (%s%s)", log_lvl, scx_ops.name,
+ scx_ops_enable_state_str[state], all);
+ return;
+ }
+
+ if (!copy_from_kernel_nofault(&runnable_at, &p->scx.runnable_at,
+ sizeof(runnable_at)))
+ scnprintf(runnable_at_buf, sizeof(runnable_at_buf), "%+ldms",
+ jiffies_delta_msecs(runnable_at, jiffies));
+
+ /* print everything onto one line to conserve console space */
+ printk("%sSched_ext: %s (%s%s), task: runnable_at=%s",
+ log_lvl, scx_ops.name, scx_ops_enable_state_str[state], all,
+ runnable_at_buf);
+}
+
void __init init_sched_ext_class(void)
{
s32 cpu, v;
diff --git a/lib/dump_stack.c b/lib/dump_stack.c
index 222c6d6c8281..9581ef4efec5 100644
--- a/lib/dump_stack.c
+++ b/lib/dump_stack.c
@@ -68,6 +68,7 @@ void dump_stack_print_info(const char *log_lvl)

print_worker_info(log_lvl, current);
print_stop_info(log_lvl, current);
+ print_scx_info(log_lvl, current);
}

/**
--
2.44.0


2024-05-01 15:17:46

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 15/39] sched_ext: Add scx_simple and scx_qmap example schedulers

Add two simple example BPF schedulers - simple and qmap.

* simple: In terms of scheduling, it behaves identically to not having any
operation implemented at all. The two operations it implements exist only to
improve visibility and exit handling. On certain homogeneous
configurations, it can actually perform pretty well. A rough sketch in this
spirit follows the list below.

* qmap: A fixed five-level priority scheduler that demonstrates queueing
PIDs on BPF maps for scheduling. While not very practical, it is useful as
a simple example and will be used to demonstrate different features.
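
For a rough idea of the structure, here is a hedged sketch of a minimal
global-DSQ-only scheduler built on the helpers added in common.bpf.h. This
is not the scx_simple.bpf.c added by this patch; names such as "sketch" are
made up:

  #include <scx/common.bpf.h>

  char _license[] SEC("license") = "GPL";

  UEI_DEFINE(uei);

  void BPF_STRUCT_OPS(sketch_enqueue, struct task_struct *p, u64 enq_flags)
  {
          /* SCX_DSQ_GLOBAL and SCX_SLICE_DFL come from the vmlinux.h enums */
          scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
  }

  void BPF_STRUCT_OPS(sketch_exit, struct scx_exit_info *ei)
  {
          UEI_RECORD(uei, ei);
  }

  SCX_OPS_DEFINE(sketch_ops,
                 .enqueue = (void *)sketch_enqueue,
                 .exit    = (void *)sketch_exit,
                 .name    = "sketch");

On kernels that predate SCX_OPS_SWITCH_PARTIAL, an ops.init() calling
__COMPAT_scx_bpf_switch_all() would also be needed to switch all tasks, as
scx_qmap does below.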

v6: - Common header files reorganized and cleaned up. Compat helpers are
added to demonstrate how schedulers can maintain backward
compatibility with older kernels while making use of newly added
features.

- simple_select_cpu() added to keep track of the number of local
dispatches. This is needed because the default ops.select_cpu()
implementation is updated to dispatch directly and won't call
ops.enqueue().

- Updated to reflect the sched_ext API changes. Switching all tasks is
the default behavior now and scx_qmap supports partial switching when
`-p` is specified.

- tools/sched_ext/Kconfig dropped. This will be included in the doc
instead.

v5: - Improve Makefile. Build artifacts are now collected into a separate
dir which can be changed. Install and help targets are added and
clean now actually cleans everything.

- MEMBER_VPTR() reworked to improve access to structs. ARRAY_ELEM_PTR()
and RESIZABLE_ARRAY() are added to support resizable arrays in .bss.

- Add scx_common.h which provides common utilities to user code such as
SCX_BUG[_ON]() and RESIZE_ARRAY().

- Use SCX_BUG[_ON]() to simplify error handling.

v4: - Dropped _example prefix from scheduler names.

v3: - Rename scx_example_dummy to scx_example_simple and restructure a bit
to ease later additions. Comment updates.

- Added declarations for BPF inline iterators. In the future, hopefully,
these will be consolidated into a generic BPF header so that they
don't need to be replicated here.

v2: - Updated with the generic BPF cpumask helpers.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
Makefile | 8 +-
tools/Makefile | 10 +-
tools/sched_ext/.gitignore | 2 +
tools/sched_ext/Makefile | 246 ++++++++++++++++
.../sched_ext/include/bpf-compat/gnu/stubs.h | 11 +
tools/sched_ext/include/scx/common.bpf.h | 275 ++++++++++++++++++
tools/sched_ext/include/scx/common.h | 71 +++++
tools/sched_ext/include/scx/compat.bpf.h | 53 ++++
tools/sched_ext/include/scx/compat.h | 155 ++++++++++
tools/sched_ext/include/scx/user_exit_info.h | 64 ++++
tools/sched_ext/scx_qmap.bpf.c | 268 +++++++++++++++++
tools/sched_ext/scx_qmap.c | 100 +++++++
tools/sched_ext/scx_simple.bpf.c | 63 ++++
tools/sched_ext/scx_simple.c | 99 +++++++
14 files changed, 1423 insertions(+), 2 deletions(-)
create mode 100644 tools/sched_ext/.gitignore
create mode 100644 tools/sched_ext/Makefile
create mode 100644 tools/sched_ext/include/bpf-compat/gnu/stubs.h
create mode 100644 tools/sched_ext/include/scx/common.bpf.h
create mode 100644 tools/sched_ext/include/scx/common.h
create mode 100644 tools/sched_ext/include/scx/compat.bpf.h
create mode 100644 tools/sched_ext/include/scx/compat.h
create mode 100644 tools/sched_ext/include/scx/user_exit_info.h
create mode 100644 tools/sched_ext/scx_qmap.bpf.c
create mode 100644 tools/sched_ext/scx_qmap.c
create mode 100644 tools/sched_ext/scx_simple.bpf.c
create mode 100644 tools/sched_ext/scx_simple.c

diff --git a/Makefile b/Makefile
index 763b6792d3d5..3e8a804efa27 100644
--- a/Makefile
+++ b/Makefile
@@ -1344,6 +1344,12 @@ ifneq ($(wildcard $(resolve_btfids_O)),)
$(Q)$(MAKE) -sC $(srctree)/tools/bpf/resolve_btfids O=$(resolve_btfids_O) clean
endif

+tools-clean-targets := sched_ext
+PHONY += $(tools-clean-targets)
+$(tools-clean-targets):
+ $(Q)$(MAKE) -sC tools $@_clean
+tools_clean: $(tools-clean-targets)
+
# Clear a bunch of variables before executing the submake
ifeq ($(quiet),silent_)
tools_silent=s
@@ -1513,7 +1519,7 @@ PHONY += $(mrproper-dirs) mrproper
$(mrproper-dirs):
$(Q)$(MAKE) $(clean)=$(patsubst _mrproper_%,%,$@)

-mrproper: clean $(mrproper-dirs)
+mrproper: clean $(mrproper-dirs) tools_clean
$(call cmd,rmfiles)
@find . $(RCS_FIND_IGNORE) \
\( -name '*.rmeta' \) \
diff --git a/tools/Makefile b/tools/Makefile
index 276f5d0d53a4..278d24723b74 100644
--- a/tools/Makefile
+++ b/tools/Makefile
@@ -28,6 +28,7 @@ include scripts/Makefile.include
@echo ' pci - PCI tools'
@echo ' perf - Linux performance measurement and analysis tool'
@echo ' selftests - various kernel selftests'
+ @echo ' sched_ext - sched_ext example schedulers'
@echo ' bootconfig - boot config tool'
@echo ' spi - spi tools'
@echo ' tmon - thermal monitoring and tuning tool'
@@ -91,6 +92,9 @@ perf: FORCE
$(Q)mkdir -p $(PERF_O) .
$(Q)$(MAKE) --no-print-directory -C perf O=$(PERF_O) subdir=

+sched_ext: FORCE
+ $(call descend,sched_ext)
+
selftests: FORCE
$(call descend,testing/$@)

@@ -184,6 +188,9 @@ install: acpi_install counter_install cpupower_install gpio_install \
$(Q)mkdir -p $(PERF_O) .
$(Q)$(MAKE) --no-print-directory -C perf O=$(PERF_O) subdir= clean

+sched_ext_clean:
+ $(call descend,sched_ext,clean)
+
selftests_clean:
$(call descend,testing/$(@:_clean=),clean)

@@ -213,6 +220,7 @@ clean: acpi_clean counter_clean cpupower_clean hv_clean firewire_clean \
mm_clean bpf_clean iio_clean x86_energy_perf_policy_clean tmon_clean \
freefall_clean build_clean libbpf_clean libsubcmd_clean \
gpio_clean objtool_clean leds_clean wmi_clean pci_clean firmware_clean debugging_clean \
- intel-speed-select_clean tracing_clean thermal_clean thermometer_clean thermal-engine_clean
+ intel-speed-select_clean tracing_clean thermal_clean thermometer_clean thermal-engine_clean \
+ sched_ext_clean

.PHONY: FORCE
diff --git a/tools/sched_ext/.gitignore b/tools/sched_ext/.gitignore
new file mode 100644
index 000000000000..d6264fe1c8cd
--- /dev/null
+++ b/tools/sched_ext/.gitignore
@@ -0,0 +1,2 @@
+tools/
+build/
diff --git a/tools/sched_ext/Makefile b/tools/sched_ext/Makefile
new file mode 100644
index 000000000000..626782a21375
--- /dev/null
+++ b/tools/sched_ext/Makefile
@@ -0,0 +1,246 @@
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+include ../build/Build.include
+include ../scripts/Makefile.arch
+include ../scripts/Makefile.include
+
+all: all_targets
+
+ifneq ($(LLVM),)
+ifneq ($(filter %/,$(LLVM)),)
+LLVM_PREFIX := $(LLVM)
+else ifneq ($(filter -%,$(LLVM)),)
+LLVM_SUFFIX := $(LLVM)
+endif
+
+CLANG_TARGET_FLAGS_arm := arm-linux-gnueabi
+CLANG_TARGET_FLAGS_arm64 := aarch64-linux-gnu
+CLANG_TARGET_FLAGS_hexagon := hexagon-linux-musl
+CLANG_TARGET_FLAGS_m68k := m68k-linux-gnu
+CLANG_TARGET_FLAGS_mips := mipsel-linux-gnu
+CLANG_TARGET_FLAGS_powerpc := powerpc64le-linux-gnu
+CLANG_TARGET_FLAGS_riscv := riscv64-linux-gnu
+CLANG_TARGET_FLAGS_s390 := s390x-linux-gnu
+CLANG_TARGET_FLAGS_x86 := x86_64-linux-gnu
+CLANG_TARGET_FLAGS := $(CLANG_TARGET_FLAGS_$(ARCH))
+
+ifeq ($(CROSS_COMPILE),)
+ifeq ($(CLANG_TARGET_FLAGS),)
+$(error Specify CROSS_COMPILE or add '--target=' option to lib.mk)
+else
+CLANG_FLAGS += --target=$(CLANG_TARGET_FLAGS)
+endif # CLANG_TARGET_FLAGS
+else
+CLANG_FLAGS += --target=$(notdir $(CROSS_COMPILE:%-=%))
+endif # CROSS_COMPILE
+
+CC := $(LLVM_PREFIX)clang$(LLVM_SUFFIX) $(CLANG_FLAGS) -fintegrated-as
+else
+CC := $(CROSS_COMPILE)gcc
+endif # LLVM
+
+CURDIR := $(abspath .)
+TOOLSDIR := $(abspath ..)
+LIBDIR := $(TOOLSDIR)/lib
+BPFDIR := $(LIBDIR)/bpf
+TOOLSINCDIR := $(TOOLSDIR)/include
+BPFTOOLDIR := $(TOOLSDIR)/bpf/bpftool
+APIDIR := $(TOOLSINCDIR)/uapi
+GENDIR := $(abspath ../../include/generated)
+GENHDR := $(GENDIR)/autoconf.h
+
+ifeq ($(O),)
+OUTPUT_DIR := $(CURDIR)/build
+else
+OUTPUT_DIR := $(O)/build
+endif # O
+OBJ_DIR := $(OUTPUT_DIR)/obj
+INCLUDE_DIR := $(OUTPUT_DIR)/include
+BPFOBJ_DIR := $(OBJ_DIR)/libbpf
+SCXOBJ_DIR := $(OBJ_DIR)/sched_ext
+BINDIR := $(OUTPUT_DIR)/bin
+BPFOBJ := $(BPFOBJ_DIR)/libbpf.a
+ifneq ($(CROSS_COMPILE),)
+HOST_BUILD_DIR := $(OBJ_DIR)/host
+HOST_OUTPUT_DIR := host-tools
+HOST_INCLUDE_DIR := $(HOST_OUTPUT_DIR)/include
+else
+HOST_BUILD_DIR := $(OBJ_DIR)
+HOST_OUTPUT_DIR := $(OUTPUT_DIR)
+HOST_INCLUDE_DIR := $(INCLUDE_DIR)
+endif
+HOST_BPFOBJ := $(HOST_BUILD_DIR)/libbpf/libbpf.a
+RESOLVE_BTFIDS := $(HOST_BUILD_DIR)/resolve_btfids/resolve_btfids
+DEFAULT_BPFTOOL := $(HOST_OUTPUT_DIR)/sbin/bpftool
+
+VMLINUX_BTF_PATHS ?= $(if $(O),$(O)/vmlinux) \
+ $(if $(KBUILD_OUTPUT),$(KBUILD_OUTPUT)/vmlinux) \
+ ../../vmlinux \
+ /sys/kernel/btf/vmlinux \
+ /boot/vmlinux-$(shell uname -r)
+VMLINUX_BTF ?= $(abspath $(firstword $(wildcard $(VMLINUX_BTF_PATHS))))
+ifeq ($(VMLINUX_BTF),)
+$(error Cannot find a vmlinux for VMLINUX_BTF at any of "$(VMLINUX_BTF_PATHS)")
+endif
+
+BPFTOOL ?= $(DEFAULT_BPFTOOL)
+
+ifneq ($(wildcard $(GENHDR)),)
+ GENFLAGS := -DHAVE_GENHDR
+endif
+
+CFLAGS += -g -O2 -rdynamic -pthread -Wall -Werror $(GENFLAGS) \
+ -I$(INCLUDE_DIR) -I$(GENDIR) -I$(LIBDIR) \
+ -I$(TOOLSINCDIR) -I$(APIDIR) -I$(CURDIR)/include
+
+# Silence some warnings when compiled with clang
+ifneq ($(LLVM),)
+CFLAGS += -Wno-unused-command-line-argument
+endif
+
+LDFLAGS = -lelf -lz -lpthread
+
+IS_LITTLE_ENDIAN = $(shell $(CC) -dM -E - </dev/null | \
+ grep 'define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__')
+
+# Get Clang's default includes on this system, as opposed to those seen by
+# '-target bpf'. This fixes "missing" files on some architectures/distros,
+# such as asm/byteorder.h, asm/socket.h, asm/sockios.h, sys/cdefs.h etc.
+#
+# Use '-idirafter': Don't interfere with include mechanics except where the
+# build would have failed anyways.
+define get_sys_includes
+$(shell $(1) -v -E - </dev/null 2>&1 \
+ | sed -n '/<...> search starts here:/,/End of search list./{ s| \(/.*\)|-idirafter \1|p }') \
+$(shell $(1) -dM -E - </dev/null | grep '__riscv_xlen ' | awk '{printf("-D__riscv_xlen=%d -D__BITS_PER_LONG=%d", $$3, $$3)}')
+endef
+
+BPF_CFLAGS = -g -D__TARGET_ARCH_$(SRCARCH) \
+ $(if $(IS_LITTLE_ENDIAN),-mlittle-endian,-mbig-endian) \
+ -I$(CURDIR)/include -I$(CURDIR)/include/bpf-compat \
+ -I$(INCLUDE_DIR) -I$(APIDIR) \
+ -I../../include \
+ $(call get_sys_includes,$(CLANG)) \
+ -Wall -Wno-compare-distinct-pointer-types \
+ -O2 -mcpu=v3
+
+# sort removes libbpf duplicates when not cross-building
+MAKE_DIRS := $(sort $(OBJ_DIR)/libbpf $(HOST_BUILD_DIR)/libbpf \
+ $(HOST_BUILD_DIR)/bpftool $(HOST_BUILD_DIR)/resolve_btfids \
+ $(INCLUDE_DIR) $(SCXOBJ_DIR) $(BINDIR))
+
+$(MAKE_DIRS):
+ $(call msg,MKDIR,,$@)
+ $(Q)mkdir -p $@
+
+$(BPFOBJ): $(wildcard $(BPFDIR)/*.[ch] $(BPFDIR)/Makefile) \
+ $(APIDIR)/linux/bpf.h \
+ | $(OBJ_DIR)/libbpf
+ $(Q)$(MAKE) $(submake_extras) -C $(BPFDIR) OUTPUT=$(OBJ_DIR)/libbpf/ \
+ EXTRA_CFLAGS='-g -O0 -fPIC' \
+ DESTDIR=$(OUTPUT_DIR) prefix= all install_headers
+
+$(DEFAULT_BPFTOOL): $(wildcard $(BPFTOOLDIR)/*.[ch] $(BPFTOOLDIR)/Makefile) \
+ $(HOST_BPFOBJ) | $(HOST_BUILD_DIR)/bpftool
+ $(Q)$(MAKE) $(submake_extras) -C $(BPFTOOLDIR) \
+ ARCH= CROSS_COMPILE= CC=$(HOSTCC) LD=$(HOSTLD) \
+ EXTRA_CFLAGS='-g -O0' \
+ OUTPUT=$(HOST_BUILD_DIR)/bpftool/ \
+ LIBBPF_OUTPUT=$(HOST_BUILD_DIR)/libbpf/ \
+ LIBBPF_DESTDIR=$(HOST_OUTPUT_DIR)/ \
+ prefix= DESTDIR=$(HOST_OUTPUT_DIR)/ install-bin
+
+$(INCLUDE_DIR)/vmlinux.h: $(VMLINUX_BTF) $(BPFTOOL) | $(INCLUDE_DIR)
+ifeq ($(VMLINUX_H),)
+ $(call msg,GEN,,$@)
+ $(Q)$(BPFTOOL) btf dump file $(VMLINUX_BTF) format c > $@
+else
+ $(call msg,CP,,$@)
+ $(Q)cp "$(VMLINUX_H)" $@
+endif
+
+$(SCXOBJ_DIR)/%.bpf.o: %.bpf.c $(INCLUDE_DIR)/vmlinux.h include/scx/*.h \
+ | $(BPFOBJ) $(SCXOBJ_DIR)
+ $(call msg,CLNG-BPF,,$(notdir $@))
+ $(Q)$(CLANG) $(BPF_CFLAGS) -target bpf -c $< -o $@
+
+$(INCLUDE_DIR)/%.bpf.skel.h: $(SCXOBJ_DIR)/%.bpf.o $(INCLUDE_DIR)/vmlinux.h $(BPFTOOL)
+ $(eval sched=$(notdir $@))
+ $(call msg,GEN-SKEL,,$(sched))
+ $(Q)$(BPFTOOL) gen object $(<:.o=.linked1.o) $<
+ $(Q)$(BPFTOOL) gen object $(<:.o=.linked2.o) $(<:.o=.linked1.o)
+ $(Q)$(BPFTOOL) gen object $(<:.o=.linked3.o) $(<:.o=.linked2.o)
+ $(Q)diff $(<:.o=.linked2.o) $(<:.o=.linked3.o)
+ $(Q)$(BPFTOOL) gen skeleton $(<:.o=.linked3.o) name $(subst .bpf.skel.h,,$(sched)) > $@
+ $(Q)$(BPFTOOL) gen subskeleton $(<:.o=.linked3.o) name $(subst .bpf.skel.h,,$(sched)) > $(@:.skel.h=.subskel.h)
+
+SCX_COMMON_DEPS := include/scx/common.h include/scx/user_exit_info.h | $(BINDIR)
+
+c-sched-targets = scx_simple scx_qmap
+
+$(addprefix $(BINDIR)/,$(c-sched-targets)): \
+ $(BINDIR)/%: \
+ $(filter-out %.bpf.c,%.c) \
+ $(INCLUDE_DIR)/%.bpf.skel.h \
+ $(SCX_COMMON_DEPS)
+ $(eval sched=$(notdir $@))
+ $(CC) $(CFLAGS) -c $(sched).c -o $(SCXOBJ_DIR)/$(sched).o
+ $(CC) -o $@ $(SCXOBJ_DIR)/$(sched).o $(HOST_BPFOBJ) $(LDFLAGS)
+
+$(c-sched-targets): %: $(BINDIR)/%
+
+install: all
+ $(Q)mkdir -p $(DESTDIR)/usr/local/bin/
+ $(Q)cp $(BINDIR)/* $(DESTDIR)/usr/local/bin/
+
+clean:
+ rm -rf $(OUTPUT_DIR) $(HOST_OUTPUT_DIR)
+ rm -f *.o *.bpf.o *.bpf.skel.h *.bpf.subskel.h
+ rm -f $(c-sched-targets)
+
+help:
+ @echo 'Building targets'
+ @echo '================'
+ @echo ''
+ @echo ' all - Compile all schedulers'
+ @echo ''
+ @echo 'Alternatively, you may compile individual schedulers:'
+ @echo ''
+ @printf ' %s\n' $(c-sched-targets)
+ @echo ''
+ @echo 'For any scheduler build target, you may specify an alternative'
+ @echo 'build output path with the O= environment variable. For example:'
+ @echo ''
+ @echo ' O=/tmp/sched_ext make all'
+ @echo ''
+ @echo 'will compile all schedulers, and emit the build artifacts to'
+ @echo '/tmp/sched_ext/build.'
+ @echo ''
+ @echo ''
+ @echo 'Installing targets'
+ @echo '=================='
+ @echo ''
+ @echo ' install - Compile and install all schedulers to /usr/local/bin.'
+ @echo ' You may specify the DESTDIR= environment variable'
+ @echo ' to indicate a prefix for /usr/local/bin. For example:'
+ @echo ''
+ @echo ' DESTDIR=/tmp/sched_ext make install'
+ @echo ''
+ @echo ' will build the schedulers in CWD/build, and'
+ @echo ' install the schedulers to /tmp/sched_ext/usr/local/bin.'
+ @echo ''
+ @echo ''
+ @echo 'Cleaning targets'
+ @echo '================'
+ @echo ''
+ @echo ' clean - Remove all generated files'
+
+all_targets: $(c-sched-targets)
+
+.PHONY: all all_targets $(c-sched-targets) clean help
+
+# delete failed targets
+.DELETE_ON_ERROR:
+
+# keep intermediate (.bpf.skel.h, .bpf.o, etc) targets
+.SECONDARY:
diff --git a/tools/sched_ext/include/bpf-compat/gnu/stubs.h b/tools/sched_ext/include/bpf-compat/gnu/stubs.h
new file mode 100644
index 000000000000..ad7d139ce907
--- /dev/null
+++ b/tools/sched_ext/include/bpf-compat/gnu/stubs.h
@@ -0,0 +1,11 @@
+/*
+ * Dummy gnu/stubs.h. clang can end up including /usr/include/gnu/stubs.h when
+ * compiling BPF files although its content doesn't play any role. The file in
+ * turn includes stubs-64.h or stubs-32.h depending on whether __x86_64__ is
+ * defined. When compiling a BPF source, __x86_64__ isn't set and thus
+ * stubs-32.h is selected. However, the file is not there if the system doesn't
+ * have 32bit glibc devel package installed leading to a build failure.
+ *
+ * The problem is worked around by making this file available in the include
+ * search paths before the system one when building BPF.
+ */
diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
new file mode 100644
index 000000000000..6b355899f67d
--- /dev/null
+++ b/tools/sched_ext/include/scx/common.bpf.h
@@ -0,0 +1,275 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2022 Tejun Heo <[email protected]>
+ * Copyright (c) 2022 David Vernet <[email protected]>
+ */
+#ifndef __SCX_COMMON_BPF_H
+#define __SCX_COMMON_BPF_H
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <asm-generic/errno.h>
+#include "user_exit_info.h"
+
+#define PF_WQ_WORKER 0x00000020 /* I'm a workqueue worker */
+#define PF_KTHREAD 0x00200000 /* I am a kernel thread */
+#define PF_EXITING 0x00000004
+#define CLOCK_MONOTONIC 1
+
+/*
+ * Earlier versions of clang/pahole lost upper 32bits in 64bit enums which can
+ * lead to really confusing misbehaviors. Let's trigger a build failure.
+ */
+static inline void ___vmlinux_h_sanity_check___(void)
+{
+ _Static_assert(SCX_DSQ_FLAG_BUILTIN,
+ "bpftool generated vmlinux.h is missing high bits for 64bit enums, upgrade clang and pahole");
+}
+
+s32 scx_bpf_create_dsq(u64 dsq_id, s32 node) __ksym;
+s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, bool *is_idle) __ksym;
+void scx_bpf_dispatch(struct task_struct *p, u64 dsq_id, u64 slice, u64 enq_flags) __ksym;
+u32 scx_bpf_dispatch_nr_slots(void) __ksym;
+void scx_bpf_dispatch_cancel(void) __ksym;
+bool scx_bpf_consume(u64 dsq_id) __ksym;
+s32 scx_bpf_dsq_nr_queued(u64 dsq_id) __ksym;
+void scx_bpf_destroy_dsq(u64 dsq_id) __ksym;
+void scx_bpf_exit_bstr(s64 exit_code, char *fmt, unsigned long long *data, u32 data__sz) __ksym __weak;
+void scx_bpf_error_bstr(char *fmt, unsigned long long *data, u32 data_len) __ksym;
+u32 scx_bpf_nr_cpu_ids(void) __ksym __weak;
+const struct cpumask *scx_bpf_get_possible_cpumask(void) __ksym __weak;
+const struct cpumask *scx_bpf_get_online_cpumask(void) __ksym __weak;
+void scx_bpf_put_cpumask(const struct cpumask *cpumask) __ksym __weak;
+const struct cpumask *scx_bpf_get_idle_cpumask(void) __ksym;
+const struct cpumask *scx_bpf_get_idle_smtmask(void) __ksym;
+void scx_bpf_put_idle_cpumask(const struct cpumask *cpumask) __ksym;
+bool scx_bpf_test_and_clear_cpu_idle(s32 cpu) __ksym;
+s32 scx_bpf_pick_idle_cpu(const cpumask_t *cpus_allowed, u64 flags) __ksym;
+s32 scx_bpf_pick_any_cpu(const cpumask_t *cpus_allowed, u64 flags) __ksym;
+bool scx_bpf_task_running(const struct task_struct *p) __ksym;
+s32 scx_bpf_task_cpu(const struct task_struct *p) __ksym;
+
+static inline __attribute__((format(printf, 1, 2)))
+void ___scx_bpf_exit_format_checker(const char *fmt, ...) {}
+
+/*
+ * Helper macro for initializing the fmt and variadic argument inputs to both
+ * bstr exit kfuncs. Callers to this function should use ___fmt and ___param to
+ * refer to the initialized list of inputs to the bstr kfunc.
+ */
+#define scx_bpf_exit_preamble(fmt, args...) \
+ static char ___fmt[] = fmt; \
+ /* \
+ * Note that __param[] must have at least one \
+ * element to keep the verifier happy. \
+ */ \
+ unsigned long long ___param[___bpf_narg(args) ?: 1] = {}; \
+ \
+ _Pragma("GCC diagnostic push") \
+ _Pragma("GCC diagnostic ignored \"-Wint-conversion\"") \
+ ___bpf_fill(___param, args); \
+ _Pragma("GCC diagnostic pop") \
+
+/*
+ * scx_bpf_exit() wraps the scx_bpf_exit_bstr() kfunc with variadic arguments
+ * instead of an array of u64. Using this macro will cause the scheduler to
+ * exit cleanly with the specified exit code being passed to user space.
+ */
+#define scx_bpf_exit(code, fmt, args...) \
+({ \
+ scx_bpf_exit_preamble(fmt, args) \
+ scx_bpf_exit_bstr(code, ___fmt, ___param, sizeof(___param)); \
+ ___scx_bpf_exit_format_checker(fmt, ##args); \
+})
+
+/*
+ * scx_bpf_error() wraps the scx_bpf_error_bstr() kfunc with variadic arguments
+ * instead of an array of u64. Invoking this macro will cause the scheduler to
+ * exit in an erroneous state, with diagnostic information being passed to the
+ * user.
+ */
+#define scx_bpf_error(fmt, args...) \
+({ \
+ scx_bpf_exit_preamble(fmt, args) \
+ scx_bpf_error_bstr(___fmt, ___param, sizeof(___param)); \
+ ___scx_bpf_exit_format_checker(fmt, ##args); \
+})
+
+#define BPF_STRUCT_OPS(name, args...) \
+SEC("struct_ops/"#name) \
+BPF_PROG(name, ##args)
+
+#define BPF_STRUCT_OPS_SLEEPABLE(name, args...) \
+SEC("struct_ops.s/"#name) \
+BPF_PROG(name, ##args)
+
+/**
+ * RESIZABLE_ARRAY - Generates annotations for an array that may be resized
+ * @elfsec: the data section of the BPF program in which to place the array
+ * @arr: the name of the array
+ *
+ * libbpf has an API for setting map value sizes. Since data sections (i.e.
+ * bss, data, rodata) themselves are maps, a data section can be resized. If
+ * a data section has an array as its last element, the BTF info for that
+ * array will be adjusted so that length of the array is extended to meet the
+ * new length of the data section. This macro annotates an array to have an
+ * element count of one with the assumption that this array can be resized
+ * within the userspace program. It also annotates the section specifier so
+ * this array exists in a custom sub data section which can be resized
+ * independently.
+ *
+ * See RESIZE_ARRAY() for the userspace convenience macro for resizing an
+ * array declared with RESIZABLE_ARRAY().
+ */
+#define RESIZABLE_ARRAY(elfsec, arr) arr[1] SEC("."#elfsec"."#arr)
+
+/**
+ * MEMBER_VPTR - Obtain the verified pointer to a struct or array member
+ * @base: struct or array to index
+ * @member: dereferenced member (e.g. .field, [idx0][idx1], .field[idx0] ...)
+ *
+ * The verifier often gets confused by the instruction sequence the compiler
+ * generates for indexing struct fields or arrays. This macro forces the
+ * compiler to generate a code sequence which first calculates the byte offset,
+ * checks it against the struct or array size and add that byte offset to
+ * generate the pointer to the member to help the verifier.
+ *
+ * Ideally, we want to abort if the calculated offset is out-of-bounds. However,
+ * BPF currently doesn't support abort, so evaluate to %NULL instead. The caller
+ * must check for %NULL and take appropriate action to appease the verifier. To
+ * avoid confusing the verifier, it's best to check for %NULL and dereference
+ * immediately.
+ *
+ * vptr = MEMBER_VPTR(my_array, [i][j]);
+ * if (!vptr)
+ * return error;
+ * *vptr = new_value;
+ *
+ * sizeof(@base) should encompass the memory area to be accessed and thus can't
+ * be a pointer to the area. Use `MEMBER_VPTR(*ptr, .member)` instead of
+ * `MEMBER_VPTR(ptr, ->member)`.
+ */
+#define MEMBER_VPTR(base, member) (typeof((base) member) *) \
+({ \
+ u64 __base = (u64)&(base); \
+ u64 __addr = (u64)&((base) member) - __base; \
+ _Static_assert(sizeof(base) >= sizeof((base) member), \
+ "@base is smaller than @member, is @base a pointer?"); \
+ asm volatile ( \
+ "if %0 <= %[max] goto +2\n" \
+ "%0 = 0\n" \
+ "goto +1\n" \
+ "%0 += %1\n" \
+ : "+r"(__addr) \
+ : "r"(__base), \
+ [max]"i"(sizeof(base) - sizeof((base) member))); \
+ __addr; \
+})
+
+/**
+ * ARRAY_ELEM_PTR - Obtain the verified pointer to an array element
+ * @arr: array to index into
+ * @i: array index
+ * @n: number of elements in array
+ *
+ * Similar to MEMBER_VPTR() but is intended for use with arrays where the
+ * element count needs to be explicit.
+ * It can be used in cases where a global array is defined with an initial
+ * size but is intended to be resized before loading the BPF program.
+ * Without this version of the macro, MEMBER_VPTR() will use the compile time
+ * size of the array to compute the max, which will result in rejection by
+ * the verifier.
+ */
+#define ARRAY_ELEM_PTR(arr, i, n) (typeof(arr[i]) *) \
+({ \
+ u64 __base = (u64)arr; \
+ u64 __addr = (u64)&(arr[i]) - __base; \
+ asm volatile ( \
+ "if %0 <= %[max] goto +2\n" \
+ "%0 = 0\n" \
+ "goto +1\n" \
+ "%0 += %1\n" \
+ : "+r"(__addr) \
+ : "r"(__base), \
+ [max]"r"(sizeof(arr[0]) * ((n) - 1))); \
+ __addr; \
+})
+
+/*
+ * BPF core and other generic helpers
+ */
+
+/* list and rbtree */
+#define __contains(name, node) __attribute__((btf_decl_tag("contains:" #name ":" #node)))
+#define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8)))
+
+void *bpf_obj_new_impl(__u64 local_type_id, void *meta) __ksym;
+void bpf_obj_drop_impl(void *kptr, void *meta) __ksym;
+
+#define bpf_obj_new(type) ((type *)bpf_obj_new_impl(bpf_core_type_id_local(type), NULL))
+#define bpf_obj_drop(kptr) bpf_obj_drop_impl(kptr, NULL)
+
+void bpf_list_push_front(struct bpf_list_head *head, struct bpf_list_node *node) __ksym;
+void bpf_list_push_back(struct bpf_list_head *head, struct bpf_list_node *node) __ksym;
+struct bpf_list_node *bpf_list_pop_front(struct bpf_list_head *head) __ksym;
+struct bpf_list_node *bpf_list_pop_back(struct bpf_list_head *head) __ksym;
+struct bpf_rb_node *bpf_rbtree_remove(struct bpf_rb_root *root,
+ struct bpf_rb_node *node) __ksym;
+int bpf_rbtree_add_impl(struct bpf_rb_root *root, struct bpf_rb_node *node,
+ bool (less)(struct bpf_rb_node *a, const struct bpf_rb_node *b),
+ void *meta, __u64 off) __ksym;
+#define bpf_rbtree_add(head, node, less) bpf_rbtree_add_impl(head, node, less, NULL, 0)
+
+struct bpf_rb_node *bpf_rbtree_first(struct bpf_rb_root *root) __ksym;
+
+void *bpf_refcount_acquire_impl(void *kptr, void *meta) __ksym;
+#define bpf_refcount_acquire(kptr) bpf_refcount_acquire_impl(kptr, NULL)
+
+/* task */
+struct task_struct *bpf_task_from_pid(s32 pid) __ksym;
+struct task_struct *bpf_task_acquire(struct task_struct *p) __ksym;
+void bpf_task_release(struct task_struct *p) __ksym;
+
+/* cgroup */
+struct cgroup *bpf_cgroup_ancestor(struct cgroup *cgrp, int level) __ksym;
+void bpf_cgroup_release(struct cgroup *cgrp) __ksym;
+struct cgroup *bpf_cgroup_from_id(u64 cgid) __ksym;
+
+/* cpumask */
+struct bpf_cpumask *bpf_cpumask_create(void) __ksym;
+struct bpf_cpumask *bpf_cpumask_acquire(struct bpf_cpumask *cpumask) __ksym;
+void bpf_cpumask_release(struct bpf_cpumask *cpumask) __ksym;
+u32 bpf_cpumask_first(const struct cpumask *cpumask) __ksym;
+u32 bpf_cpumask_first_zero(const struct cpumask *cpumask) __ksym;
+void bpf_cpumask_set_cpu(u32 cpu, struct bpf_cpumask *cpumask) __ksym;
+void bpf_cpumask_clear_cpu(u32 cpu, struct bpf_cpumask *cpumask) __ksym;
+bool bpf_cpumask_test_cpu(u32 cpu, const struct cpumask *cpumask) __ksym;
+bool bpf_cpumask_test_and_set_cpu(u32 cpu, struct bpf_cpumask *cpumask) __ksym;
+bool bpf_cpumask_test_and_clear_cpu(u32 cpu, struct bpf_cpumask *cpumask) __ksym;
+void bpf_cpumask_setall(struct bpf_cpumask *cpumask) __ksym;
+void bpf_cpumask_clear(struct bpf_cpumask *cpumask) __ksym;
+bool bpf_cpumask_and(struct bpf_cpumask *dst, const struct cpumask *src1,
+ const struct cpumask *src2) __ksym;
+void bpf_cpumask_or(struct bpf_cpumask *dst, const struct cpumask *src1,
+ const struct cpumask *src2) __ksym;
+void bpf_cpumask_xor(struct bpf_cpumask *dst, const struct cpumask *src1,
+ const struct cpumask *src2) __ksym;
+bool bpf_cpumask_equal(const struct cpumask *src1, const struct cpumask *src2) __ksym;
+bool bpf_cpumask_intersects(const struct cpumask *src1, const struct cpumask *src2) __ksym;
+bool bpf_cpumask_subset(const struct cpumask *src1, const struct cpumask *src2) __ksym;
+bool bpf_cpumask_empty(const struct cpumask *cpumask) __ksym;
+bool bpf_cpumask_full(const struct cpumask *cpumask) __ksym;
+void bpf_cpumask_copy(struct bpf_cpumask *dst, const struct cpumask *src) __ksym;
+u32 bpf_cpumask_any_distribute(const struct cpumask *cpumask) __ksym;
+u32 bpf_cpumask_any_and_distribute(const struct cpumask *src1,
+ const struct cpumask *src2) __ksym;
+
+/* rcu */
+void bpf_rcu_read_lock(void) __ksym;
+void bpf_rcu_read_unlock(void) __ksym;
+
+#include "compat.bpf.h"
+
+#endif /* __SCX_COMMON_BPF_H */
diff --git a/tools/sched_ext/include/scx/common.h b/tools/sched_ext/include/scx/common.h
new file mode 100644
index 000000000000..8d5a6775f64d
--- /dev/null
+++ b/tools/sched_ext/include/scx/common.h
@@ -0,0 +1,71 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2023 Tejun Heo <[email protected]>
+ * Copyright (c) 2023 David Vernet <[email protected]>
+ */
+#ifndef __SCHED_EXT_COMMON_H
+#define __SCHED_EXT_COMMON_H
+
+#ifdef __KERNEL__
+#error "Should not be included by BPF programs"
+#endif
+
+#include <stdarg.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <errno.h>
+
+typedef uint8_t u8;
+typedef uint16_t u16;
+typedef uint32_t u32;
+typedef uint64_t u64;
+typedef int8_t s8;
+typedef int16_t s16;
+typedef int32_t s32;
+typedef int64_t s64;
+
+#define SCX_BUG(__fmt, ...) \
+ do { \
+ fprintf(stderr, "%s:%d [scx panic]: %s\n", __FILE__, __LINE__, \
+ strerror(errno)); \
+ fprintf(stderr, __fmt __VA_OPT__(,) __VA_ARGS__); \
+ fprintf(stderr, "\n"); \
+ \
+ exit(EXIT_FAILURE); \
+ } while (0)
+
+#define SCX_BUG_ON(__cond, __fmt, ...) \
+ do { \
+ if (__cond) \
+ SCX_BUG((__fmt) __VA_OPT__(,) __VA_ARGS__); \
+ } while (0)
+
+/**
+ * RESIZE_ARRAY - Convenience macro for resizing a BPF array
+ * @elfsec: the data section of the BPF program in which the array exists
+ * @arr: the name of the array
+ * @n: the desired array element count
+ *
+ * For BPF arrays declared with RESIZABLE_ARRAY(), this macro performs two
+ * operations. It resizes the map which corresponds to the custom data
+ * section that contains the target array. As a side effect, the BTF info for
+ * the array is adjusted so that the array length is sized to cover the new
+ * data section size. The second operation is reassigning the skeleton pointer
+ * for that custom data section so that it points to the newly memory mapped
+ * region.
+ */
+#define RESIZE_ARRAY(elfsec, arr, n) \
+ do { \
+ size_t __sz; \
+ bpf_map__set_value_size(skel->maps.elfsec##_##arr, \
+ sizeof(skel->elfsec##_##arr->arr[0]) * (n)); \
+ skel->elfsec##_##arr = \
+ bpf_map__initial_value(skel->maps.elfsec##_##arr, &__sz); \
+ } while (0)
+
+#include "user_exit_info.h"
+#include "compat.h"
+
+#endif /* __SCHED_EXT_COMMON_H */
diff --git a/tools/sched_ext/include/scx/compat.bpf.h b/tools/sched_ext/include/scx/compat.bpf.h
new file mode 100644
index 000000000000..c32a6a0f994c
--- /dev/null
+++ b/tools/sched_ext/include/scx/compat.bpf.h
@@ -0,0 +1,53 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2024 Tejun Heo <[email protected]>
+ * Copyright (c) 2024 David Vernet <[email protected]>
+ */
+#ifndef __SCX_COMPAT_BPF_H
+#define __SCX_COMPAT_BPF_H
+
+/*
+ * scx_switch_all() was replaced by %SCX_OPS_SWITCH_PARTIAL. See
+ * %__COMPAT_SCX_OPS_SWITCH_PARTIAL in compat.h. This can be dropped in the
+ * future.
+ */
+void scx_bpf_switch_all(void) __ksym __weak;
+
+static inline void __COMPAT_scx_bpf_switch_all(void)
+{
+ if (!bpf_core_enum_value_exists(enum scx_ops_flags, SCX_OPS_SWITCH_PARTIAL))
+ scx_bpf_switch_all();
+}
+
+/*
+ * scx_bpf_exit() is a new addition. Fall back to scx_bpf_error() if
+ * unavailable. Users can use scx_bpf_exit() directly in the future.
+ */
+#define __COMPAT_scx_bpf_exit(code, fmt, args...) \
+({ \
+ if (bpf_ksym_exists(scx_bpf_exit_bstr)) \
+ scx_bpf_exit((code), fmt, args); \
+ else \
+ scx_bpf_error(fmt, args); \
+})
+
+/*
+ * scx_bpf_nr_cpu_ids(), scx_bpf_get_possible/online_cpumask() are new. No good
+ * way to noop these kfuncs. Provide a test macro. Users can assume existence in
+ * the future.
+ */
+#define __COMPAT_HAS_CPUMASKS \
+ bpf_ksym_exists(scx_bpf_nr_cpu_ids)
+
+/*
+ * Define sched_ext_ops. This may be expanded to define multiple variants for
+ * backward compatibility. See compat.h::SCX_OPS_LOAD/ATTACH().
+ */
+#define SCX_OPS_DEFINE(__name, ...) \
+ SEC(".struct_ops.link") \
+ struct sched_ext_ops __name = { \
+ __VA_ARGS__, \
+ };
+
+#endif /* __SCX_COMPAT_BPF_H */
diff --git a/tools/sched_ext/include/scx/compat.h b/tools/sched_ext/include/scx/compat.h
new file mode 100644
index 000000000000..2a66f3eb87a9
--- /dev/null
+++ b/tools/sched_ext/include/scx/compat.h
@@ -0,0 +1,155 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2024 Tejun Heo <[email protected]>
+ * Copyright (c) 2024 David Vernet <[email protected]>
+ */
+#ifndef __SCX_COMPAT_H
+#define __SCX_COMPAT_H
+
+#include <bpf/btf.h>
+
+struct btf *__COMPAT_vmlinux_btf __attribute__((weak));
+
+static inline void __COMPAT_load_vmlinux_btf(void)
+{
+ if (!__COMPAT_vmlinux_btf) {
+ __COMPAT_vmlinux_btf = btf__load_vmlinux_btf();
+ SCX_BUG_ON(!__COMPAT_vmlinux_btf, "btf__load_vmlinux_btf()");
+ }
+}
+
+static inline bool __COMPAT_read_enum(const char *type, const char *name, u64 *v)
+{
+ const struct btf_type *t;
+ const char *n;
+ s32 tid;
+ int i;
+
+ __COMPAT_load_vmlinux_btf();
+
+ tid = btf__find_by_name(__COMPAT_vmlinux_btf, type);
+ if (tid < 0)
+ return false;
+
+ t = btf__type_by_id(__COMPAT_vmlinux_btf, tid);
+ SCX_BUG_ON(!t, "btf__type_by_id(%d)", tid);
+
+ if (btf_is_enum(t)) {
+ struct btf_enum *e = btf_enum(t);
+
+ for (i = 0; i < BTF_INFO_VLEN(t->info); i++) {
+ n = btf__name_by_offset(__COMPAT_vmlinux_btf, e[i].name_off);
+ SCX_BUG_ON(!n, "btf__name_by_offset()");
+ if (!strcmp(n, name)) {
+ *v = e[i].val;
+ return true;
+ }
+ }
+ } else if (btf_is_enum64(t)) {
+ struct btf_enum64 *e = btf_enum64(t);
+
+ for (i = 0; i < BTF_INFO_VLEN(t->info); i++) {
+ n = btf__name_by_offset(__COMPAT_vmlinux_btf, e[i].name_off);
+ SCX_BUG_ON(!n, "btf__name_by_offset()");
+ if (!strcmp(n, name)) {
+ *v = btf_enum64_value(&e[i]);
+ return true;
+ }
+ }
+ }
+
+ return false;
+}
+
+#define __COMPAT_ENUM_OR_ZERO(__type, __ent) \
+({ \
+ u64 __val = 0; \
+ __COMPAT_read_enum(__type, __ent, &__val); \
+ __val; \
+})
+
+static inline bool __COMPAT_has_ksym(const char *ksym)
+{
+ __COMPAT_load_vmlinux_btf();
+ return btf__find_by_name(__COMPAT_vmlinux_btf, ksym) >= 0;
+}
+
+static inline bool __COMPAT_struct_has_field(const char *type, const char *field)
+{
+ const struct btf_type *t;
+ const struct btf_member *m;
+ const char *n;
+ s32 tid;
+ int i;
+
+ __COMPAT_load_vmlinux_btf();
+ tid = btf__find_by_name_kind(__COMPAT_vmlinux_btf, type, BTF_KIND_STRUCT);
+ if (tid < 0)
+ return false;
+
+ t = btf__type_by_id(__COMPAT_vmlinux_btf, tid);
+ SCX_BUG_ON(!t, "btf__type_by_id(%d)", tid);
+
+ m = btf_members(t);
+
+ for (i = 0; i < BTF_INFO_VLEN(t->info); i++) {
+ n = btf__name_by_offset(__COMPAT_vmlinux_btf, m[i].name_off);
+ SCX_BUG_ON(!n, "btf__name_by_offset()");
+ if (!strcmp(n, field))
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * An ops flag, %SCX_OPS_SWITCH_PARTIAL, replaced scx_bpf_switch_all() which had
+ * to be called from ops.init(). To support both before and after, use both
+ * %__COMPAT_SCX_OPS_SWITCH_PARTIAL and %__COMPAT_scx_bpf_switch_all() defined
+ * in compat.bpf.h. Users can switch to directly using %SCX_OPS_SWITCH_PARTIAL
+ * in the future.
+ */
+#define __COMPAT_SCX_OPS_SWITCH_PARTIAL \
+ __COMPAT_ENUM_OR_ZERO("scx_ops_flags", "SCX_OPS_SWITCH_PARTIAL")
+
+/*
+ * scx_bpf_nr_cpu_ids(), scx_bpf_get_possible/online_cpumask() are new. Users
+ * will be able to assume existence in the future.
+ */
+#define __COMPAT_HAS_CPUMASKS \
+ __COMPAT_has_ksym("scx_bpf_nr_cpu_ids")
+
+/*
+ * struct sched_ext_ops can change over time. If compat.bpf.h::SCX_OPS_DEFINE()
+ * is used to define ops and compat.h::SCX_OPS_LOAD/ATTACH() are used to load
+ * and attach it, backward compatibility is automatically maintained where
+ * reasonable.
+ *
+ * - sched_ext_ops.tick(): Ignored on older kernels with a warning.
+ */
+#define SCX_OPS_OPEN(__ops_name, __scx_name) ({ \
+ struct __scx_name *__skel; \
+ \
+ __skel = __scx_name##__open(); \
+ SCX_BUG_ON(!__skel, "Could not open " #__scx_name); \
+ __skel; \
+})
+
+#define SCX_OPS_LOAD(__skel, __ops_name, __scx_name) ({ \
+ if (!__COMPAT_struct_has_field("sched_ext_ops", "tick") && \
+ (__skel)->struct_ops.__ops_name->tick) { \
+ fprintf(stderr, "WARNING: kernel doesn't support ops.tick()\n"); \
+ (__skel)->struct_ops.__ops_name->tick = NULL; \
+ } \
+ SCX_BUG_ON(__scx_name##__load((__skel)), "Failed to load skel"); \
+})
+
+#define SCX_OPS_ATTACH(__skel, __ops_name) ({ \
+ struct bpf_link *__link; \
+ __link = bpf_map__attach_struct_ops((__skel)->maps.__ops_name); \
+ SCX_BUG_ON(!__link, "Failed to attach struct_ops"); \
+ __link; \
+})
+
+#endif /* __SCX_COMPAT_H */
diff --git a/tools/sched_ext/include/scx/user_exit_info.h b/tools/sched_ext/include/scx/user_exit_info.h
new file mode 100644
index 000000000000..8c3b7fac4d05
--- /dev/null
+++ b/tools/sched_ext/include/scx/user_exit_info.h
@@ -0,0 +1,64 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Define struct user_exit_info which is shared between BPF and userspace parts
+ * to communicate exit status and other information.
+ *
+ * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2022 Tejun Heo <[email protected]>
+ * Copyright (c) 2022 David Vernet <[email protected]>
+ */
+#ifndef __USER_EXIT_INFO_H
+#define __USER_EXIT_INFO_H
+
+enum uei_sizes {
+ UEI_REASON_LEN = 128,
+ UEI_MSG_LEN = 1024,
+};
+
+struct user_exit_info {
+ int kind;
+ s64 exit_code;
+ char reason[UEI_REASON_LEN];
+ char msg[UEI_MSG_LEN];
+};
+
+#ifdef __bpf__
+
+#include "vmlinux.h"
+#include <bpf/bpf_core_read.h>
+
+#define UEI_DEFINE(__name) \
+ struct user_exit_info __name SEC(".data")
+
+#define UEI_RECORD(__uei_name, __ei) ({ \
+ bpf_probe_read_kernel_str(__uei_name.reason, \
+ sizeof(__uei_name.reason), (__ei)->reason); \
+ bpf_probe_read_kernel_str(__uei_name.msg, \
+ sizeof(__uei_name.msg), (__ei)->msg); \
+ if (bpf_core_field_exists((__ei)->exit_code)) \
+ __uei_name.exit_code = (__ei)->exit_code; \
+ /* use __sync to force memory barrier */ \
+ __sync_val_compare_and_swap(&__uei_name.kind, __uei_name.kind, \
+ (__ei)->kind); \
+})
+
+#else /* !__bpf__ */
+
+#include <stdio.h>
+#include <stdbool.h>
+
+#define UEI_EXITED(__skel, __uei_name) ({ \
+ /* use __sync to force memory barrier */ \
+ __sync_val_compare_and_swap(&(__skel)->data->__uei_name.kind, -1, -1); \
+})
+
+#define UEI_REPORT(__skel, __uei_name) ({ \
+ struct user_exit_info *__uei = &(__skel)->data->__uei_name; \
+ fprintf(stderr, "EXIT: %s", __uei->reason); \
+ if (__uei->msg[0] != '\0') \
+ fprintf(stderr, " (%s)", __uei->msg); \
+ fputs("\n", stderr); \
+})
+
+#endif /* __bpf__ */
+#endif /* __USER_EXIT_INFO_H */
diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
new file mode 100644
index 000000000000..ad0328bb3c6a
--- /dev/null
+++ b/tools/sched_ext/scx_qmap.bpf.c
@@ -0,0 +1,268 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * A simple five-level FIFO queue scheduler.
+ *
+ * There are five FIFOs implemented using BPF_MAP_TYPE_QUEUE. A task gets
+ * assigned to one depending on its compound weight. Each CPU round robins
+ * through the FIFOs and dispatches more from FIFOs with higher indices - 1 from
+ * queue0, 2 from queue1, 4 from queue2 and so on.
+ *
+ * This scheduler demonstrates:
+ *
+ * - BPF-side queueing using PIDs.
+ * - Sleepable per-task storage allocation using ops.init_task().
+ *
+ * This scheduler is primarily for demonstration and testing of sched_ext
+ * features and unlikely to be useful for actual workloads.
+ *
+ * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2022 Tejun Heo <[email protected]>
+ * Copyright (c) 2022 David Vernet <[email protected]>
+ */
+#include <scx/common.bpf.h>
+
+enum consts {
+ ONE_SEC_IN_NS = 1000000000,
+ SHARED_DSQ = 0,
+};
+
+char _license[] SEC("license") = "GPL";
+
+const volatile u64 slice_ns = SCX_SLICE_DFL;
+const volatile u32 dsp_batch;
+const volatile bool switch_partial;
+
+u32 test_error_cnt;
+
+UEI_DEFINE(uei);
+
+struct qmap {
+ __uint(type, BPF_MAP_TYPE_QUEUE);
+ __uint(max_entries, 4096);
+ __type(value, u32);
+} queue0 SEC(".maps"),
+ queue1 SEC(".maps"),
+ queue2 SEC(".maps"),
+ queue3 SEC(".maps"),
+ queue4 SEC(".maps");
+
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
+ __uint(max_entries, 5);
+ __type(key, int);
+ __array(values, struct qmap);
+} queue_arr SEC(".maps") = {
+ .values = {
+ [0] = &queue0,
+ [1] = &queue1,
+ [2] = &queue2,
+ [3] = &queue3,
+ [4] = &queue4,
+ },
+};
+
+/* Per-task scheduling context */
+struct task_ctx {
+ bool force_local; /* Dispatch directly to local_dsq */
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
+ __uint(map_flags, BPF_F_NO_PREALLOC);
+ __type(key, int);
+ __type(value, struct task_ctx);
+} task_ctx_stor SEC(".maps");
+
+struct cpu_ctx {
+ u64 dsp_idx; /* dispatch index */
+ u64 dsp_cnt; /* remaining count */
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+ __uint(max_entries, 1);
+ __type(key, u32);
+ __type(value, struct cpu_ctx);
+} cpu_ctx_stor SEC(".maps");
+
+/* Statistics */
+u64 nr_enqueued, nr_dispatched, nr_dequeued;
+
+s32 BPF_STRUCT_OPS(qmap_select_cpu, struct task_struct *p,
+ s32 prev_cpu, u64 wake_flags)
+{
+ struct task_ctx *tctx;
+ s32 cpu;
+
+ tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
+ if (!tctx) {
+ scx_bpf_error("task_ctx lookup failed");
+ return -ESRCH;
+ }
+
+ if (p->nr_cpus_allowed == 1 ||
+ scx_bpf_test_and_clear_cpu_idle(prev_cpu)) {
+ tctx->force_local = true;
+ return prev_cpu;
+ }
+
+ cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
+ if (cpu >= 0)
+ return cpu;
+
+ return prev_cpu;
+}
+
+static int weight_to_idx(u32 weight)
+{
+ /* Coarsely map the compound weight to a FIFO. */
+ if (weight <= 25)
+ return 0;
+ else if (weight <= 50)
+ return 1;
+ else if (weight < 200)
+ return 2;
+ else if (weight < 400)
+ return 3;
+ else
+ return 4;
+}
+
+void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
+{
+ struct task_ctx *tctx;
+ u32 pid = p->pid;
+ int idx = weight_to_idx(p->scx.weight);
+ void *ring;
+
+ if (test_error_cnt && !--test_error_cnt)
+ scx_bpf_error("test triggering error");
+
+ tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
+ if (!tctx) {
+ scx_bpf_error("task_ctx lookup failed");
+ return;
+ }
+
+ /* Is select_cpu() telling us to enqueue locally? */
+ if (tctx->force_local) {
+ tctx->force_local = false;
+ scx_bpf_dispatch(p, SCX_DSQ_LOCAL, slice_ns, enq_flags);
+ return;
+ }
+
+ ring = bpf_map_lookup_elem(&queue_arr, &idx);
+ if (!ring) {
+ scx_bpf_error("failed to find ring %d", idx);
+ return;
+ }
+
+ /* Queue on the selected FIFO. If the FIFO overflows, punt to global. */
+ if (bpf_map_push_elem(ring, &pid, 0)) {
+ scx_bpf_dispatch(p, SHARED_DSQ, slice_ns, enq_flags);
+ return;
+ }
+
+ __sync_fetch_and_add(&nr_enqueued, 1);
+}
+
+/*
+ * The BPF queue map doesn't support removal and sched_ext can handle spurious
+ * dispatches. qmap_dequeue() is only used to collect statistics.
+ */
+void BPF_STRUCT_OPS(qmap_dequeue, struct task_struct *p, u64 deq_flags)
+{
+ __sync_fetch_and_add(&nr_dequeued, 1);
+}
+
+void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
+{
+ struct task_struct *p;
+ struct cpu_ctx *cpuc;
+ u32 zero = 0, batch = dsp_batch ?: 1;
+ void *fifo;
+ s32 i, pid;
+
+ if (scx_bpf_consume(SHARED_DSQ))
+ return;
+
+ if (!(cpuc = bpf_map_lookup_elem(&cpu_ctx_stor, &zero))) {
+ scx_bpf_error("failed to look up cpu_ctx");
+ return;
+ }
+
+ for (i = 0; i < 5; i++) {
+ /* Advance the dispatch cursor and pick the fifo. */
+ if (!cpuc->dsp_cnt) {
+ cpuc->dsp_idx = (cpuc->dsp_idx + 1) % 5;
+ cpuc->dsp_cnt = 1 << cpuc->dsp_idx;
+ }
+
+ fifo = bpf_map_lookup_elem(&queue_arr, &cpuc->dsp_idx);
+ if (!fifo) {
+ scx_bpf_error("failed to find ring %llu", cpuc->dsp_idx);
+ return;
+ }
+
+ /* Dispatch or advance. */
+ bpf_repeat(BPF_MAX_LOOPS) {
+ if (bpf_map_pop_elem(fifo, &pid))
+ break;
+
+ p = bpf_task_from_pid(pid);
+ if (!p)
+ continue;
+
+ __sync_fetch_and_add(&nr_dispatched, 1);
+ scx_bpf_dispatch(p, SHARED_DSQ, slice_ns, 0);
+ bpf_task_release(p);
+ batch--;
+ cpuc->dsp_cnt--;
+ if (!batch || !scx_bpf_dispatch_nr_slots()) {
+ scx_bpf_consume(SHARED_DSQ);
+ return;
+ }
+ if (!cpuc->dsp_cnt)
+ break;
+ }
+
+ cpuc->dsp_cnt = 0;
+ }
+}
+
+s32 BPF_STRUCT_OPS(qmap_init_task, struct task_struct *p,
+ struct scx_init_task_args *args)
+{
+ /*
+ * @p is new. Let's ensure that its task_ctx is available. We can sleep
+ * in this function and the following will automatically use GFP_KERNEL.
+ */
+ if (bpf_task_storage_get(&task_ctx_stor, p, 0,
+ BPF_LOCAL_STORAGE_GET_F_CREATE))
+ return 0;
+ else
+ return -ENOMEM;
+}
+
+s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init)
+{
+ if (!switch_partial)
+ __COMPAT_scx_bpf_switch_all();
+
+ return scx_bpf_create_dsq(SHARED_DSQ, -1);
+}
+
+void BPF_STRUCT_OPS(qmap_exit, struct scx_exit_info *ei)
+{
+ UEI_RECORD(uei, ei);
+}
+
+SCX_OPS_DEFINE(qmap_ops,
+ .select_cpu = (void *)qmap_select_cpu,
+ .enqueue = (void *)qmap_enqueue,
+ .dequeue = (void *)qmap_dequeue,
+ .dispatch = (void *)qmap_dispatch,
+ .init_task = (void *)qmap_init_task,
+ .init = (void *)qmap_init,
+ .exit = (void *)qmap_exit,
+ .name = "qmap");
diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c
new file mode 100644
index 000000000000..1c0455c64f64
--- /dev/null
+++ b/tools/sched_ext/scx_qmap.c
@@ -0,0 +1,100 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2022 Tejun Heo <[email protected]>
+ * Copyright (c) 2022 David Vernet <[email protected]>
+ */
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <inttypes.h>
+#include <signal.h>
+#include <libgen.h>
+#include <bpf/bpf.h>
+#include <scx/common.h>
+#include "scx_qmap.bpf.skel.h"
+
+const char help_fmt[] =
+"A simple five-level FIFO queue sched_ext scheduler.\n"
+"\n"
+"See the top-level comment in .bpf.c for more details.\n"
+"\n"
+"Usage: %s [-s SLICE_US] [-e COUNT] [-b COUNT] [-p] [-v]\n"
+"\n"
+" -s SLICE_US Override slice duration\n"
+" -e COUNT Trigger scx_bpf_error() after COUNT enqueues\n"
+" -b COUNT Dispatch upto COUNT tasks together\n"
+" -p Switch only tasks on SCHED_EXT policy intead of all\n"
+" -v Print libbpf debug messages\n"
+" -h Display this help and exit\n";
+
+static bool verbose;
+static volatile int exit_req;
+
+static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
+{
+ if (level == LIBBPF_DEBUG && !verbose)
+ return 0;
+ return vfprintf(stderr, format, args);
+}
+
+static void sigint_handler(int dummy)
+{
+ exit_req = 1;
+}
+
+int main(int argc, char **argv)
+{
+ struct scx_qmap *skel;
+ struct bpf_link *link;
+ int opt;
+
+ libbpf_set_print(libbpf_print_fn);
+ signal(SIGINT, sigint_handler);
+ signal(SIGTERM, sigint_handler);
+
+ skel = SCX_OPS_OPEN(qmap_ops, scx_qmap);
+
+ while ((opt = getopt(argc, argv, "s:e:b:pvh")) != -1) {
+ switch (opt) {
+ case 's':
+ skel->rodata->slice_ns = strtoull(optarg, NULL, 0) * 1000;
+ break;
+ case 'e':
+ skel->bss->test_error_cnt = strtoul(optarg, NULL, 0);
+ break;
+ case 'b':
+ skel->rodata->dsp_batch = strtoul(optarg, NULL, 0);
+ break;
+ case 'p':
+ skel->rodata->switch_partial = true;
+ skel->struct_ops.qmap_ops->flags |= __COMPAT_SCX_OPS_SWITCH_PARTIAL;
+ break;
+ case 'v':
+ verbose = true;
+ break;
+ default:
+ fprintf(stderr, help_fmt, basename(argv[0]));
+ return opt != 'h';
+ }
+ }
+
+ SCX_OPS_LOAD(skel, qmap_ops, scx_qmap);
+ link = SCX_OPS_ATTACH(skel, qmap_ops);
+
+ while (!exit_req && !UEI_EXITED(skel, uei)) {
+ long nr_enqueued = skel->bss->nr_enqueued;
+ long nr_dispatched = skel->bss->nr_dispatched;
+
+ printf("stats : enq=%lu dsp=%lu delta=%ld deq=%"PRIu64"\n",
+ nr_enqueued, nr_dispatched, nr_enqueued - nr_dispatched,
+ skel->bss->nr_dequeued);
+ fflush(stdout);
+ sleep(1);
+ }
+
+ bpf_link__destroy(link);
+ UEI_REPORT(skel, uei);
+ scx_qmap__destroy(skel);
+ return 0;
+}
diff --git a/tools/sched_ext/scx_simple.bpf.c b/tools/sched_ext/scx_simple.bpf.c
new file mode 100644
index 000000000000..6bb13a3c801b
--- /dev/null
+++ b/tools/sched_ext/scx_simple.bpf.c
@@ -0,0 +1,63 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * A simple scheduler.
+ *
+ * A simple global FIFO scheduler. It also demonstrates the following niceties.
+ *
+ * - Statistics tracking how many tasks are queued to local and global dsq's.
+ * - Termination notification for userspace.
+ *
+ * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2022 Tejun Heo <[email protected]>
+ * Copyright (c) 2022 David Vernet <[email protected]>
+ */
+#include <scx/common.bpf.h>
+
+char _license[] SEC("license") = "GPL";
+
+UEI_DEFINE(uei);
+
+struct {
+ __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+ __uint(key_size, sizeof(u32));
+ __uint(value_size, sizeof(u64));
+ __uint(max_entries, 2); /* [local, global] */
+} stats SEC(".maps");
+
+static void stat_inc(u32 idx)
+{
+ u64 *cnt_p = bpf_map_lookup_elem(&stats, &idx);
+ if (cnt_p)
+ (*cnt_p)++;
+}
+
+s32 BPF_STRUCT_OPS(simple_select_cpu, struct task_struct *p, s32 prev_cpu, u64 wake_flags)
+{
+ bool is_idle = false;
+ s32 cpu;
+
+ cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);
+ if (is_idle) {
+ stat_inc(0); /* count local queueing */
+ scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
+ }
+
+ return cpu;
+}
+
+void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags)
+{
+ stat_inc(1); /* count global queueing */
+ scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
+}
+
+void BPF_STRUCT_OPS(simple_exit, struct scx_exit_info *ei)
+{
+ UEI_RECORD(uei, ei);
+}
+
+SCX_OPS_DEFINE(simple_ops,
+ .select_cpu = (void *)simple_select_cpu,
+ .enqueue = (void *)simple_enqueue,
+ .exit = (void *)simple_exit,
+ .name = "simple");
diff --git a/tools/sched_ext/scx_simple.c b/tools/sched_ext/scx_simple.c
new file mode 100644
index 000000000000..08c741f56685
--- /dev/null
+++ b/tools/sched_ext/scx_simple.c
@@ -0,0 +1,99 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2022 Tejun Heo <[email protected]>
+ * Copyright (c) 2022 David Vernet <[email protected]>
+ */
+#include <stdio.h>
+#include <unistd.h>
+#include <signal.h>
+#include <libgen.h>
+#include <bpf/bpf.h>
+#include <scx/common.h>
+#include "scx_simple.bpf.skel.h"
+
+const char help_fmt[] =
+"A simple sched_ext scheduler.\n"
+"\n"
+"See the top-level comment in .bpf.c for more details.\n"
+"\n"
+"Usage: %s [-v]\n"
+"\n"
+" -v Print libbpf debug messages\n"
+" -h Display this help and exit\n";
+
+static bool verbose;
+static volatile int exit_req;
+
+static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
+{
+ if (level == LIBBPF_DEBUG && !verbose)
+ return 0;
+ return vfprintf(stderr, format, args);
+}
+
+static void sigint_handler(int simple)
+{
+ exit_req = 1;
+}
+
+static void read_stats(struct scx_simple *skel, __u64 *stats)
+{
+ int nr_cpus = libbpf_num_possible_cpus();
+ __u64 cnts[2][nr_cpus];
+ __u32 idx;
+
+ memset(stats, 0, sizeof(stats[0]) * 2);
+
+ for (idx = 0; idx < 2; idx++) {
+ int ret, cpu;
+
+ ret = bpf_map_lookup_elem(bpf_map__fd(skel->maps.stats),
+ &idx, cnts[idx]);
+ if (ret < 0)
+ continue;
+ for (cpu = 0; cpu < nr_cpus; cpu++)
+ stats[idx] += cnts[idx][cpu];
+ }
+}
+
+int main(int argc, char **argv)
+{
+ struct scx_simple *skel;
+ struct bpf_link *link;
+ __u32 opt;
+
+ libbpf_set_print(libbpf_print_fn);
+ signal(SIGINT, sigint_handler);
+ signal(SIGTERM, sigint_handler);
+
+ skel = SCX_OPS_OPEN(simple_ops, scx_simple);
+
+ while ((opt = getopt(argc, argv, "vh")) != -1) {
+ switch (opt) {
+ case 'v':
+ verbose = true;
+ break;
+ default:
+ fprintf(stderr, help_fmt, basename(argv[0]));
+ return opt != 'h';
+ }
+ }
+
+ SCX_OPS_LOAD(skel, simple_ops, scx_simple);
+ link = SCX_OPS_ATTACH(skel, simple_ops);
+
+ while (!exit_req && !UEI_EXITED(skel, uei)) {
+ __u64 stats[2];
+
+ read_stats(skel, stats);
+ printf("local=%llu global=%llu\n", stats[0], stats[1]);
+ fflush(stdout);
+ sleep(1);
+ }
+
+ bpf_link__destroy(link);
+ UEI_REPORT(skel, uei);
+ scx_simple__destroy(skel);
+ return 0;
+}
--
2.44.0


2024-05-01 15:18:13

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 21/39] tools/sched_ext: Add scx_show_state.py

There are states which are interesting but don't quite fit the interface
exposed under /sys/kernel/sched_ext. Add tools/sched_ext/scx_show_state.py to
show them.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
---
tools/sched_ext/scx_show_state.py | 39 +++++++++++++++++++++++++++++++
1 file changed, 39 insertions(+)
create mode 100644 tools/sched_ext/scx_show_state.py

diff --git a/tools/sched_ext/scx_show_state.py b/tools/sched_ext/scx_show_state.py
new file mode 100644
index 000000000000..d457d2a74e1e
--- /dev/null
+++ b/tools/sched_ext/scx_show_state.py
@@ -0,0 +1,39 @@
+#!/usr/bin/env drgn
+#
+# Copyright (C) 2024 Tejun Heo <[email protected]>
+# Copyright (C) 2024 Meta Platforms, Inc. and affiliates.
+
+desc = """
+This is a drgn script to show the current sched_ext state.
+For more info on drgn, visit https://github.com/osandov/drgn.
+"""
+
+import drgn
+import sys
+
+def err(s):
+ print(s, file=sys.stderr, flush=True)
+ sys.exit(1)
+
+def read_int(name):
+ return int(prog[name].value_())
+
+def read_atomic(name):
+ return prog[name].counter.value_()
+
+def read_static_key(name):
+ return prog[name].key.enabled.counter.value_()
+
+def ops_state_str(state):
+ return prog['scx_ops_enable_state_str'][state].string_().decode()
+
+ops = prog['scx_ops']
+enable_state = read_atomic("scx_ops_enable_state_var")
+
+print(f'ops : {ops.name.string_().decode()}')
+print(f'enabled : {read_static_key("__scx_ops_enabled")}')
+print(f'switching_all : {read_int("scx_switching_all")}')
+print(f'switched_all : {read_static_key("__scx_switched_all")}')
+print(f'enable_state : {ops_state_str(enable_state)} ({enable_state})')
+print(f'bypass_depth : {read_atomic("scx_ops_bypass_depth")}')
+print(f'nr_rejected : {read_atomic("scx_nr_rejected")}')
--
2.44.0


2024-05-01 15:18:44

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 20/39] sched_ext: Print debug dump after an error exit

If a BPF scheduler triggers an error, the scheduler is aborted and the
system is reverted to the built-in scheduler. In the process, a lot of
information which may be useful for figuring out what happened can be lost.

This patch adds a debug dump which captures information that may be useful
for debugging, including the runqueue and runnable thread states at the time
of failure. The following shows a debug dump after triggering the watchdog:

# os/work/tools/sched_ext/build/bin/scx_qmap -t 100
enq=0, dsp=0, delta=0, deq=0
enq=21, dsp=21, delta=0, deq=0
enq=96, dsp=96, delta=0, deq=1

DEBUG DUMP
================================================================================

kworker/u16:0[11] triggered exit kind 1026:
runnable task stall (scx_qmap[1524] failed to run for 5.659s)

Backtrace:
scx_watchdog_workfn+0x138/0x1c0
process_scheduled_works+0x245/0x4e0
worker_thread+0x270/0x360
kthread+0xeb/0x110
ret_from_fork+0x36/0x40
ret_from_fork_asm+0x11/0x20

Runqueue states
---------------

CPU 2 : nr_run=1 ops_qseq=34
curr=kworker/u16:0[11] class=ext_sched_class

*R kworker/u16:0[11] +0ms
scx_state/flags=3/0xd ops_state/qseq=0/0
sticky/holding_cpu=-1/-1 dsq_id=(n/a)
cpus=ff

scx_ops_error_irq_workfn+0x25d/0x340
irq_work_run_list+0x7d/0xc0
irq_work_run+0x18/0x30
__sysvec_irq_work+0x38/0x100
sysvec_irq_work+0x69/0x80
asm_sysvec_irq_work+0x1b/0x20
scx_watchdog_workfn+0x15d/0x1c0
process_scheduled_works+0x245/0x4e0
worker_thread+0x270/0x360
kthread+0xeb/0x110
ret_from_fork+0x36/0x40
ret_from_fork_asm+0x11/0x20

CPU 7 : nr_run=1 ops_qseq=11
curr=swapper/7[0] class=idle_sched_class

R scx_qmap[1524] -5659ms
scx_state/flags=3/0x9 ops_state/qseq=2/3
sticky/holding_cpu=-1/-1 dsq_id=(n/a)
cpus=ff

common_nsleep+0x34/0x50
__x64_sys_clock_nanosleep+0xd9/0x120
do_syscall_64+0x7e/0x150
entry_SYSCALL_64_after_hwframe+0x46/0x4e

================================================================================

EXIT: runnable task stall (scx_qmap[1524] failed to run for 5.659s)

It shows that CPU 2 was running the watchdog when it triggered the error
condition and the scx_qmap thread has been queued on CPU 7 for over 5
seconds but failed to run. This dump has proved pretty useful for developing
and debugging BPF schedulers.

Currently, the dump buffer defaults to a fixed 32k (configurable through
ops.exit_dump_len) and there is no way for the BPF scheduler to add its own
information to the dump. These will be improved in the future.
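
For illustration, a userspace loader can request a larger buffer by setting
ops.exit_dump_len before load (a sketch mirroring the -D option added to
scx_qmap below; the skeleton and ops names are scx_qmap's):

  /* sketch: request a 1MB debug dump buffer; 0 keeps the 32k default */
  skel->struct_ops.qmap_ops->exit_dump_len = 1024 * 1024;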

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
---
kernel/sched/ext.c | 122 ++++++++++++++++++-
tools/sched_ext/include/scx/compat.h | 11 +-
tools/sched_ext/include/scx/user_exit_info.h | 19 +++
tools/sched_ext/scx_qmap.c | 10 +-
tools/sched_ext/scx_simple.c | 2 +-
5 files changed, 155 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index ff080b5f0330..4ffa42e5d7dd 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -12,6 +12,7 @@ enum scx_consts {

SCX_EXIT_BT_LEN = 64,
SCX_EXIT_MSG_LEN = 1024,
+ SCX_EXIT_DUMP_DFL_LEN = 32768,
};

enum scx_exit_kind {
@@ -48,6 +49,9 @@ struct scx_exit_info {

/* informational message */
char *msg;
+
+ /* debug dump */
+ char *dump;
};

/* sched_ext_ops.flags */
@@ -330,6 +334,12 @@ struct sched_ext_ops {
*/
u32 timeout_ms;

+ /**
+ * exit_dump_len - scx_exit_info.dump buffer length. If 0, the default
+ * value of 32768 is used.
+ */
+ u32 exit_dump_len;
+
/**
* name - BPF scheduler's name
*
@@ -2888,12 +2898,13 @@ static void scx_ops_bypass(bool bypass)

static void free_exit_info(struct scx_exit_info *ei)
{
+ kfree(ei->dump);
kfree(ei->msg);
kfree(ei->bt);
kfree(ei);
}

-static struct scx_exit_info *alloc_exit_info(void)
+static struct scx_exit_info *alloc_exit_info(size_t exit_dump_len)
{
struct scx_exit_info *ei;

@@ -2903,8 +2914,9 @@ static struct scx_exit_info *alloc_exit_info(void)

ei->bt = kcalloc(sizeof(ei->bt[0]), SCX_EXIT_BT_LEN, GFP_KERNEL);
ei->msg = kzalloc(SCX_EXIT_MSG_LEN, GFP_KERNEL);
+ ei->dump = kzalloc(exit_dump_len, GFP_KERNEL);

- if (!ei->bt || !ei->msg) {
+ if (!ei->bt || !ei->msg || !ei->dump) {
free_exit_info(ei);
return NULL;
}
@@ -3104,8 +3116,101 @@ static void scx_ops_disable(enum scx_exit_kind kind)
schedule_scx_ops_disable_work();
}

+static void scx_dump_task(struct seq_buf *s, struct task_struct *p, char marker,
+ unsigned long now)
+{
+ static unsigned long bt[SCX_EXIT_BT_LEN];
+ char dsq_id_buf[19] = "(n/a)";
+ unsigned long ops_state = atomic_long_read(&p->scx.ops_state);
+ unsigned int bt_len;
+ size_t avail, used;
+ char *buf;
+
+ if (p->scx.dsq)
+ scnprintf(dsq_id_buf, sizeof(dsq_id_buf), "0x%llx",
+ (unsigned long long)p->scx.dsq->id);
+
+ seq_buf_printf(s, "\n %c%c %s[%d] %+ldms\n",
+ marker, task_state_to_char(p), p->comm, p->pid,
+ jiffies_delta_msecs(p->scx.runnable_at, now));
+ seq_buf_printf(s, " scx_state/flags=%u/0x%x ops_state/qseq=%lu/%lu\n",
+ scx_get_task_state(p),
+ p->scx.flags & ~SCX_TASK_STATE_MASK,
+ ops_state & SCX_OPSS_STATE_MASK,
+ ops_state >> SCX_OPSS_QSEQ_SHIFT);
+ seq_buf_printf(s, " sticky/holding_cpu=%d/%d dsq_id=%s\n",
+ p->scx.sticky_cpu, p->scx.holding_cpu, dsq_id_buf);
+ seq_buf_printf(s, " cpus=%*pb\n\n", cpumask_pr_args(p->cpus_ptr));
+
+ bt_len = stack_trace_save_tsk(p, bt, SCX_EXIT_BT_LEN, 1);
+
+ avail = seq_buf_get_buf(s, &buf);
+ used = stack_trace_snprint(buf, avail, bt, bt_len, 3);
+ seq_buf_commit(s, used < avail ? used : -1);
+}
+
+static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len)
+{
+ const char trunc_marker[] = "\n\n~~~~ TRUNCATED ~~~~\n";
+ unsigned long now = jiffies;
+ struct seq_buf s;
+ size_t avail, used;
+ char *buf;
+ int cpu;
+
+ if (dump_len <= sizeof(trunc_marker))
+ return;
+
+ seq_buf_init(&s, ei->dump, dump_len - sizeof(trunc_marker));
+
+ seq_buf_printf(&s, "%s[%d] triggered exit kind %d:\n %s (%s)\n\n",
+ current->comm, current->pid, ei->kind, ei->reason, ei->msg);
+ seq_buf_printf(&s, "Backtrace:\n");
+ avail = seq_buf_get_buf(&s, &buf);
+ used = stack_trace_snprint(buf, avail, ei->bt, ei->bt_len, 1);
+ seq_buf_commit(&s, used < avail ? used : -1);
+
+ seq_buf_printf(&s, "\nRunqueue states\n");
+ seq_buf_printf(&s, "---------------\n");
+
+ for_each_possible_cpu(cpu) {
+ struct rq *rq = cpu_rq(cpu);
+ struct rq_flags rf;
+ struct task_struct *p;
+
+ rq_lock(rq, &rf);
+
+ if (list_empty(&rq->scx.runnable_list) &&
+ rq->curr->sched_class == &idle_sched_class)
+ goto next;
+
+ seq_buf_printf(&s, "\nCPU %-4d: nr_run=%u ops_qseq=%lu\n",
+ cpu, rq->scx.nr_running, rq->scx.ops_qseq);
+ seq_buf_printf(&s, " curr=%s[%d] class=%ps\n",
+ rq->curr->comm, rq->curr->pid,
+ rq->curr->sched_class);
+
+ if (rq->curr->sched_class == &ext_sched_class)
+ scx_dump_task(&s, rq->curr, '*', now);
+
+ list_for_each_entry(p, &rq->scx.runnable_list, scx.runnable_node)
+ scx_dump_task(&s, p, ' ', now);
+ next:
+ rq_unlock(rq, &rf);
+ }
+
+ if (seq_buf_has_overflowed(&s))
+ memcpy(ei->dump + seq_buf_used(&s) - 1, trunc_marker,
+ sizeof(trunc_marker));
+}
+
static void scx_ops_error_irq_workfn(struct irq_work *irq_work)
{
+ struct scx_exit_info *ei = scx_exit_info;
+
+ if (ei->kind >= SCX_EXIT_ERROR)
+ scx_dump_state(ei, scx_ops.exit_dump_len);
+
schedule_scx_ops_disable_work();
}

@@ -3131,6 +3236,13 @@ static __printf(3, 4) void scx_ops_exit_kind(enum scx_exit_kind kind,
vscnprintf(ei->msg, SCX_EXIT_MSG_LEN, fmt, args);
va_end(args);

+ /*
+ * Set ei->kind and ->reason for scx_dump_state(). They'll be set again
+ * in scx_ops_disable_workfn().
+ */
+ ei->kind = kind;
+ ei->reason = scx_exit_reason(ei->kind);
+
irq_work_queue(&scx_ops_error_irq_work);
}

@@ -3192,7 +3304,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops)
if (ret < 0)
goto err;

- scx_exit_info = alloc_exit_info();
+ scx_exit_info = alloc_exit_info(ops->exit_dump_len);
if (!scx_exit_info) {
ret = -ENOMEM;
goto err_del;
@@ -3572,6 +3684,10 @@ static int bpf_scx_init_member(const struct btf_type *t,
return -E2BIG;
ops->timeout_ms = *(u32 *)(udata + moff);
return 1;
+ case offsetof(struct sched_ext_ops, exit_dump_len):
+ ops->exit_dump_len =
+ *(u32 *)(udata + moff) ?: SCX_EXIT_DUMP_DFL_LEN;
+ return 1;
}

return 0;
diff --git a/tools/sched_ext/include/scx/compat.h b/tools/sched_ext/include/scx/compat.h
index 2a66f3eb87a9..2be79bd88a25 100644
--- a/tools/sched_ext/include/scx/compat.h
+++ b/tools/sched_ext/include/scx/compat.h
@@ -126,7 +126,8 @@ static inline bool __COMPAT_struct_has_field(const char *type, const char *field
* and attach it, backward compatibility is automatically maintained where
* reasonable.
*
- * - sched_ext_ops.tick(): Ignored on older kernels with a warning.
+ * - ops.tick(): Ignored on older kernels with a warning.
+ * - ops.exit_dump_len: Cleared to zero on older kernels with a warning.
*/
#define SCX_OPS_OPEN(__ops_name, __scx_name) ({ \
struct __scx_name *__skel; \
@@ -136,7 +137,13 @@ static inline bool __COMPAT_struct_has_field(const char *type, const char *field
__skel; \
})

-#define SCX_OPS_LOAD(__skel, __ops_name, __scx_name) ({ \
+#define SCX_OPS_LOAD(__skel, __ops_name, __scx_name, __uei_name) ({ \
+ UEI_SET_SIZE(__skel, __ops_name, __uei_name); \
+ if (!__COMPAT_struct_has_field("sched_ext_ops", "exit_dump_len") && \
+ (__skel)->struct_ops.__ops_name->exit_dump_len) { \
+ fprintf(stderr, "WARNING: kernel doesn't support setting exit dump len\n"); \
+ (__skel)->struct_ops.__ops_name->exit_dump_len = 0; \
+ } \
if (!__COMPAT_struct_has_field("sched_ext_ops", "tick") && \
(__skel)->struct_ops.__ops_name->tick) { \
fprintf(stderr, "WARNING: kernel doesn't support ops.tick()\n"); \
diff --git a/tools/sched_ext/include/scx/user_exit_info.h b/tools/sched_ext/include/scx/user_exit_info.h
index 8c3b7fac4d05..cf4293cb250e 100644
--- a/tools/sched_ext/include/scx/user_exit_info.h
+++ b/tools/sched_ext/include/scx/user_exit_info.h
@@ -13,6 +13,7 @@
enum uei_sizes {
UEI_REASON_LEN = 128,
UEI_MSG_LEN = 1024,
+ UEI_DUMP_DFL_LEN = 32768,
};

struct user_exit_info {
@@ -28,6 +29,8 @@ struct user_exit_info {
#include <bpf/bpf_core_read.h>

#define UEI_DEFINE(__name) \
+ char RESIZABLE_ARRAY(data, __name##_dump); \
+ const volatile u32 __name##_dump_len; \
struct user_exit_info __name SEC(".data")

#define UEI_RECORD(__uei_name, __ei) ({ \
@@ -35,6 +38,8 @@ struct user_exit_info {
sizeof(__uei_name.reason), (__ei)->reason); \
bpf_probe_read_kernel_str(__uei_name.msg, \
sizeof(__uei_name.msg), (__ei)->msg); \
+ bpf_probe_read_kernel_str(__uei_name##_dump, \
+ __uei_name##_dump_len, (__ei)->dump); \
if (bpf_core_field_exists((__ei)->exit_code)) \
__uei_name.exit_code = (__ei)->exit_code; \
/* use __sync to force memory barrier */ \
@@ -47,6 +52,13 @@ struct user_exit_info {
#include <stdio.h>
#include <stdbool.h>

+/* no need to call the following explicitly if SCX_OPS_LOAD() is used */
+#define UEI_SET_SIZE(__skel, __ops_name, __uei_name) ({ \
+ u32 __len = (__skel)->struct_ops.__ops_name->exit_dump_len ?: UEI_DUMP_DFL_LEN; \
+ (__skel)->rodata->__uei_name##_dump_len = __len; \
+ RESIZE_ARRAY(data, __uei_name##_dump, __len); \
+})
+
#define UEI_EXITED(__skel, __uei_name) ({ \
/* use __sync to force memory barrier */ \
__sync_val_compare_and_swap(&(__skel)->data->__uei_name.kind, -1, -1); \
@@ -54,6 +66,13 @@ struct user_exit_info {

#define UEI_REPORT(__skel, __uei_name) ({ \
struct user_exit_info *__uei = &(__skel)->data->__uei_name; \
+ char *__uei_dump = (__skel)->data_##__uei_name##_dump->__uei_name##_dump; \
+ if (__uei_dump[0] != '\0') { \
+ fputs("\nDEBUG DUMP\n", stderr); \
+ fputs("================================================================================\n\n", stderr); \
+ fputs(__uei_dump, stderr); \
+ fputs("\n================================================================================\n\n", stderr); \
+ } \
fprintf(stderr, "EXIT: %s", __uei->reason); \
if (__uei->msg[0] != '\0') \
fprintf(stderr, " (%s)", __uei->msg); \
diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c
index d2b98ef3ead2..28fd5aa4e62c 100644
--- a/tools/sched_ext/scx_qmap.c
+++ b/tools/sched_ext/scx_qmap.c
@@ -20,7 +20,7 @@ const char help_fmt[] =
"See the top-level comment in .bpf.c for more details.\n"
"\n"
"Usage: %s [-s SLICE_US] [-e COUNT] [-t COUNT] [-T COUNT] [-b COUNT]\n"
-" [-d PID] [-p] [-v]\n"
+" [-d PID] [-D LEN] [-p] [-v]\n"
"\n"
" -s SLICE_US Override slice duration\n"
" -e COUNT Trigger scx_bpf_error() after COUNT enqueues\n"
@@ -28,6 +28,7 @@ const char help_fmt[] =
" -T COUNT Stall every COUNT'th kernel thread\n"
" -b COUNT Dispatch upto COUNT tasks together\n"
" -d PID Disallow a process from switching into SCHED_EXT (-1 for self)\n"
+" -D LEN Set scx_exit_info.dump buffer length\n"
" -p Switch only tasks on SCHED_EXT policy intead of all\n"
" -v Print libbpf debug messages\n"
" -h Display this help and exit\n";
@@ -59,7 +60,7 @@ int main(int argc, char **argv)

skel = SCX_OPS_OPEN(qmap_ops, scx_qmap);

- while ((opt = getopt(argc, argv, "s:e:t:T:b:d:pvh")) != -1) {
+ while ((opt = getopt(argc, argv, "s:e:t:T:b:d:D:pvh")) != -1) {
switch (opt) {
case 's':
skel->rodata->slice_ns = strtoull(optarg, NULL, 0) * 1000;
@@ -81,6 +82,9 @@ int main(int argc, char **argv)
if (skel->rodata->disallow_tgid < 0)
skel->rodata->disallow_tgid = getpid();
break;
+ case 'D':
+ skel->struct_ops.qmap_ops->exit_dump_len = strtoul(optarg, NULL, 0);
+ break;
case 'p':
skel->rodata->switch_partial = true;
skel->struct_ops.qmap_ops->flags |= __COMPAT_SCX_OPS_SWITCH_PARTIAL;
@@ -94,7 +98,7 @@ int main(int argc, char **argv)
}
}

- SCX_OPS_LOAD(skel, qmap_ops, scx_qmap);
+ SCX_OPS_LOAD(skel, qmap_ops, scx_qmap, uei);
link = SCX_OPS_ATTACH(skel, qmap_ops);

while (!exit_req && !UEI_EXITED(skel, uei)) {
diff --git a/tools/sched_ext/scx_simple.c b/tools/sched_ext/scx_simple.c
index 08c741f56685..9ffa8d084228 100644
--- a/tools/sched_ext/scx_simple.c
+++ b/tools/sched_ext/scx_simple.c
@@ -80,7 +80,7 @@ int main(int argc, char **argv)
}
}

- SCX_OPS_LOAD(skel, simple_ops, scx_simple);
+ SCX_OPS_LOAD(skel, simple_ops, scx_simple, uei);
link = SCX_OPS_ATTACH(skel, simple_ops);

while (!exit_req && !UEI_EXITED(skel, uei)) {
--
2.44.0


2024-05-01 15:18:44

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 22/39] sched_ext: Implement scx_bpf_kick_cpu() and task preemption support

It's often useful to wake up and/or trigger a reschedule on other CPUs. This
patch adds the scx_bpf_kick_cpu() kfunc helper that a BPF scheduler can call
to kick the target CPU into the scheduling path.

As a sched_ext task relinquishes its CPU only after its slice is depleted,
this patch also adds SCX_KICK_PREEMPT and SCX_ENQ_PREEMPT which clear the
slice of the target CPU's current task to guarantee that sched_ext's
scheduling path runs on the CPU.
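
For example, an enqueue path can preempt whatever is running on the local CPU
by dispatching with %SCX_ENQ_PREEMPT, which implies %SCX_ENQ_HEAD (a sketch
using the flags added below):

  /* sketch: queue @p at the head of the local DSQ and preempt the current task */
  scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, enq_flags | SCX_ENQ_PREEMPT);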

If SCX_KICK_IDLE is specified, the target CPU is kicked iff the CPU is idle
to guarantee that the target CPU will go through at least one full sched_ext
scheduling cycle after the kicking. This can be used to wake up idle CPUs
without incurring unnecessary overhead when the target CPU turns out not to
be idle.
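
Both kick modes go through the same kfunc (a sketch; the CPU numbers are
arbitrary):

  /* force CPU 3 through the dispatch path right away */
  scx_bpf_kick_cpu(3, SCX_KICK_PREEMPT);

  /* kick CPU 5 only if it's idle; noop if it'll reschedule anyway */
  scx_bpf_kick_cpu(5, SCX_KICK_IDLE);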

As a demonstration of how backward compatibility can be supported using BPF
CO-RE, tools/sched_ext/include/scx/compat.bpf.h is added. It provides
__COMPAT_SCX_KICK_IDLE, which evaluates to SCX_KICK_IDLE if available and to
0 (a regular kick) otherwise. This allows schedulers to use the new
SCX_KICK_IDLE while maintaining support for older kernels. The plan is to
temporarily use compat helpers to ease API updates and drop them after a few
kernel releases.
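
For instance, a scheduler targeting both old and new kernels can kick idle
CPUs with the following (sketch; the macro is defined in compat.bpf.h below):

  /* SCX_KICK_IDLE on kernels that have it, a regular kick (0) otherwise */
  scx_bpf_kick_cpu(cpu, __COMPAT_SCX_KICK_IDLE);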

v5: - SCX_KICK_IDLE added. Note that this also adds a compat mechanism for
schedulers so that they can support kernels without SCX_KICK_IDLE.
This is useful as a demonstration of how new feature flags can be
added in a backward compatible way.

- kick_cpus_irq_workfn() reimplemented so that it touches the pending
cpumasks only as necessary to reduce kicking overhead on machines with
a lot of CPUs.

- tools/sched_ext/include/scx/compat.bpf.h added.

v4: - Move example scheduler to its own patch.

v3: - Make scx_example_central switch all tasks by default.

- Convert to BPF inline iterators.

v2: - Julia Lawall reported that scx_example_central can overflow the
dispatch buffer and malfunction. As scheduling for other CPUs can't be
handled by the automatic retry mechanism, fix by implementing an
explicit overflow and retry handling.

- Updated to use generic BPF cpumask helpers.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
include/linux/sched/ext.h | 4 +
kernel/sched/ext.c | 225 +++++++++++++++++++++--
kernel/sched/sched.h | 10 +
tools/sched_ext/include/scx/common.bpf.h | 1 +
tools/sched_ext/include/scx/compat.bpf.h | 16 ++
5 files changed, 243 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index 123d6dffdf26..4be270d02b98 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -134,6 +134,10 @@ struct sched_ext_entity {
* scx_bpf_dispatch() but can also be modified directly by the BPF
* scheduler. Automatically decreased by SCX as the task executes. On
* depletion, a scheduling event is triggered.
+ *
+ * This value is cleared to zero if the task is preempted by
+ * %SCX_KICK_PREEMPT and shouldn't be used to determine how long the
+ * task ran. Use p->se.sum_exec_runtime instead.
*/
u64 slice;

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 4ffa42e5d7dd..26c6a0b1e909 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -371,6 +371,14 @@ enum scx_enq_flags {

/* high 32bits are SCX specific */

+ /*
+ * Set the following to trigger preemption when calling
+ * scx_bpf_dispatch() with a local dsq as the target. The slice of the
+ * current task is cleared to zero and the CPU is kicked into the
+ * scheduling path. Implies %SCX_ENQ_HEAD.
+ */
+ SCX_ENQ_PREEMPT = 1LLU << 32,
+
/*
* The task being enqueued is the only task available for the cpu. By
* default, ext core keeps executing such tasks but when
@@ -400,6 +408,24 @@ enum scx_pick_idle_cpu_flags {
SCX_PICK_IDLE_CORE = 1LLU << 0, /* pick a CPU whose SMT siblings are also idle */
};

+enum scx_kick_flags {
+ /*
+ * Kick the target CPU if idle. Guarantees that the target CPU goes
+ * through at least one full scheduling cycle before going idle. If the
+ * target CPU can be determined to be currently not idle and going to go
+ * through a scheduling cycle before going idle, noop.
+ */
+ SCX_KICK_IDLE = 1LLU << 0,
+
+ /*
+ * Preempt the current task and execute the dispatch path. If the
+ * current task of the target CPU is an SCX task, its ->scx.slice is
+ * cleared to zero before the scheduling path is invoked so that the
+ * task expires and the dispatch path is invoked.
+ */
+ SCX_KICK_PREEMPT = 1LLU << 1,
+};
+
enum scx_ops_enable_state {
SCX_OPS_PREPPING,
SCX_OPS_ENABLING,
@@ -944,7 +970,7 @@ static void dispatch_enqueue(struct scx_dispatch_q *dsq, struct task_struct *p,
}
}

- if (enq_flags & SCX_ENQ_HEAD)
+ if (enq_flags & (SCX_ENQ_HEAD | SCX_ENQ_PREEMPT))
list_add(&p->scx.dsq_node, &dsq->list);
else
list_add_tail(&p->scx.dsq_node, &dsq->list);
@@ -970,8 +996,16 @@ static void dispatch_enqueue(struct scx_dispatch_q *dsq, struct task_struct *p,

if (is_local) {
struct rq *rq = container_of(dsq, struct rq, scx.local_dsq);
+ bool preempt = false;
+
+ if ((enq_flags & SCX_ENQ_PREEMPT) && p != rq->curr &&
+ rq->curr->sched_class == &ext_sched_class) {
+ rq->curr->scx.slice = 0;
+ preempt = true;
+ }

- if (sched_class_above(&ext_sched_class, rq->curr->sched_class))
+ if (preempt || sched_class_above(&ext_sched_class,
+ rq->curr->sched_class))
resched_curr(rq);
} else {
raw_spin_unlock(&dsq->lock);
@@ -1806,8 +1840,10 @@ static int balance_scx(struct rq *rq, struct task_struct *prev,
struct scx_rq *scx_rq = &rq->scx;
struct scx_dsp_ctx *dspc = this_cpu_ptr(&scx_dsp_ctx);
bool prev_on_scx = prev->sched_class == &ext_sched_class;
+ bool has_tasks = false;

lockdep_assert_rq_held(rq);
+ scx_rq->flags |= SCX_RQ_BALANCING;

if (prev_on_scx) {
WARN_ON_ONCE(prev->scx.flags & SCX_TASK_BAL_KEEP);
@@ -1824,19 +1860,19 @@ static int balance_scx(struct rq *rq, struct task_struct *prev,
if ((prev->scx.flags & SCX_TASK_QUEUED) &&
prev->scx.slice && !scx_ops_bypassing()) {
prev->scx.flags |= SCX_TASK_BAL_KEEP;
- return 1;
+ goto has_tasks;
}
}

/* if there already are tasks to run, nothing to do */
if (scx_rq->local_dsq.nr)
- return 1;
+ goto has_tasks;

if (consume_dispatch_q(rq, rf, &scx_dsq_global))
- return 1;
+ goto has_tasks;

if (!SCX_HAS_OP(dispatch) || scx_ops_bypassing())
- return 0;
+ goto out;

dspc->rq = rq;
dspc->rf = rf;
@@ -1857,12 +1893,18 @@ static int balance_scx(struct rq *rq, struct task_struct *prev,
flush_dispatch_buf(rq, rf);

if (scx_rq->local_dsq.nr)
- return 1;
+ goto has_tasks;
if (consume_dispatch_q(rq, rf, &scx_dsq_global))
- return 1;
+ goto has_tasks;
} while (dspc->nr_tasks);

- return 0;
+ goto out;
+
+has_tasks:
+ has_tasks = true;
+out:
+ scx_rq->flags &= ~SCX_RQ_BALANCING;
+ return has_tasks;
}

static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first)
@@ -2600,7 +2642,8 @@ int scx_check_setscheduler(struct task_struct *p, int policy)
* Omitted operations:
*
* - wakeup_preempt: NOOP as it isn't useful in the wakeup path because the task
- * isn't tied to the CPU at that point.
+ * isn't tied to the CPU at that point. Preemption is implemented by resetting
+ * the victim task's slice to 0 and triggering reschedule on the target CPU.
*
* - migrate_task_rq: Unnecessary as task to cpu mapping is transient.
*
@@ -2836,6 +2879,9 @@ bool task_should_scx(struct task_struct *p)
* of the queue.
*
* d. pick_next_task() suppresses zero slice warning.
+ *
+ * e. scx_bpf_kick_cpu() is disabled to avoid irq_work malfunction during PM
+ * operations.
*/
static void scx_ops_bypass(bool bypass)
{
@@ -3184,11 +3230,21 @@ static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len)
rq->curr->sched_class == &idle_sched_class)
goto next;

- seq_buf_printf(&s, "\nCPU %-4d: nr_run=%u ops_qseq=%lu\n",
- cpu, rq->scx.nr_running, rq->scx.ops_qseq);
+ seq_buf_printf(&s, "\nCPU %-4d: nr_run=%u flags=0x%x ops_qseq=%lu\n",
+ cpu, rq->scx.nr_running, rq->scx.flags,
+ rq->scx.ops_qseq);
seq_buf_printf(&s, " curr=%s[%d] class=%ps\n",
rq->curr->comm, rq->curr->pid,
rq->curr->sched_class);
+ if (!cpumask_empty(rq->scx.cpus_to_kick))
+ seq_buf_printf(&s, " cpus_to_kick : %*pb\n",
+ cpumask_pr_args(rq->scx.cpus_to_kick));
+ if (!cpumask_empty(rq->scx.cpus_to_kick_if_idle))
+ seq_buf_printf(&s, " idle_to_kick : %*pb\n",
+ cpumask_pr_args(rq->scx.cpus_to_kick_if_idle));
+ if (!cpumask_empty(rq->scx.cpus_to_preempt))
+ seq_buf_printf(&s, " cpus_to_preempt: %*pb\n",
+ cpumask_pr_args(rq->scx.cpus_to_preempt));

if (rq->curr->sched_class == &ext_sched_class)
scx_dump_task(&s, rq->curr, '*', now);
@@ -3819,6 +3875,82 @@ static const struct sysrq_key_op sysrq_sched_ext_reset_op = {
.enable_mask = SYSRQ_ENABLE_RTNICE,
};

+static bool can_skip_idle_kick(struct rq *rq)
+{
+ lockdep_assert_rq_held(rq);
+
+ /*
+ * We can skip idle kicking if @rq is going to go through at least one
+ * full SCX scheduling cycle before going idle. Just checking whether
+ * curr is not idle is insufficient because we could be racing
+ * balance_one() trying to pull the next task from a remote rq, which
+ * may fail, and @rq may become idle afterwards.
+ *
+ * The race window is small and we don't and can't guarantee that @rq is
+ * only kicked while idle anyway. Skip only when sure.
+ */
+ return !is_idle_task(rq->curr) && !(rq->scx.flags & SCX_RQ_BALANCING);
+}
+
+static void kick_one_cpu(s32 cpu, struct rq *this_rq)
+{
+ struct rq *rq = cpu_rq(cpu);
+ struct scx_rq *this_scx = &this_rq->scx;
+ unsigned long flags;
+
+ raw_spin_rq_lock_irqsave(rq, flags);
+
+ /*
+ * During CPU hotplug, a CPU may depend on kicking itself to make
+ * forward progress. Allow kicking self regardless of online state.
+ */
+ if (cpu_online(cpu) || cpu == cpu_of(this_rq)) {
+ if (cpumask_test_cpu(cpu, this_scx->cpus_to_preempt)) {
+ if (rq->curr->sched_class == &ext_sched_class)
+ rq->curr->scx.slice = 0;
+ cpumask_clear_cpu(cpu, this_scx->cpus_to_preempt);
+ }
+
+ resched_curr(rq);
+ } else {
+ cpumask_clear_cpu(cpu, this_scx->cpus_to_preempt);
+ }
+
+ raw_spin_rq_unlock_irqrestore(rq, flags);
+}
+
+static void kick_one_cpu_if_idle(s32 cpu, struct rq *this_rq)
+{
+ struct rq *rq = cpu_rq(cpu);
+ unsigned long flags;
+
+ raw_spin_rq_lock_irqsave(rq, flags);
+
+ if (!can_skip_idle_kick(rq) &&
+ (cpu_online(cpu) || cpu == cpu_of(this_rq)))
+ resched_curr(rq);
+
+ raw_spin_rq_unlock_irqrestore(rq, flags);
+}
+
+static void kick_cpus_irq_workfn(struct irq_work *irq_work)
+{
+ struct rq *this_rq = this_rq();
+ struct scx_rq *this_scx = &this_rq->scx;
+ s32 cpu;
+
+ for_each_cpu(cpu, this_scx->cpus_to_kick) {
+ kick_one_cpu(cpu, this_rq);
+ cpumask_clear_cpu(cpu, this_scx->cpus_to_kick);
+ cpumask_clear_cpu(cpu, this_scx->cpus_to_kick_if_idle);
+ }
+
+ for_each_cpu(cpu, this_scx->cpus_to_kick_if_idle) {
+ kick_one_cpu_if_idle(cpu, this_rq);
+ cpumask_clear_cpu(cpu, this_scx->cpus_to_kick_if_idle);
+ }
+}
+
/**
* print_scx_info - print out sched_ext scheduler state
* @log_lvl: the log level to use when printing
@@ -3873,7 +4005,7 @@ void __init init_sched_ext_class(void)
* definitions so that BPF scheduler implementations can use them
* through the generated vmlinux.h.
*/
- WRITE_ONCE(v, SCX_ENQ_WAKEUP | SCX_DEQ_SLEEP);
+ WRITE_ONCE(v, SCX_ENQ_WAKEUP | SCX_DEQ_SLEEP | SCX_KICK_PREEMPT);

BUG_ON(rhashtable_init(&dsq_hash, &dsq_hash_params));
init_dsq(&scx_dsq_global, SCX_DSQ_GLOBAL);
@@ -3886,6 +4018,11 @@ void __init init_sched_ext_class(void)

init_dsq(&rq->scx.local_dsq, SCX_DSQ_LOCAL);
INIT_LIST_HEAD(&rq->scx.runnable_list);
+
+ BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_kick, GFP_KERNEL));
+ BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_kick_if_idle, GFP_KERNEL));
+ BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_preempt, GFP_KERNEL));
+ init_irq_work(&rq->scx.kick_cpus_irq_work, kick_cpus_irq_workfn);
}

register_sysrq_key('S', &sysrq_sched_ext_reset_op);
@@ -4173,6 +4310,67 @@ static const struct btf_kfunc_id_set scx_kfunc_set_dispatch = {

__bpf_kfunc_start_defs();

+/**
+ * scx_bpf_kick_cpu - Trigger reschedule on a CPU
+ * @cpu: cpu to kick
+ * @flags: %SCX_KICK_* flags
+ *
+ * Kick @cpu into rescheduling. This can be used to wake up an idle CPU or
+ * trigger rescheduling on a busy CPU. This can be called from any online
+ * scx_ops operation and the actual kicking is performed asynchronously through
+ * an irq work.
+ */
+__bpf_kfunc void scx_bpf_kick_cpu(s32 cpu, u64 flags)
+{
+ struct rq *this_rq;
+ unsigned long irq_flags;
+
+ if (!ops_cpu_valid(cpu, NULL))
+ return;
+
+ /*
+ * While bypassing for PM ops, IRQ handling may not be online which can
+ * lead to irq_work_queue() malfunction such as infinite busy wait for
+ * IRQ status update. Suppress kicking.
+ */
+ if (scx_ops_bypassing())
+ return;
+
+ local_irq_save(irq_flags);
+
+ this_rq = this_rq();
+
+ /*
+ * Actual kicking is bounced to kick_cpus_irq_workfn() to avoid nesting
+ * rq locks. We can probably be smarter and avoid bouncing if called
+ * from ops which don't hold a rq lock.
+ */
+ if (flags & SCX_KICK_IDLE) {
+ struct rq *target_rq = cpu_rq(cpu);
+
+ if (unlikely(flags & SCX_KICK_PREEMPT))
+ scx_ops_error("PREEMPT cannot be used with SCX_KICK_IDLE");
+
+ if (raw_spin_rq_trylock(target_rq)) {
+ if (can_skip_idle_kick(target_rq)) {
+ raw_spin_rq_unlock(target_rq);
+ goto out;
+ }
+ raw_spin_rq_unlock(target_rq);
+ }
+ cpumask_set_cpu(cpu, this_rq->scx.cpus_to_kick_if_idle);
+ } else {
+ cpumask_set_cpu(cpu, this_rq->scx.cpus_to_kick);
+
+ if (flags & SCX_KICK_PREEMPT)
+ cpumask_set_cpu(cpu, this_rq->scx.cpus_to_preempt);
+ }
+
+ irq_work_queue(&this_rq->scx.kick_cpus_irq_work);
+out:
+ local_irq_restore(irq_flags);
+}
+
/**
* scx_bpf_dsq_nr_queued - Return the number of queued tasks
* @dsq_id: id of the DSQ
@@ -4520,6 +4718,7 @@ __bpf_kfunc s32 scx_bpf_task_cpu(const struct task_struct *p)
__bpf_kfunc_end_defs();

BTF_KFUNCS_START(scx_kfunc_ids_any)
+BTF_ID_FLAGS(func, scx_bpf_kick_cpu)
BTF_ID_FLAGS(func, scx_bpf_dsq_nr_queued)
BTF_ID_FLAGS(func, scx_bpf_destroy_dsq)
BTF_ID_FLAGS(func, scx_bpf_exit_bstr, KF_TRUSTED_ARGS)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4a55a31250ab..2ce8cd64fa65 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -709,12 +709,22 @@ struct cfs_rq {
};

#ifdef CONFIG_SCHED_CLASS_EXT
+/* scx_rq->flags, protected by the rq lock */
+enum scx_rq_flags {
+ SCX_RQ_BALANCING = 1 << 0,
+};
+
struct scx_rq {
struct scx_dispatch_q local_dsq;
struct list_head runnable_list; /* runnable tasks on this rq */
unsigned long ops_qseq;
u64 extra_enq_flags; /* see move_task_to_local_dsq() */
u32 nr_running;
+ u32 flags;
+ cpumask_var_t cpus_to_kick;
+ cpumask_var_t cpus_to_kick_if_idle;
+ cpumask_var_t cpus_to_preempt;
+ struct irq_work kick_cpus_irq_work;
};
#endif /* CONFIG_SCHED_CLASS_EXT */

diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
index 6b355899f67d..8b4052034f93 100644
--- a/tools/sched_ext/include/scx/common.bpf.h
+++ b/tools/sched_ext/include/scx/common.bpf.h
@@ -34,6 +34,7 @@ void scx_bpf_dispatch(struct task_struct *p, u64 dsq_id, u64 slice, u64 enq_flag
u32 scx_bpf_dispatch_nr_slots(void) __ksym;
void scx_bpf_dispatch_cancel(void) __ksym;
bool scx_bpf_consume(u64 dsq_id) __ksym;
+void scx_bpf_kick_cpu(s32 cpu, u64 flags) __ksym;
s32 scx_bpf_dsq_nr_queued(u64 dsq_id) __ksym;
void scx_bpf_destroy_dsq(u64 dsq_id) __ksym;
void scx_bpf_exit_bstr(s64 exit_code, char *fmt, unsigned long long *data, u32 data__sz) __ksym __weak;
diff --git a/tools/sched_ext/include/scx/compat.bpf.h b/tools/sched_ext/include/scx/compat.bpf.h
index c32a6a0f994c..0729aa9bb03e 100644
--- a/tools/sched_ext/include/scx/compat.bpf.h
+++ b/tools/sched_ext/include/scx/compat.bpf.h
@@ -7,6 +7,22 @@
#ifndef __SCX_COMPAT_BPF_H
#define __SCX_COMPAT_BPF_H

+#define __COMPAT_ENUM_OR_ZERO(__type, __ent) \
+({ \
+ __type __ret = 0; \
+ if (bpf_core_enum_value_exists(__type, __ent)) \
+ __ret = __ent; \
+ __ret; \
+})
+
+/*
+ * %SCX_KICK_IDLE is a later addition. To support both before and after, use
+ * %__COMPAT_SCX_KICK_IDLE which becomes 0 on kernels which don't support it.
+ * Users can use %SCX_KICK_IDLE directly in the future.
+ */
+#define __COMPAT_SCX_KICK_IDLE \
+ __COMPAT_ENUM_OR_ZERO(enum scx_kick_flags, SCX_KICK_IDLE)
+
/*
* scx_switch_all() was replaced by %SCX_OPS_SWITCH_PARTIAL. See
* %__COMPAT_SCX_OPS_SWITCH_PARTIAL in compat.h. This can be dropped in the
--
2.44.0


2024-05-01 15:18:51

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 23/39] sched_ext: Add a central scheduler which makes all scheduling decisions on one CPU

This patch adds a new example scheduler, scx_central, which demonstrates
central scheduling where one CPU is responsible for making all scheduling
decisions in the system using scx_bpf_kick_cpu(). The central CPU makes
scheduling decisions for all CPUs in the system, queues tasks on the
appropriate local dsq's and preempts the worker CPUs. The worker CPUs in
turn preempt the central CPU when it needs tasks to run.

Currently, every CPU depends on its own tick to expire the current task. A
follow-up patch implementing tickless support for sched_ext will allow the
worker CPUs to go full tickless so that they can run completely undisturbed.
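
The key primitive is direct dispatch into another CPU's local DSQ (a sketch;
the scheduler below uses this verbatim):

  /* make @p the next task on @cpu by inserting into that CPU's local DSQ */
  scx_bpf_dispatch(p, SCX_DSQ_LOCAL_ON | cpu, SCX_SLICE_DFL, 0);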

v3: - Kumar fixed a bug where the dispatch path could overflow the dispatch
buffer if too many are dispatched to the fallback DSQ.

- Use the new SCX_KICK_IDLE to wake up non-central CPUs.

- Dropped '-p' option.

v2: - Use RESIZABLE_ARRAY() instead of fixed MAX_CPUS and use SCX_BUG[_ON]()
to simplify error handling.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
Cc: Kumar Kartikeya Dwivedi <[email protected]>
Cc: Julia Lawall <[email protected]>
---
tools/sched_ext/Makefile | 2 +-
tools/sched_ext/scx_central.bpf.c | 214 ++++++++++++++++++++++++++++++
tools/sched_ext/scx_central.c | 105 +++++++++++++++
3 files changed, 320 insertions(+), 1 deletion(-)
create mode 100644 tools/sched_ext/scx_central.bpf.c
create mode 100644 tools/sched_ext/scx_central.c

diff --git a/tools/sched_ext/Makefile b/tools/sched_ext/Makefile
index 626782a21375..bf7e108f5ae1 100644
--- a/tools/sched_ext/Makefile
+++ b/tools/sched_ext/Makefile
@@ -176,7 +176,7 @@ $(INCLUDE_DIR)/%.bpf.skel.h: $(SCXOBJ_DIR)/%.bpf.o $(INCLUDE_DIR)/vmlinux.h $(BP

SCX_COMMON_DEPS := include/scx/common.h include/scx/user_exit_info.h | $(BINDIR)

-c-sched-targets = scx_simple scx_qmap
+c-sched-targets = scx_simple scx_qmap scx_central

$(addprefix $(BINDIR)/,$(c-sched-targets)): \
$(BINDIR)/%: \
diff --git a/tools/sched_ext/scx_central.bpf.c b/tools/sched_ext/scx_central.bpf.c
new file mode 100644
index 000000000000..3d980375a058
--- /dev/null
+++ b/tools/sched_ext/scx_central.bpf.c
@@ -0,0 +1,214 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * A central FIFO sched_ext scheduler which demonstrates the following:
+ *
+ * a. Making all scheduling decisions from one CPU:
+ *
+ * The central CPU is the only one making scheduling decisions. All other
+ * CPUs kick the central CPU when they run out of tasks to run.
+ *
+ * There is one global BPF queue and the central CPU schedules all CPUs by
+ * dispatching from the global queue to each CPU's local dsq from dispatch().
+ * This isn't the most straightforward design. e.g., it'd be easier to bounce
+ * through per-CPU BPF queues. The current design is chosen to maximally
+ * utilize and verify various SCX mechanisms such as LOCAL_ON dispatching.
+ *
+ * b. Preemption
+ *
+ * SCX_KICK_PREEMPT is used to trigger scheduling and CPUs to move to the
+ * next tasks.
+ *
+ * This scheduler is designed to maximize usage of various SCX mechanisms. A
+ * more practical implementation would likely put the scheduling loop outside
+ * the central CPU's dispatch() path and add some form of priority mechanism.
+ *
+ * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2022 Tejun Heo <[email protected]>
+ * Copyright (c) 2022 David Vernet <[email protected]>
+ */
+#include <scx/common.bpf.h>
+
+char _license[] SEC("license") = "GPL";
+
+enum {
+ FALLBACK_DSQ_ID = 0,
+};
+
+const volatile s32 central_cpu;
+const volatile u32 nr_cpu_ids = 1; /* !0 for veristat, set during init */
+const volatile u64 slice_ns = SCX_SLICE_DFL;
+
+u64 nr_total, nr_locals, nr_queued, nr_lost_pids;
+u64 nr_dispatches, nr_mismatches, nr_retries;
+u64 nr_overflows;
+
+UEI_DEFINE(uei);
+
+struct {
+ __uint(type, BPF_MAP_TYPE_QUEUE);
+ __uint(max_entries, 4096);
+ __type(value, s32);
+} central_q SEC(".maps");
+
+/* can't use a percpu map - BPF lookups only return the current CPU's element */
+bool RESIZABLE_ARRAY(data, cpu_gimme_task);
+
+s32 BPF_STRUCT_OPS(central_select_cpu, struct task_struct *p,
+ s32 prev_cpu, u64 wake_flags)
+{
+ /*
+ * Steer wakeups to the central CPU as much as possible to avoid
+ * disturbing other CPUs. It's safe to blindly return the central cpu as
+ * select_cpu() is a hint and if @p can't be on it, the kernel will
+ * automatically pick a fallback CPU.
+ */
+ return central_cpu;
+}
+
+void BPF_STRUCT_OPS(central_enqueue, struct task_struct *p, u64 enq_flags)
+{
+ s32 pid = p->pid;
+
+ __sync_fetch_and_add(&nr_total, 1);
+
+ if (bpf_map_push_elem(&central_q, &pid, 0)) {
+ __sync_fetch_and_add(&nr_overflows, 1);
+ scx_bpf_dispatch(p, FALLBACK_DSQ_ID, SCX_SLICE_DFL, enq_flags);
+ return;
+ }
+
+ __sync_fetch_and_add(&nr_queued, 1);
+
+ if (!scx_bpf_task_running(p))
+ scx_bpf_kick_cpu(central_cpu, SCX_KICK_PREEMPT);
+}
+
+static bool dispatch_to_cpu(s32 cpu)
+{
+ struct task_struct *p;
+ s32 pid;
+
+ bpf_repeat(BPF_MAX_LOOPS) {
+ if (bpf_map_pop_elem(&central_q, &pid))
+ break;
+
+ __sync_fetch_and_sub(&nr_queued, 1);
+
+ p = bpf_task_from_pid(pid);
+ if (!p) {
+ __sync_fetch_and_add(&nr_lost_pids, 1);
+ continue;
+ }
+
+ /*
+ * If we can't run the task at the top, do the dumb thing and
+ * bounce it to the fallback dsq.
+ */
+ if (!bpf_cpumask_test_cpu(cpu, p->cpus_ptr)) {
+ __sync_fetch_and_add(&nr_mismatches, 1);
+ scx_bpf_dispatch(p, FALLBACK_DSQ_ID, SCX_SLICE_DFL, 0);
+ bpf_task_release(p);
+ /*
+ * We might run out of dispatch buffer slots if we keep dispatching
+ * to the fallback DSQ without dispatching to the local DSQ of the
+ * target CPU. In such a case, break out of the loop now as the next
+ * dispatch operation would otherwise fail.
+ */
+ if (!scx_bpf_dispatch_nr_slots())
+ break;
+ continue;
+ }
+
+ /* dispatch to local and mark that @cpu doesn't need more */
+ scx_bpf_dispatch(p, SCX_DSQ_LOCAL_ON | cpu, SCX_SLICE_DFL, 0);
+
+ if (cpu != central_cpu)
+ scx_bpf_kick_cpu(cpu, __COMPAT_SCX_KICK_IDLE);
+
+ bpf_task_release(p);
+ return true;
+ }
+
+ return false;
+}
+
+void BPF_STRUCT_OPS(central_dispatch, s32 cpu, struct task_struct *prev)
+{
+ if (cpu == central_cpu) {
+ /* dispatch for all other CPUs first */
+ __sync_fetch_and_add(&nr_dispatches, 1);
+
+ bpf_for(cpu, 0, nr_cpu_ids) {
+ bool *gimme;
+
+ if (!scx_bpf_dispatch_nr_slots())
+ break;
+
+ /* central's gimme is never set */
+ gimme = ARRAY_ELEM_PTR(cpu_gimme_task, cpu, nr_cpu_ids);
+ if (!gimme || !*gimme)
+ continue;
+
+ if (dispatch_to_cpu(cpu))
+ *gimme = false;
+ }
+
+ /*
+ * Retry if we ran out of dispatch buffer slots as we might have
+ * skipped some CPUs and also need to dispatch for self. The ext
+ * core automatically retries if the local dsq is empty but we
+ * can't rely on that as we're dispatching for other CPUs too.
+ * Kick self explicitly to retry.
+ */
+ if (!scx_bpf_dispatch_nr_slots()) {
+ __sync_fetch_and_add(&nr_retries, 1);
+ scx_bpf_kick_cpu(central_cpu, SCX_KICK_PREEMPT);
+ return;
+ }
+
+ /* look for a task to run on the central CPU */
+ if (scx_bpf_consume(FALLBACK_DSQ_ID))
+ return;
+ dispatch_to_cpu(central_cpu);
+ } else {
+ bool *gimme;
+
+ if (scx_bpf_consume(FALLBACK_DSQ_ID))
+ return;
+
+ gimme = ARRAY_ELEM_PTR(cpu_gimme_task, cpu, nr_cpu_ids);
+ if (gimme)
+ *gimme = true;
+
+ /*
+ * Force dispatch on the scheduling CPU so that it finds a task
+ * to run for us.
+ */
+ scx_bpf_kick_cpu(central_cpu, SCX_KICK_PREEMPT);
+ }
+}
+
+int BPF_STRUCT_OPS_SLEEPABLE(central_init)
+{
+ return scx_bpf_create_dsq(FALLBACK_DSQ_ID, -1);
+}
+
+void BPF_STRUCT_OPS(central_exit, struct scx_exit_info *ei)
+{
+ UEI_RECORD(uei, ei);
+}
+
+SCX_OPS_DEFINE(central_ops,
+ /*
+ * We are offloading all scheduling decisions to the central CPU
+ * and thus being the last task on a given CPU doesn't mean
+ * anything special. Enqueue the last tasks like any other tasks.
+ */
+ .flags = SCX_OPS_ENQ_LAST,
+
+ .select_cpu = (void *)central_select_cpu,
+ .enqueue = (void *)central_enqueue,
+ .dispatch = (void *)central_dispatch,
+ .init = (void *)central_init,
+ .exit = (void *)central_exit,
+ .name = "central");
diff --git a/tools/sched_ext/scx_central.c b/tools/sched_ext/scx_central.c
new file mode 100644
index 000000000000..02cd983a5287
--- /dev/null
+++ b/tools/sched_ext/scx_central.c
@@ -0,0 +1,105 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2022 Tejun Heo <[email protected]>
+ * Copyright (c) 2022 David Vernet <[email protected]>
+ */
+#define _GNU_SOURCE
+#include <sched.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <inttypes.h>
+#include <signal.h>
+#include <libgen.h>
+#include <bpf/bpf.h>
+#include <scx/common.h>
+#include "scx_central.bpf.skel.h"
+
+const char help_fmt[] =
+"A central FIFO sched_ext scheduler.\n"
+"\n"
+"See the top-level comment in .bpf.c for more details.\n"
+"\n"
+"Usage: %s [-s SLICE_US] [-c CPU]\n"
+"\n"
+" -s SLICE_US Override slice duration\n"
+" -c CPU Override the central CPU (default: 0)\n"
+" -v Print libbpf debug messages\n"
+" -h Display this help and exit\n";
+
+static bool verbose;
+static volatile int exit_req;
+
+static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
+{
+ if (level == LIBBPF_DEBUG && !verbose)
+ return 0;
+ return vfprintf(stderr, format, args);
+}
+
+static void sigint_handler(int dummy)
+{
+ exit_req = 1;
+}
+
+int main(int argc, char **argv)
+{
+ struct scx_central *skel;
+ struct bpf_link *link;
+ __u64 seq = 0;
+ __s32 opt;
+
+ libbpf_set_print(libbpf_print_fn);
+ signal(SIGINT, sigint_handler);
+ signal(SIGTERM, sigint_handler);
+
+ skel = SCX_OPS_OPEN(central_ops, scx_central);
+
+ skel->rodata->central_cpu = 0;
+ skel->rodata->nr_cpu_ids = libbpf_num_possible_cpus();
+
+ while ((opt = getopt(argc, argv, "s:c:vh")) != -1) {
+ switch (opt) {
+ case 's':
+ skel->rodata->slice_ns = strtoull(optarg, NULL, 0) * 1000;
+ break;
+ case 'c':
+ skel->rodata->central_cpu = strtoul(optarg, NULL, 0);
+ break;
+ case 'v':
+ verbose = true;
+ break;
+ default:
+ fprintf(stderr, help_fmt, basename(argv[0]));
+ return opt != 'h';
+ }
+ }
+
+ /* Resize arrays so their element count is equal to cpu count. */
+ RESIZE_ARRAY(data, cpu_gimme_task, skel->rodata->nr_cpu_ids);
+
+ SCX_OPS_LOAD(skel, central_ops, scx_central, uei);
+ link = SCX_OPS_ATTACH(skel, central_ops);
+
+ while (!exit_req && !UEI_EXITED(skel, uei)) {
+ printf("[SEQ %llu]\n", seq++);
+ printf("total :%10" PRIu64 " local:%10" PRIu64 " queued:%10" PRIu64 " lost:%10" PRIu64 "\n",
+ skel->bss->nr_total,
+ skel->bss->nr_locals,
+ skel->bss->nr_queued,
+ skel->bss->nr_lost_pids);
+ printf(" dispatch:%10" PRIu64 " mismatch:%10" PRIu64 " retry:%10" PRIu64 "\n",
+ skel->bss->nr_dispatches,
+ skel->bss->nr_mismatches,
+ skel->bss->nr_retries);
+ printf("overflow:%10" PRIu64 "\n",
+ skel->bss->nr_overflows);
+ fflush(stdout);
+ sleep(1);
+ }
+
+ bpf_link__destroy(link);
+ UEI_REPORT(skel, uei);
+ scx_central__destroy(skel);
+ return 0;
+}
--
2.44.0


2024-05-01 15:19:12

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 24/39] sched_ext: Make watchdog handle ops.dispatch() looping stall

The dispatch path retries if the local DSQ is still empty after
ops.dispatch() either dispatched or consumed a task. This is both out of
necessity and for convenience. It has to retry because the dispatch path
might lose the tasks to dequeue when the rq lock is released to migrate
tasks across CPUs, and the retry mechanism makes ops.dispatch()
implementation easier as it only needs to make some forward progress each
iteration.

However, this makes it possible for ops.dispatch() to stall CPUs by
repeatedly dispatching ineligible tasks. If all CPUs are stalled that way,
the watchdog or sysrq handler can't run and the system can't be saved. Let's
address the issue by breaking out of the dispatch loop after 32 iterations.

It is unlikely but not impossible for ops.dispatch() to legitimately go over
the iteration limit. We want to come back to the dispatch path in such cases
as not doing so risks stalling the CPU by idling with runnable tasks
pending. As the previous task is still current in balance_scx(),
resched_curr() doesn't do anything - it will just get cleared. Let's instead
use scx_bpf_kick_cpu() which will trigger a reschedule after switching to the
next task, which will likely be the idle task.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
---
kernel/sched/ext.c | 17 +++++++++++++++++
tools/sched_ext/scx_qmap.bpf.c | 15 +++++++++++++++
tools/sched_ext/scx_qmap.c | 8 ++++++--
3 files changed, 38 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 26c6a0b1e909..495210cd12f9 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -8,6 +8,7 @@

enum scx_consts {
SCX_DSP_DFL_MAX_BATCH = 32,
+ SCX_DSP_MAX_LOOPS = 32,
SCX_WATCHDOG_MAX_TIMEOUT = 30 * HZ,

SCX_EXIT_BT_LEN = 64,
@@ -598,6 +599,7 @@ static DEFINE_PER_CPU(struct scx_dsp_ctx, scx_dsp_ctx);
static struct kset *scx_kset;
static struct kobject *scx_root_kobj;

+static void scx_bpf_kick_cpu(s32 cpu, u64 flags);
static __printf(3, 4) void scx_ops_exit_kind(enum scx_exit_kind kind,
s64 exit_code,
const char *fmt, ...);
@@ -1840,6 +1842,7 @@ static int balance_scx(struct rq *rq, struct task_struct *prev,
struct scx_rq *scx_rq = &rq->scx;
struct scx_dsp_ctx *dspc = this_cpu_ptr(&scx_dsp_ctx);
bool prev_on_scx = prev->sched_class == &ext_sched_class;
+ int nr_loops = SCX_DSP_MAX_LOOPS;
bool has_tasks = false;

lockdep_assert_rq_held(rq);
@@ -1896,6 +1899,20 @@ static int balance_scx(struct rq *rq, struct task_struct *prev,
goto has_tasks;
if (consume_dispatch_q(rq, rf, &scx_dsq_global))
goto has_tasks;
+
+ /*
+ * ops.dispatch() can trap us in this loop by repeatedly
+ * dispatching ineligible tasks. Break out once in a while to
+ * allow the watchdog to run. As IRQ can't be enabled in
+ * balance(), we want to complete this scheduling cycle and then
+ * start a new one. IOW, we want to call resched_curr() on the
+ * next, most likely idle, task, not the current one. Use
+ * scx_bpf_kick_cpu() for deferred kicking.
+ */
+ if (unlikely(!--nr_loops)) {
+ scx_bpf_kick_cpu(cpu_of(rq), 0);
+ break;
+ }
} while (dspc->nr_tasks);

goto out;
diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
index e18f25017a0a..812004bf027a 100644
--- a/tools/sched_ext/scx_qmap.bpf.c
+++ b/tools/sched_ext/scx_qmap.bpf.c
@@ -31,6 +31,7 @@ char _license[] SEC("license") = "GPL";
const volatile u64 slice_ns = SCX_SLICE_DFL;
const volatile u32 stall_user_nth;
const volatile u32 stall_kernel_nth;
+const volatile u32 dsp_inf_loop_after;
const volatile u32 dsp_batch;
const volatile s32 disallow_tgid;
const volatile bool switch_partial;
@@ -198,6 +199,20 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
if (scx_bpf_consume(SHARED_DSQ))
return;

+ if (dsp_inf_loop_after && nr_dispatched > dsp_inf_loop_after) {
+ /*
+ * PID 2 should be kthreadd which should mostly be idle and off
+ * the scheduler. Let's keep dispatching it to force the kernel
+ * to call this function over and over again.
+ */
+ p = bpf_task_from_pid(2);
+ if (p) {
+ scx_bpf_dispatch(p, SCX_DSQ_LOCAL, slice_ns, 0);
+ bpf_task_release(p);
+ return;
+ }
+ }
+
if (!(cpuc = bpf_map_lookup_elem(&cpu_ctx_stor, &zero))) {
scx_bpf_error("failed to look up cpu_ctx");
return;
diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c
index 28fd5aa4e62c..36254631589e 100644
--- a/tools/sched_ext/scx_qmap.c
+++ b/tools/sched_ext/scx_qmap.c
@@ -19,13 +19,14 @@ const char help_fmt[] =
"\n"
"See the top-level comment in .bpf.c for more details.\n"
"\n"
-"Usage: %s [-s SLICE_US] [-e COUNT] [-t COUNT] [-T COUNT] [-b COUNT]\n"
+"Usage: %s [-s SLICE_US] [-e COUNT] [-t COUNT] [-T COUNT] [-l COUNT] [-b COUNT]\n"
" [-d PID] [-D LEN] [-p] [-v]\n"
"\n"
" -s SLICE_US Override slice duration\n"
" -e COUNT Trigger scx_bpf_error() after COUNT enqueues\n"
" -t COUNT Stall every COUNT'th user thread\n"
" -T COUNT Stall every COUNT'th kernel thread\n"
+" -l COUNT Trigger dispatch infinite looping after COUNT dispatches\n"
" -b COUNT Dispatch upto COUNT tasks together\n"
" -d PID Disallow a process from switching into SCHED_EXT (-1 for self)\n"
" -D LEN Set scx_exit_info.dump buffer length\n"
@@ -60,7 +61,7 @@ int main(int argc, char **argv)

skel = SCX_OPS_OPEN(qmap_ops, scx_qmap);

- while ((opt = getopt(argc, argv, "s:e:t:T:b:d:D:pvh")) != -1) {
+ while ((opt = getopt(argc, argv, "s:e:t:T:l:b:d:D:pvh")) != -1) {
switch (opt) {
case 's':
skel->rodata->slice_ns = strtoull(optarg, NULL, 0) * 1000;
@@ -74,6 +75,9 @@ int main(int argc, char **argv)
case 'T':
skel->rodata->stall_kernel_nth = strtoul(optarg, NULL, 0);
break;
+ case 'l':
+ skel->rodata->dsp_inf_loop_after = strtoul(optarg, NULL, 0);
+ break;
case 'b':
skel->rodata->dsp_batch = strtoul(optarg, NULL, 0);
break;
--
2.44.0


2024-05-01 15:19:09

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 14/39] sched_ext: Implement BPF extensible scheduler class

Implement a new scheduler class sched_ext (SCX), which allows scheduling
policies to be implemented as BPF programs to achieve the following:

1. Ease of experimentation and exploration: Enabling rapid iteration of new
scheduling policies.

2. Customization: Building application-specific schedulers which implement
policies that are not applicable to general-purpose schedulers.

3. Rapid scheduler deployments: Non-disruptive swap outs of scheduling
policies in production environments.

sched_ext leverages BPF’s struct_ops feature to define a structure which
exports function callbacks and flags to BPF programs that wish to implement
scheduling policies. The struct_ops structure exported by sched_ext is
struct sched_ext_ops, and is conceptually similar to struct sched_class. The
role of sched_ext is to map the complex sched_class callbacks to the simpler
and more ergonomic struct sched_ext_ops callbacks.
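
To give a feel for the shape of the interface, here is a rough sketch of a
minimal global-FIFO scheduler in the style of the example schedulers added
later in the series (the common.bpf.h header and helper macros are borrowed
from those examples and are not part of this patch):

  #include <scx/common.bpf.h>

  char _license[] SEC("license") = "GPL";

  void BPF_STRUCT_OPS(minimal_enqueue, struct task_struct *p, u64 enq_flags)
  {
          /* every runnable task goes onto the shared global FIFO */
          scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
  }

  SEC(".struct_ops.link")
  struct sched_ext_ops minimal_ops = {
          .enqueue = (void *)minimal_enqueue,
          .name = "minimal",
  };

All other operations are omitted and fall back to the built-in defaults:
tasks left on the global DSQ are consumed into each CPU's local DSQ as it
runs dry.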

For more detailed discussion on the motivations and overview, please refer
to the cover letter.

Later patches will also add several example schedulers and documentation.

This patch implements the minimum core framework to enable implementation of
BPF schedulers. Subsequent patches will gradually add functionalities
including safety guarantee mechanisms, nohz and cgroup support.

include/linux/sched/ext.h defines struct sched_ext_ops. With the comment on
top, each operation should be self-explanatory. The following are worth
noting:

- Both "sched_ext" and its shorthand "scx" are used. If the identifier
already has "sched" in it, "ext" is used; otherwise, "scx".

- In sched_ext_ops, only .name is mandatory. Every operation is optional and
if omitted a simple but functional default behavior is provided.

- A new policy constant SCHED_EXT is added and a task can select sched_ext
by invoking sched_setscheduler(2) with the new policy constant (see the
snippet after this list). However, if the BPF scheduler is not loaded,
SCHED_EXT is the same as SCHED_NORMAL and the task is scheduled by CFS.
When the BPF scheduler is loaded, all tasks which have the SCHED_EXT policy
are switched to sched_ext.

- To bridge the workflow imbalance between the scheduler core and
sched_ext_ops callbacks, sched_ext uses simple FIFOs called dispatch
queues (dsq's). By default, there is one global dsq (SCX_DSQ_GLOBAL), and
one local per-CPU dsq (SCX_DSQ_LOCAL). SCX_DSQ_GLOBAL is provided for
convenience and need not be used by a scheduler that doesn't require it.
SCX_DSQ_LOCAL is the per-CPU FIFO that sched_ext pulls from when putting
the next task on the CPU. The BPF scheduler can manage an arbitrary number
of dsq's using scx_bpf_create_dsq() and scx_bpf_destroy_dsq().

- sched_ext guarantees system integrity no matter what the BPF scheduler
does. To enable this, each task's ownership is tracked through
p->scx.ops_state and all tasks are put on scx_tasks list. The disable path
can always recover and revert all tasks back to CFS. See p->scx.ops_state
and scx_tasks.

- A task is not tied to its rq while enqueued. This decouples CPU selection
from queueing and allows sharing a scheduling queue across an arbitrary
subset of CPUs. This adds some complexities as a task may need to be
bounced between rq's right before it starts executing. See
dispatch_to_local_dsq() and move_task_to_local_dsq().

- One complication that arises from the above weak association between task
and rq is that synchronizing with dequeue() gets complicated as dequeue()
may happen anytime while the task is enqueued and the dispatch path might
need to release the rq lock to transfer the task. Solving this requires a
bit of complexity. See the logic around p->scx.sticky_cpu and
p->scx.ops_qseq.

- Both enable and disable paths are a bit complicated. The enable path
switches all tasks without blocking to avoid issues which can arise from
partially switched states (e.g. the switching task itself being starved).
The disable path can't trust the BPF scheduler at all, so it also has to
guarantee forward progress without blocking. See scx_ops_enable() and
scx_ops_disable_workfn().

- When sched_ext is disabled, static_branches are used to shut down the
entry points from hot paths.
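
The following is a minimal sketch of the userspace opt-in mentioned in the
SCHED_EXT bullet above. It is illustrative only and not part of this patch;
the fallback #define is just for building against pre-patch headers:

  #include <sched.h>
  #include <stdio.h>

  #ifndef SCHED_EXT
  #define SCHED_EXT 7 /* matches the uapi addition below */
  #endif

  int main(void)
  {
          struct sched_param param = { .sched_priority = 0 };

          /*
           * Opt the calling task into SCHED_EXT. Until a BPF scheduler
           * is loaded, it continues to be scheduled by CFS.
           */
          if (sched_setscheduler(0, SCHED_EXT, &param))
                  perror("sched_setscheduler");
          return 0;
  }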

v6: - SCX_NR_ONLINE_OPS replaced with SCX_OPI_*_BEGIN/END so that multiple
groups can be expressed. Later CPU hotplug operations are put into
their own group.

- SCX_OPS_DISABLING state is replaced with the new bypass mechanism
which allows temporarily putting the system into simple FIFO
scheduling mode bypassing the BPF scheduler. In addition to the
shutdown path, this will also be used to isolate the BPF scheduler across
PM events. Enabling and disabling the bypass mode requires iterating
all runnable tasks. rq->scx.runnable_list addition is moved from the
later watchdog patch.

- ops.prep_enable() is replaced with ops.init_task() and
ops.enable/disable() are now called whenever the task enters and
leaves sched_ext instead of when the task becomes schedulable on
sched_ext and stops being so. A new operation - ops.exit_task() - is
called when the task stops being schedulable on sched_ext.

- scx_bpf_dispatch() can now be called from ops.select_cpu() too. This
removes the need for communicating local dispatch decision made by
ops.select_cpu() to ops.enqueue() via per-task storage.
SCX_KF_SELECT_CPU is added to support the change.

- SCX_TASK_ENQ_LOCAL which told the BPF scheduler that
scx_select_cpu_dfl() wants the task to be dispatched to the local DSQ
was removed. Instead, scx_select_cpu_dfl() now dispatches directly if
it finds a suitable idle CPU. If such behavior is not desired, users
can use scx_bpf_select_cpu_dfl() which returns the verdict in a bool
out param (see the sketch at the end of this v6 changelog).

- scx_select_cpu_dfl() was mishandling WAKE_SYNC and could end up
queueing many tasks on a local DSQ, making them execute in order while
other CPUs stayed idle, which made some hackbench numbers really bad.
Fixed.

- The current state of sched_ext can now be monitored through files
under /sys/sched_ext instead of /sys/kernel/debug/sched/ext. This is
to enable monitoring on kernels which don't enable debugfs.

- sched_ext wasn't telling BPF that ops.dispatch()'s @prev argument may
be NULL and a BPF scheduler which derefs the pointer without checking
could crash the kernel. Tell BPF. This is currently a bit ugly. A
better way to annotate this is expected in the future.

- scx_exit_info updated to carry pointers to message buffers instead of
embedding them directly. This decouples buffer sizes from API so that
they can be changed without breaking compatibility.

- exit_code added to scx_exit_info. This is used to indicate different
exit conditions on non-error exits and will be used to handle e.g. CPU
hotplugs.

- The patch "sched_ext: Allow BPF schedulers to switch all eligible
tasks into sched_ext" is folded in and the interface is changed so
that partial switching is indicated with a new ops flag
%SCX_OPS_SWITCH_PARTIAL. This makes scx_bpf_switch_all() unnecessary
and in turn SCX_KF_INIT. ops.init() is now called with
SCX_KF_SLEEPABLE.

- Code reorganized so that only the parts necessary to integrate with
the rest of the kernel are in the header files.

- Changes to reflect the BPF and other kernel changes including the
addition of bpf_sched_ext_ops.cfi_stubs.
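
As a sketch of the select_cpu changes above (the callback name is made up
and the kfunc signature is assumed to match the example schedulers), a
wakeup path using the new out-param convention could look like:

  s32 BPF_STRUCT_OPS(minimal_select_cpu, struct task_struct *p,
                     s32 prev_cpu, u64 wake_flags)
  {
          bool is_idle = false;
          s32 cpu;

          cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);
          if (is_idle)
                  /*
                   * An idle CPU was found. Dispatching directly from
                   * ops.select_cpu() (now allowed) skips ops.enqueue().
                   */
                  scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);

          return cpu;
  }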

v5: - To accommodate 32bit configs, p->scx.ops_state is now atomic_long_t
instead of atomic64_t and scx_dsp_buf_ent.qseq which uses
load_acquire/store_release is now unsigned long instead of u64.

- Fix the bug where bpf_scx_btf_struct_access() was allowing write
access to arbitrary fields.

- Distinguish kfuncs which can be called from any sched_ext ops and from
anywhere. e.g. scx_bpf_pick_idle_cpu() can now be called only from
sched_ext ops.

- Rename "type" to "kind" in scx_exit_info to make it easier to use on
languages in which "type" is a reserved keyword.

- Since cff9b2332ab7 ("kernel/sched: Modify initial boot task idle
setup"), PF_IDLE is not set on idle tasks which haven't been online
yet which made scx_task_iter_next_filtered() include those idle tasks
in iterations leading to oopses. Update scx_task_iter_next_filtered()
to directly test p->sched_class against idle_sched_class instead of
using is_idle_task() which tests PF_IDLE.

- Other updates to match upstream changes such as adding const to
set_cpumask() param and renaming check_preempt_curr() to
wakeup_preempt().

v4: - SCHED_CHANGE_BLOCK replaced with the previous
sched_deq_and_put_task()/sched_enq_and_set_task() pair. This is
because upstream is adopting a different generic cleanup mechanism.
Once that lands, the code will be adapted accordingly.

- task_on_scx() used to test whether a task should be switched into SCX,
which is confusing. Renamed to task_should_scx(). task_on_scx() now
tests whether a task is currently on SCX.

- scx_has_idle_cpus is barely used anymore and replaced with direct
check on the idle cpumask.

- SCX_PICK_IDLE_CORE added and scx_pick_idle_cpu() improved to prefer
fully idle cores.

- ops.enable() now sees up-to-date p->scx.weight value.

- ttwu_queue path is disabled for tasks on SCX to avoid confusing BPF
schedulers expecting ->select_cpu() call.

- Use cpu_smt_mask() instead of topology_sibling_cpumask() like the rest
of the scheduler.

v3: - ops.set_weight() added to allow BPF schedulers to track weight changes
without polling p->scx.weight.

- move_task_to_local_dsq() was losing SCX-specific enq_flags when
enqueueing the task on the target dsq because it goes through
activate_task() which loses the upper 32bit of the flags. Carry the
flags through rq->scx.extra_enq_flags.

- scx_bpf_dispatch(), scx_bpf_pick_idle_cpu(), scx_bpf_task_running()
and scx_bpf_task_cpu() now use the new KF_RCU instead of
KF_TRUSTED_ARGS to make it easier for BPF schedulers to call them.

- The kfunc helper access control mechanism implemented through
sched_ext_entity.kf_mask is improved. Now SCX_CALL_OP*() is always
used when invoking scx_ops operations.

v2: - balance_scx_on_up() is dropped. Instead, on UP, balance_scx() is
called from put_prev_task_scx() and pick_next_task_scx() as necessary.
To determine whether balance_scx() should be called from
put_prev_task_scx(), SCX_TASK_DEQD_FOR_SLEEP flag is added. See the
comment in put_prev_task_scx() for details.

- sched_deq_and_put_task() / sched_enq_and_set_task() sequences replaced
with SCHED_CHANGE_BLOCK().

- Unused all_dsqs list removed. This was a left-over from previous
iterations.

- p->scx.kf_mask is added to track and enforce which kfunc helpers are
allowed. Also, init/exit sequences are updated to make some kfuncs
always safe to call regardless of the current BPF scheduler state.
Combined, this should make all the kfuncs safe.

- BPF now supports sleepable struct_ops operations. Hacky workaround
removed and operations and kfunc helpers are tagged appropriately.

- BPF now supports bitmask / cpumask helpers. scx_bpf_get_idle_cpumask()
and friends are added so that BPF schedulers can use the idle masks
with the generic helpers. This replaces the hacky kfunc helpers added
by a separate patch in V1.

- CONFIG_SCHED_CLASS_EXT can no longer be enabled if SCHED_CORE is
enabled. This restriction will be removed by a later patch which adds
core-sched support.

- Add MAINTAINERS entries and other misc changes.

Signed-off-by: Tejun Heo <[email protected]>
Co-authored-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
Cc: Andrea Righi <[email protected]>
---
MAINTAINERS | 13 +
include/asm-generic/vmlinux.lds.h | 1 +
include/linux/sched.h | 5 +
include/linux/sched/ext.h | 141 +-
include/uapi/linux/sched.h | 1 +
init/init_task.c | 11 +
kernel/Kconfig.preempt | 22 +-
kernel/sched/build_policy.c | 7 +
kernel/sched/core.c | 70 +-
kernel/sched/debug.c | 3 +
kernel/sched/ext.c | 4238 +++++++++++++++++++++++++++++
kernel/sched/ext.h | 82 +-
kernel/sched/sched.h | 17 +
13 files changed, 4603 insertions(+), 8 deletions(-)
create mode 100644 kernel/sched/ext.c

diff --git a/MAINTAINERS b/MAINTAINERS
index c9f887fbb477..83f36314f675 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -19635,6 +19635,19 @@ F: include/linux/wait.h
F: include/uapi/linux/sched.h
F: kernel/sched/

+SCHEDULER - SCHED_EXT
+R: Tejun Heo <[email protected]>
+R: David Vernet <[email protected]>
+L: [email protected]
+S: Maintained
+W: https://github.com/sched-ext/scx
+T: git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git
+F: include/linux/sched/ext.h
+F: kernel/sched/ext.h
+F: kernel/sched/ext.c
+F: tools/sched_ext/
+F: tools/testing/selftests/sched_ext
+
SCSI LIBSAS SUBSYSTEM
R: John Garry <[email protected]>
R: Jason Yan <[email protected]>
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index f7749d0f2562..05bfe4acba1d 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -131,6 +131,7 @@
*(__dl_sched_class) \
*(__rt_sched_class) \
*(__fair_sched_class) \
+ *(__ext_sched_class) \
*(__idle_sched_class) \
__sched_class_lowest = .;

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3c2abbc587b4..dc07eb0d3290 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -80,6 +80,8 @@ struct task_group;
struct task_struct;
struct user_event_mm;

+#include <linux/sched/ext.h>
+
/*
* Task state bitmask. NOTE! These bits are also
* encoded in fs/proc/array.c: get_task_state().
@@ -798,6 +800,9 @@ struct task_struct {
struct sched_rt_entity rt;
struct sched_dl_entity dl;
struct sched_dl_entity *dl_server;
+#ifdef CONFIG_SCHED_CLASS_EXT
+ struct sched_ext_entity scx;
+#endif
const struct sched_class *sched_class;

#ifdef CONFIG_SCHED_CORE
diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index a05dfcf533b0..65b1740a9a07 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -1,9 +1,148 @@
/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2022 Tejun Heo <[email protected]>
+ * Copyright (c) 2022 David Vernet <[email protected]>
+ */
#ifndef _LINUX_SCHED_EXT_H
#define _LINUX_SCHED_EXT_H

#ifdef CONFIG_SCHED_CLASS_EXT
-#error "NOT IMPLEMENTED YET"
+
+#include <linux/rhashtable.h>
+#include <linux/llist.h>
+
+enum scx_public_consts {
+ SCX_OPS_NAME_LEN = 128,
+
+ SCX_SLICE_DFL = 20 * NSEC_PER_MSEC,
+};
+
+/*
+ * DSQ (dispatch queue) IDs are 64bit of the format:
+ *
+ * Bits: [63] [62 .. 0]
+ * [ B] [ ID ]
+ *
+ * B: 1 for IDs for built-in DSQs, 0 for ops-created user DSQs
+ * ID: 63 bit ID
+ *
+ * Built-in IDs:
+ *
+ * Bits: [63] [62] [61..32] [31 .. 0]
+ * [ 1] [ L] [ R ] [ V ]
+ *
+ * 1: 1 for built-in DSQs.
+ * L: 1 for LOCAL_ON DSQ IDs, 0 for others
+ * V: For LOCAL_ON DSQ IDs, a CPU number. For others, a pre-defined value.
+ */
+enum scx_dsq_id_flags {
+ SCX_DSQ_FLAG_BUILTIN = 1LLU << 63,
+ SCX_DSQ_FLAG_LOCAL_ON = 1LLU << 62,
+
+ SCX_DSQ_INVALID = SCX_DSQ_FLAG_BUILTIN | 0,
+ SCX_DSQ_GLOBAL = SCX_DSQ_FLAG_BUILTIN | 1,
+ SCX_DSQ_LOCAL = SCX_DSQ_FLAG_BUILTIN | 2,
+ SCX_DSQ_LOCAL_ON = SCX_DSQ_FLAG_BUILTIN | SCX_DSQ_FLAG_LOCAL_ON,
+ SCX_DSQ_LOCAL_CPU_MASK = 0xffffffffLLU,
+};
+
+/*
+ * Dispatch queue (dsq) is a simple FIFO which is used to buffer between the
+ * scheduler core and the BPF scheduler. See the documentation for more details.
+ */
+struct scx_dispatch_q {
+ raw_spinlock_t lock;
+ struct list_head list; /* tasks in dispatch order */
+ u32 nr;
+ u64 id;
+ struct rhash_head hash_node;
+ struct llist_node free_node;
+ struct rcu_head rcu;
+};
+
+/* scx_entity.flags */
+enum scx_ent_flags {
+ SCX_TASK_QUEUED = 1 << 0, /* on ext runqueue */
+ SCX_TASK_BAL_KEEP = 1 << 1, /* balance decided to keep current */
+ SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */
+ SCX_TASK_DEQD_FOR_SLEEP = 1 << 3, /* last dequeue was for SLEEP */
+
+ SCX_TASK_STATE_SHIFT = 8, /* bits 8 and 9 are used to carry scx_task_state */
+ SCX_TASK_STATE_BITS = 2,
+ SCX_TASK_STATE_MASK = ((1 << SCX_TASK_STATE_BITS) - 1) << SCX_TASK_STATE_SHIFT,
+
+ SCX_TASK_CURSOR = 1 << 31, /* iteration cursor, not a task */
+};
+
+/* scx_entity.flags & SCX_TASK_STATE_MASK */
+enum scx_task_state {
+ SCX_TASK_NONE, /* ops.init_task() not called yet */
+ SCX_TASK_INIT, /* ops.init_task() succeeded, but task can be cancelled */
+ SCX_TASK_READY, /* fully initialized, but not in sched_ext */
+ SCX_TASK_ENABLED, /* fully initialized and in sched_ext */
+
+ SCX_TASK_NR_STATES,
+};
+
+/*
+ * Mask bits for scx_entity.kf_mask. Not all kfuncs can be called from
+ * everywhere and the following bits track which kfunc sets are currently
+ * allowed for %current. This simple per-task tracking works because SCX ops
+ * nest in a limited way. BPF will likely implement a way to allow and disallow
+ * kfuncs depending on the calling context which will replace this manual
+ * mechanism. See scx_kf_allow().
+ */
+enum scx_kf_mask {
+ SCX_KF_UNLOCKED = 0, /* not sleepable, not rq locked */
+ /* all non-sleepables may be nested inside SLEEPABLE */
+ SCX_KF_SLEEPABLE = 1 << 0, /* sleepable init operations */
+ /* ops.dequeue (in REST) may be nested inside DISPATCH */
+ SCX_KF_DISPATCH = 1 << 2, /* ops.dispatch() */
+ SCX_KF_ENQUEUE = 1 << 3, /* ops.enqueue() and ops.select_cpu() */
+ SCX_KF_SELECT_CPU = 1 << 4, /* ops.select_cpu() */
+ SCX_KF_REST = 1 << 5, /* other rq-locked operations */
+
+ __SCX_KF_RQ_LOCKED = SCX_KF_DISPATCH |
+ SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU | SCX_KF_REST,
+};
+
+/*
+ * The following is embedded in task_struct and contains all fields necessary
+ * for a task to be scheduled by SCX.
+ */
+struct sched_ext_entity {
+ struct scx_dispatch_q *dsq;
+ struct list_head dsq_node;
+ u32 flags; /* protected by rq lock */
+ u32 weight;
+ s32 sticky_cpu;
+ s32 holding_cpu;
+ u32 kf_mask; /* see scx_kf_mask above */
+ atomic_long_t ops_state;
+
+ struct list_head runnable_node; /* rq->scx.runnable_list */
+
+ u64 ddsp_dsq_id;
+ u64 ddsp_enq_flags;
+
+ /* BPF scheduler modifiable fields */
+
+ /*
+ * Runtime budget in nsecs. This is usually set through
+ * scx_bpf_dispatch() but can also be modified directly by the BPF
+ * scheduler. Automatically decreased by SCX as the task executes. On
+ * depletion, a scheduling event is triggered.
+ */
+ u64 slice;
+
+ /* cold fields */
+ /* must be the last field, see init_scx_entity() */
+ struct list_head tasks_node;
+};
+
+void sched_ext_free(struct task_struct *p);
+
#else /* !CONFIG_SCHED_CLASS_EXT */

static inline void sched_ext_free(struct task_struct *p) {}
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 3bac0a8ceab2..359a14cc76a4 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -118,6 +118,7 @@ struct clone_args {
/* SCHED_ISO: reserved but not implemented yet */
#define SCHED_IDLE 5
#define SCHED_DEADLINE 6
+#define SCHED_EXT 7

/* Can be ORed in to make sure the process is reverted back to SCHED_NORMAL on fork */
#define SCHED_RESET_ON_FORK 0x40000000
diff --git a/init/init_task.c b/init/init_task.c
index 4daee6d761c8..ef349ebe52f8 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -6,6 +6,7 @@
#include <linux/sched/sysctl.h>
#include <linux/sched/rt.h>
#include <linux/sched/task.h>
+#include <linux/sched/ext.h>
#include <linux/init.h>
#include <linux/fs.h>
#include <linux/mm.h>
@@ -97,6 +98,16 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
#endif
#ifdef CONFIG_CGROUP_SCHED
.sched_task_group = &root_task_group,
+#endif
+#ifdef CONFIG_SCHED_CLASS_EXT
+ .scx = {
+ .dsq_node = LIST_HEAD_INIT(init_task.scx.dsq_node),
+ .sticky_cpu = -1,
+ .holding_cpu = -1,
+ .runnable_node = LIST_HEAD_INIT(init_task.scx.runnable_node),
+ .ddsp_dsq_id = SCX_DSQ_INVALID,
+ .slice = SCX_SLICE_DFL,
+ },
#endif
.ptraced = LIST_HEAD_INIT(init_task.ptraced),
.ptrace_entry = LIST_HEAD_INIT(init_task.ptrace_entry),
diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index c2f1fd95a821..0afcda19bc50 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -133,4 +133,24 @@ config SCHED_CORE
which is the likely usage by Linux distributions, there should
be no measurable impact on performance.

-
+config SCHED_CLASS_EXT
+ bool "Extensible Scheduling Class"
+ depends on BPF_SYSCALL && BPF_JIT && !SCHED_CORE
+ help
+ This option enables a new scheduler class sched_ext (SCX), which
+ allows scheduling policies to be implemented as BPF programs to
+ achieve the following:
+
+ - Ease of experimentation and exploration: Enabling rapid
+ iteration of new scheduling policies.
+ - Customization: Building application-specific schedulers which
+ implement policies that are not applicable to general-purpose
+ schedulers.
+ - Rapid scheduler deployments: Non-disruptive swap outs of
+ scheduling policies in production environments.
+
+ sched_ext leverages BPF’s struct_ops feature to define a structure
+ which exports function callbacks and flags to BPF programs that
+ wish to implement scheduling policies. The struct_ops structure
+ exported by sched_ext is struct sched_ext_ops, and is conceptually
+ similar to struct sched_class.
diff --git a/kernel/sched/build_policy.c b/kernel/sched/build_policy.c
index d9dc9ab3773f..2a2f10367ceb 100644
--- a/kernel/sched/build_policy.c
+++ b/kernel/sched/build_policy.c
@@ -21,13 +21,17 @@

#include <linux/cpuidle.h>
#include <linux/jiffies.h>
+#include <linux/kobject.h>
#include <linux/livepatch.h>
+#include <linux/pm.h>
#include <linux/psi.h>
+#include <linux/seq_buf.h>
#include <linux/seqlock_api.h>
#include <linux/slab.h>
#include <linux/suspend.h>
#include <linux/tsacct_kern.h>
#include <linux/vtime.h>
+#include <linux/percpu-rwsem.h>

#include <uapi/linux/sched/types.h>

@@ -52,3 +56,6 @@
#include "cputime.c"
#include "deadline.c"

+#ifdef CONFIG_SCHED_CLASS_EXT
+# include "ext.c"
+#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0231905ec827..ef8fa67c58de 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1259,7 +1259,7 @@ bool sched_can_stop_tick(struct rq *rq)
* if there's more than one we need the tick for involuntary
* preemption.
*/
- if (rq->nr_running > 1)
+ if (!scx_switched_all() && rq->nr_running > 1)
return false;

/*
@@ -3997,6 +3997,15 @@ bool cpus_share_resources(int this_cpu, int that_cpu)

static inline bool ttwu_queue_cond(struct task_struct *p, int cpu)
{
+ /*
+ * The BPF scheduler may depend on select_task_rq() being invoked during
+ * wakeups. In addition, @p may end up executing on a different CPU
+ * regardless of what happens in the wakeup path making the ttwu_queue
+ * optimization less meaningful. Skip if on SCX.
+ */
+ if (task_on_scx(p))
+ return false;
+
/*
* Do not complicate things with the async wake_list while the CPU is
* in hotplug state.
@@ -4564,6 +4573,10 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->rt.on_rq = 0;
p->rt.on_list = 0;

+#ifdef CONFIG_SCHED_CLASS_EXT
+ init_scx_entity(&p->scx);
+#endif
+
#ifdef CONFIG_PREEMPT_NOTIFIERS
INIT_HLIST_HEAD(&p->preempt_notifiers);
#endif
@@ -4812,6 +4825,10 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
goto out_cancel;
} else if (rt_prio(p->prio)) {
p->sched_class = &rt_sched_class;
+#ifdef CONFIG_SCHED_CLASS_EXT
+ } else if (task_should_scx(p)) {
+ p->sched_class = &ext_sched_class;
+#endif
} else {
p->sched_class = &fair_sched_class;
}
@@ -5728,8 +5745,10 @@ void scheduler_tick(void)
wq_worker_tick(curr);

#ifdef CONFIG_SMP
- rq->idle_balance = idle_cpu(cpu);
- trigger_load_balance(rq);
+ if (!scx_switched_all()) {
+ rq->idle_balance = idle_cpu(cpu);
+ trigger_load_balance(rq);
+ }
#endif
}

@@ -7119,6 +7138,10 @@ void __setscheduler_prio(struct task_struct *p, int prio)
p->sched_class = &dl_sched_class;
else if (rt_prio(prio))
p->sched_class = &rt_sched_class;
+#ifdef CONFIG_SCHED_CLASS_EXT
+ else if (task_should_scx(p))
+ p->sched_class = &ext_sched_class;
+#endif
else
p->sched_class = &fair_sched_class;

@@ -9120,6 +9143,7 @@ SYSCALL_DEFINE1(sched_get_priority_max, int, policy)
case SCHED_NORMAL:
case SCHED_BATCH:
case SCHED_IDLE:
+ case SCHED_EXT:
ret = 0;
break;
}
@@ -9147,6 +9171,7 @@ SYSCALL_DEFINE1(sched_get_priority_min, int, policy)
case SCHED_NORMAL:
case SCHED_BATCH:
case SCHED_IDLE:
+ case SCHED_EXT:
ret = 0;
}
return ret;
@@ -9983,6 +10008,10 @@ void __init sched_init(void)
BUG_ON(!sched_class_above(&dl_sched_class, &rt_sched_class));
BUG_ON(!sched_class_above(&rt_sched_class, &fair_sched_class));
BUG_ON(!sched_class_above(&fair_sched_class, &idle_sched_class));
+#ifdef CONFIG_SCHED_CLASS_EXT
+ BUG_ON(!sched_class_above(&fair_sched_class, &ext_sched_class));
+ BUG_ON(!sched_class_above(&ext_sched_class, &idle_sched_class));
+#endif

wait_bit_init();

@@ -12112,3 +12141,38 @@ void sched_mm_cid_fork(struct task_struct *t)
t->mm_cid_active = 1;
}
#endif
+
+#ifdef CONFIG_SCHED_CLASS_EXT
+void sched_deq_and_put_task(struct task_struct *p, int queue_flags,
+ struct sched_enq_and_set_ctx *ctx)
+{
+ struct rq *rq = task_rq(p);
+
+ lockdep_assert_rq_held(rq);
+
+ *ctx = (struct sched_enq_and_set_ctx){
+ .p = p,
+ .queue_flags = queue_flags,
+ .queued = task_on_rq_queued(p),
+ .running = task_current(rq, p),
+ };
+
+ update_rq_clock(rq);
+ if (ctx->queued)
+ dequeue_task(rq, p, queue_flags | DEQUEUE_NOCLOCK);
+ if (ctx->running)
+ put_prev_task(rq, p);
+}
+
+void sched_enq_and_set_task(struct sched_enq_and_set_ctx *ctx)
+{
+ struct rq *rq = task_rq(ctx->p);
+
+ lockdep_assert_rq_held(rq);
+
+ if (ctx->queued)
+ enqueue_task(rq, ctx->p, ctx->queue_flags | ENQUEUE_NOCLOCK);
+ if (ctx->running)
+ set_next_task(rq, ctx->p);
+}
+#endif /* CONFIG_SCHED_CLASS_EXT */
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 8d5d98a5834d..6f306e1c9c3e 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1089,6 +1089,9 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
P(dl.runtime);
P(dl.deadline);
}
+#ifdef CONFIG_SCHED_CLASS_EXT
+ __PS("ext.enabled", task_on_scx(p));
+#endif
#undef PN_SCHEDSTAT
#undef P_SCHEDSTAT

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
new file mode 100644
index 000000000000..d4f52209111f
--- /dev/null
+++ b/kernel/sched/ext.c
@@ -0,0 +1,4238 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2022 Tejun Heo <[email protected]>
+ * Copyright (c) 2022 David Vernet <[email protected]>
+ */
+#define SCX_OP_IDX(op) (offsetof(struct sched_ext_ops, op) / sizeof(void (*)(void)))
+
+enum scx_consts {
+ SCX_DSP_DFL_MAX_BATCH = 32,
+
+ SCX_EXIT_BT_LEN = 64,
+ SCX_EXIT_MSG_LEN = 1024,
+};
+
+enum scx_exit_kind {
+ SCX_EXIT_NONE,
+ SCX_EXIT_DONE,
+
+ SCX_EXIT_UNREG = 64, /* user-space initiated unregistration */
+ SCX_EXIT_UNREG_BPF, /* BPF-initiated unregistration */
+ SCX_EXIT_UNREG_KERN, /* kernel-initiated unregistration */
+
+ SCX_EXIT_ERROR = 1024, /* runtime error, error msg contains details */
+ SCX_EXIT_ERROR_BPF, /* ERROR but triggered through scx_bpf_error() */
+};
+
+/*
+ * scx_exit_info is passed to ops.exit() to describe why the BPF scheduler is
+ * being disabled.
+ */
+struct scx_exit_info {
+ /* %SCX_EXIT_* - broad category of the exit reason */
+ enum scx_exit_kind kind;
+
+ /* exit code if gracefully exiting */
+ s64 exit_code;
+
+ /* textual representation of the above */
+ const char *reason;
+
+ /* backtrace if exiting due to an error */
+ unsigned long *bt;
+ u32 bt_len;
+
+ /* informational message */
+ char *msg;
+};
+
+/* sched_ext_ops.flags */
+enum scx_ops_flags {
+ /*
+ * Keep built-in idle tracking even if ops.update_idle() is implemented.
+ */
+ SCX_OPS_KEEP_BUILTIN_IDLE = 1LLU << 0,
+
+ /*
+ * By default, if there are no other tasks to run on the CPU, ext core
+ * keeps running the current task even after its slice expires. If this
+ * flag is specified, such tasks are passed to ops.enqueue() with
+ * %SCX_ENQ_LAST. See the comment above %SCX_ENQ_LAST for more info.
+ */
+ SCX_OPS_ENQ_LAST = 1LLU << 1,
+
+ /*
+ * An exiting task may schedule after PF_EXITING is set. In such cases,
+ * bpf_task_from_pid() may not be able to find the task and if the BPF
+ * scheduler depends on pid lookup for dispatching, the task will be
+ * lost leading to various issues including RCU grace period stalls.
+ *
+ * To mask this problem, by default, unhashed tasks are automatically
+ * dispatched to the local DSQ on enqueue. If the BPF scheduler doesn't
+ * depend on pid lookups and wants to handle these tasks directly, the
+ * following flag can be used.
+ */
+ SCX_OPS_ENQ_EXITING = 1LLU << 2,
+
+ /*
+ * If set, only tasks with policy set to SCHED_EXT are attached to
+ * sched_ext. If clear, SCHED_NORMAL tasks are also included.
+ */
+ SCX_OPS_SWITCH_PARTIAL = 1LLU << 3,
+
+ SCX_OPS_ALL_FLAGS = SCX_OPS_KEEP_BUILTIN_IDLE |
+ SCX_OPS_ENQ_LAST |
+ SCX_OPS_ENQ_EXITING |
+ SCX_OPS_SWITCH_PARTIAL,
+};
+
+/* argument container for ops.init_task() */
+struct scx_init_task_args {
+ /*
+ * Set if ops.init_task() is being invoked on the fork path, as opposed
+ * to the scheduler transition path.
+ */
+ bool fork;
+};
+
+/* argument container for ops.exit_task() */
+struct scx_exit_task_args {
+ /* Whether the task exited before running on sched_ext. */
+ bool cancelled;
+};
+
+/**
+ * struct sched_ext_ops - Operation table for BPF scheduler implementation
+ *
+ * Userland can implement an arbitrary scheduling policy by implementing and
+ * loading operations in this table.
+ */
+struct sched_ext_ops {
+ /**
+ * select_cpu - Pick the target CPU for a task which is being woken up
+ * @p: task being woken up
+ * @prev_cpu: the cpu @p was on before sleeping
+ * @wake_flags: SCX_WAKE_*
+ *
+ * Decision made here isn't final. @p may be moved to any CPU while it
+ * is getting dispatched for execution later. However, as @p is not on
+ * the rq at this point, getting the eventual execution CPU right here
+ * saves a small bit of overhead down the line.
+ *
+ * If an idle CPU is returned, the CPU is kicked and will try to
+ * dispatch. While an explicit custom mechanism can be added,
+ * select_cpu() serves as the default way to wake up idle CPUs.
+ *
+ * @p may be dispatched directly by calling scx_bpf_dispatch(). If @p
+ * is dispatched, the ops.enqueue() callback will be skipped. Finally,
+ * if @p is dispatched to SCX_DSQ_LOCAL, it will be dispatched to the
+ * local DSQ of whatever CPU is returned by this callback.
+ */
+ s32 (*select_cpu)(struct task_struct *p, s32 prev_cpu, u64 wake_flags);
+
+ /**
+ * enqueue - Enqueue a task on the BPF scheduler
+ * @p: task being enqueued
+ * @enq_flags: %SCX_ENQ_*
+ *
+ * @p is ready to run. Dispatch directly by calling scx_bpf_dispatch()
+ * or enqueue on the BPF scheduler. If not directly dispatched, the bpf
+ * scheduler owns @p and if it fails to dispatch @p, the task will
+ * stall.
+ *
+ * If @p was dispatched from ops.select_cpu(), this callback is
+ * skipped.
+ */
+ void (*enqueue)(struct task_struct *p, u64 enq_flags);
+
+ /**
+ * dequeue - Remove a task from the BPF scheduler
+ * @p: task being dequeued
+ * @deq_flags: %SCX_DEQ_*
+ *
+ * Remove @p from the BPF scheduler. This is usually called to isolate
+ * the task while updating its scheduling properties (e.g. priority).
+ *
+ * The ext core keeps track of whether the BPF side owns a given task or
+ * not and can gracefully ignore spurious dispatches from BPF side,
+ * which makes it safe to not implement this method. However, depending
+ * on the scheduling logic, this can lead to confusing behaviors - e.g.
+ * scheduling position not being updated across a priority change.
+ */
+ void (*dequeue)(struct task_struct *p, u64 deq_flags);
+
+ /**
+ * dispatch - Dispatch tasks from the BPF scheduler and/or consume DSQs
+ * @cpu: CPU to dispatch tasks for
+ * @prev: previous task being switched out
+ *
+ * Called when a CPU's local dsq is empty. The operation should dispatch
+ * one or more tasks from the BPF scheduler into the DSQs using
+ * scx_bpf_dispatch() and/or consume user DSQs into the local DSQ using
+ * scx_bpf_consume().
+ *
+ * The maximum number of times scx_bpf_dispatch() can be called without
+ * an intervening scx_bpf_consume() is specified by
+ * ops.dispatch_max_batch. See the comments on top of the two functions
+ * for more details.
+ *
+ * When not %NULL, @prev is an SCX task with its slice depleted. If
+ * @prev is still runnable as indicated by set %SCX_TASK_QUEUED in
+ * @prev->scx.flags, it is not enqueued yet and will be enqueued after
+ * ops.dispatch() returns. To keep executing @prev, return without
+ * dispatching or consuming any tasks. Also see %SCX_OPS_ENQ_LAST.
+ */
+ void (*dispatch)(s32 cpu, struct task_struct *prev);
+
+ /**
+ * tick - Periodic tick
+ * @p: task running currently
+ *
+ * This operation is called every 1/HZ seconds on CPUs which are
+ * executing an SCX task. Setting @p->scx.slice to 0 will trigger an
+ * immediate dispatch cycle on the CPU.
+ */
+ void (*tick)(struct task_struct *p);
+
+ /**
+ * yield - Yield CPU
+ * @from: yielding task
+ * @to: optional yield target task
+ *
+ * If @to is NULL, @from is yielding the CPU to other runnable tasks.
+ * The BPF scheduler should ensure that other available tasks are
+ * dispatched before the yielding task. Return value is ignored in this
+ * case.
+ *
+ * If @to is non-NULL, @from wants to yield the CPU to @to. If the BPF
+ * scheduler can implement the request, return %true; otherwise, %false.
+ */
+ bool (*yield)(struct task_struct *from, struct task_struct *to);
+
+ /**
+ * set_weight - Set task weight
+ * @p: task to set weight for
+ * @weight: new weight [1..10000]
+ *
+ * Update @p's weight to @weight.
+ */
+ void (*set_weight)(struct task_struct *p, u32 weight);
+
+ /**
+ * set_cpumask - Set CPU affinity
+ * @p: task to set CPU affinity for
+ * @cpumask: cpumask of cpus that @p can run on
+ *
+ * Update @p's CPU affinity to @cpumask.
+ */
+ void (*set_cpumask)(struct task_struct *p,
+ const struct cpumask *cpumask);
+
+ /**
+ * update_idle - Update the idle state of a CPU
+ * @cpu: CPU to update the idle state for
+ * @idle: whether entering or exiting the idle state
+ *
+ * This operation is called when @rq's CPU enters or leaves the idle
+ * state. By default, implementing this operation disables the built-in
+ * idle CPU tracking and the following helpers become unavailable:
+ *
+ * - scx_bpf_select_cpu_dfl()
+ * - scx_bpf_test_and_clear_cpu_idle()
+ * - scx_bpf_pick_idle_cpu()
+ *
+ * The user must also implement ops.select_cpu() as the default
+ * implementation relies on scx_bpf_select_cpu_dfl().
+ *
+ * Specify the %SCX_OPS_KEEP_BUILTIN_IDLE flag to keep the built-in idle
+ * tracking.
+ */
+ void (*update_idle)(s32 cpu, bool idle);
+
+ /**
+ * init_task - Initialize a task to run in a BPF scheduler
+ * @p: task to initialize for BPF scheduling
+ * @args: init arguments, see the struct definition
+ *
+ * Either we're loading a BPF scheduler or a new task is being forked.
+ * Initialize @p for BPF scheduling. This operation may block and can
+ * be used for allocations, and is called exactly once for a task.
+ *
+ * Return 0 for success, -errno for failure. An error return while
+ * loading will abort loading of the BPF scheduler. During a fork, it
+ * will abort that specific fork.
+ */
+ s32 (*init_task)(struct task_struct *p, struct scx_init_task_args *args);
+
+ /**
+ * exit_task - Exit a previously-running task from the system
+ * @p: task to exit
+ *
+ * @p is exiting or the BPF scheduler is being unloaded. Perform any
+ * necessary cleanup for @p.
+ */
+ void (*exit_task)(struct task_struct *p, struct scx_exit_task_args *args);
+
+ /**
+ * enable - Enable BPF scheduling for a task
+ * @p: task to enable BPF scheduling for
+ *
+ * Enable @p for BPF scheduling. enable() is called on @p any time it
+ * enters SCX, and is always paired with a matching disable().
+ */
+ void (*enable)(struct task_struct *p);
+
+ /**
+ * disable - Disable BPF scheduling for a task
+ * @p: task to disable BPF scheduling for
+ *
+ * @p is exiting, leaving SCX or the BPF scheduler is being unloaded.
+ * Disable BPF scheduling for @p. A disable() call is always matched
+ * with a prior enable() call.
+ */
+ void (*disable)(struct task_struct *p);
+
+ /*
+ * All online ops must come before ops.init().
+ */
+
+ /**
+ * init - Initialize the BPF scheduler
+ */
+ s32 (*init)(void);
+
+ /**
+ * exit - Clean up after the BPF scheduler
+ * @info: Exit info
+ */
+ void (*exit)(struct scx_exit_info *info);
+
+ /**
+ * dispatch_max_batch - Max nr of tasks that dispatch() can dispatch
+ */
+ u32 dispatch_max_batch;
+
+ /**
+ * flags - %SCX_OPS_* flags
+ */
+ u64 flags;
+
+ /**
+ * name - BPF scheduler's name
+ *
+ * Must be a non-zero valid BPF object name including only isalnum(),
+ * '_' and '.' chars. Shows up in kernel.sched_ext_ops sysctl while the
+ * BPF scheduler is enabled.
+ */
+ char name[SCX_OPS_NAME_LEN];
+};
+
+enum scx_opi {
+ SCX_OPI_BEGIN = 0,
+ SCX_OPI_NORMAL_BEGIN = 0,
+ SCX_OPI_NORMAL_END = SCX_OP_IDX(init),
+ SCX_OPI_END = SCX_OP_IDX(init),
+};
+
+enum scx_wake_flags {
+ /* expose select WF_* flags as enums */
+ SCX_WAKE_FORK = WF_FORK,
+ SCX_WAKE_TTWU = WF_TTWU,
+ SCX_WAKE_SYNC = WF_SYNC,
+};
+
+enum scx_enq_flags {
+ /* expose select ENQUEUE_* flags as enums */
+ SCX_ENQ_WAKEUP = ENQUEUE_WAKEUP,
+ SCX_ENQ_HEAD = ENQUEUE_HEAD,
+
+ /* high 32bits are SCX specific */
+
+ /*
+ * The task being enqueued is the only task available for the cpu. By
+ * default, ext core keeps executing such tasks but when
+ * %SCX_OPS_ENQ_LAST is specified, they're ops.enqueue()'d with the
+ * %SCX_ENQ_LAST flag set.
+ *
+ * If the BPF scheduler wants to continue executing the task,
+ * ops.enqueue() should dispatch the task to %SCX_DSQ_LOCAL immediately.
+ * If the task gets queued on a different dsq or the BPF side, the BPF
+ * scheduler is responsible for triggering a follow-up scheduling event.
+ * Otherwise, execution may stall.
+ */
+ SCX_ENQ_LAST = 1LLU << 41,
+
+ /* high 8 bits are internal */
+ __SCX_ENQ_INTERNAL_MASK = 0xffLLU << 56,
+
+ SCX_ENQ_CLEAR_OPSS = 1LLU << 56,
+};
+
+enum scx_deq_flags {
+ /* expose select DEQUEUE_* flags as enums */
+ SCX_DEQ_SLEEP = DEQUEUE_SLEEP,
+};
+
+enum scx_pick_idle_cpu_flags {
+ SCX_PICK_IDLE_CORE = 1LLU << 0, /* pick a CPU whose SMT siblings are also idle */
+};
+
+enum scx_ops_enable_state {
+ SCX_OPS_PREPPING,
+ SCX_OPS_ENABLING,
+ SCX_OPS_ENABLED,
+ SCX_OPS_DISABLING,
+ SCX_OPS_DISABLED,
+};
+
+static const char *scx_ops_enable_state_str[] = {
+ [SCX_OPS_PREPPING] = "prepping",
+ [SCX_OPS_ENABLING] = "enabling",
+ [SCX_OPS_ENABLED] = "enabled",
+ [SCX_OPS_DISABLING] = "disabling",
+ [SCX_OPS_DISABLED] = "disabled",
+};
+
+/*
+ * sched_ext_entity->ops_state
+ *
+ * Used to track the task ownership between the SCX core and the BPF scheduler.
+ * State transitions look as follows:
+ *
+ * NONE -> QUEUEING -> QUEUED -> DISPATCHING
+ * ^ | |
+ * | v v
+ * \-------------------------------/
+ *
+ * QUEUEING and DISPATCHING states can be waited upon. See wait_ops_state() call
+ * sites for explanations on the conditions being waited upon and why they are
+ * safe. Transitions out of them into NONE or QUEUED must store_release and the
+ * waiters should load_acquire.
+ *
+ * Tracking scx_ops_state enables sched_ext core to reliably determine whether
+ * any given task can be dispatched by the BPF scheduler at all times and thus
+ * relaxes the requirements on the BPF scheduler. This allows the BPF scheduler
+ * to try to dispatch any task anytime regardless of its state as the SCX core
+ * can safely reject invalid dispatches.
+ */
+enum scx_ops_state {
+ SCX_OPSS_NONE, /* owned by the SCX core */
+ SCX_OPSS_QUEUEING, /* in transit to the BPF scheduler */
+ SCX_OPSS_QUEUED, /* owned by the BPF scheduler */
+ SCX_OPSS_DISPATCHING, /* in transit back to the SCX core */
+
+ /*
+ * QSEQ brands each QUEUED instance so that, when dispatch races
+ * dequeue/requeue, the dispatcher can tell whether it still has a claim
+ * on the task being dispatched.
+ *
+ * As some 32bit archs can't do 64bit store_release/load_acquire,
+ * p->scx.ops_state is atomic_long_t which leaves 30 bits for QSEQ on
+ * 32bit machines. The dispatch race window QSEQ protects is very narrow
+ * and runs with IRQ disabled. 30 bits should be sufficient.
+ */
+ SCX_OPSS_QSEQ_SHIFT = 2,
+};
+
+/* Use macros to ensure that the type is unsigned long for the masks */
+#define SCX_OPSS_STATE_MASK ((1LU << SCX_OPSS_QSEQ_SHIFT) - 1)
+#define SCX_OPSS_QSEQ_MASK (~SCX_OPSS_STATE_MASK)
+
+/*
+ * During exit, a task may schedule after losing its PIDs. When disabling the
+ * BPF scheduler, we need to be able to iterate tasks in every state to
+ * guarantee system safety. Maintain a dedicated task list which contains every
+ * task between its fork and eventual free.
+ */
+static DEFINE_SPINLOCK(scx_tasks_lock);
+static LIST_HEAD(scx_tasks);
+
+/* ops enable/disable */
+static struct kthread_worker *scx_ops_helper;
+static DEFINE_MUTEX(scx_ops_enable_mutex);
+DEFINE_STATIC_KEY_FALSE(__scx_ops_enabled);
+DEFINE_STATIC_PERCPU_RWSEM(scx_fork_rwsem);
+static atomic_t scx_ops_enable_state_var = ATOMIC_INIT(SCX_OPS_DISABLED);
+static atomic_t scx_ops_bypass_depth = ATOMIC_INIT(0);
+static bool scx_switching_all;
+DEFINE_STATIC_KEY_FALSE(__scx_switched_all);
+
+static struct sched_ext_ops scx_ops;
+static bool scx_warned_zero_slice;
+
+static DEFINE_STATIC_KEY_FALSE(scx_ops_enq_last);
+static DEFINE_STATIC_KEY_FALSE(scx_ops_enq_exiting);
+static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_enabled);
+
+struct static_key_false scx_has_op[SCX_OPI_END] =
+ { [0 ... SCX_OPI_END-1] = STATIC_KEY_FALSE_INIT };
+
+static atomic_t scx_exit_kind = ATOMIC_INIT(SCX_EXIT_DONE);
+static struct scx_exit_info *scx_exit_info;
+
+/* idle tracking */
+#ifdef CONFIG_SMP
+#ifdef CONFIG_CPUMASK_OFFSTACK
+#define CL_ALIGNED_IF_ONSTACK
+#else
+#define CL_ALIGNED_IF_ONSTACK __cacheline_aligned_in_smp
+#endif
+
+static struct {
+ cpumask_var_t cpu;
+ cpumask_var_t smt;
+} idle_masks CL_ALIGNED_IF_ONSTACK;
+
+#endif /* CONFIG_SMP */
+
+/*
+ * Direct dispatch marker.
+ *
+ * Non-NULL values are used for direct dispatch from enqueue path. A valid
+ * pointer points to the task currently being enqueued. An ERR_PTR value is used
+ * to indicate that direct dispatch has already happened.
+ */
+static DEFINE_PER_CPU(struct task_struct *, direct_dispatch_task);
+
+/* dispatch queues */
+static struct scx_dispatch_q __cacheline_aligned_in_smp scx_dsq_global;
+
+static const struct rhashtable_params dsq_hash_params = {
+ .key_len = 8,
+ .key_offset = offsetof(struct scx_dispatch_q, id),
+ .head_offset = offsetof(struct scx_dispatch_q, hash_node),
+};
+
+static struct rhashtable dsq_hash;
+static LLIST_HEAD(dsqs_to_free);
+
+/* dispatch buf */
+struct scx_dsp_buf_ent {
+ struct task_struct *task;
+ unsigned long qseq;
+ u64 dsq_id;
+ u64 enq_flags;
+};
+
+static u32 scx_dsp_max_batch;
+static struct scx_dsp_buf_ent __percpu *scx_dsp_buf;
+
+struct scx_dsp_ctx {
+ struct rq *rq;
+ struct rq_flags *rf;
+ u32 buf_cursor;
+ u32 nr_tasks;
+};
+
+static DEFINE_PER_CPU(struct scx_dsp_ctx, scx_dsp_ctx);
+
+/* /sys/kernel/sched_ext interface */
+static struct kset *scx_kset;
+static struct kobject *scx_root_kobj;
+
+static __printf(3, 4) void scx_ops_exit_kind(enum scx_exit_kind kind,
+ s64 exit_code,
+ const char *fmt, ...);
+
+#define scx_ops_error_kind(err, fmt, args...) \
+ scx_ops_exit_kind((err), 0, fmt, ##args)
+
+#define scx_ops_exit(code, fmt, args...) \
+ scx_ops_exit_kind(SCX_EXIT_UNREG_KERN, (code), fmt, ##args)
+
+#define scx_ops_error(fmt, args...) \
+ scx_ops_error_kind(SCX_EXIT_ERROR, fmt, ##args)
+
+#define SCX_HAS_OP(op) static_branch_likely(&scx_has_op[SCX_OP_IDX(op)])
+
+/* if the highest set bit is N, return a mask with bits [N+1, 31] set */
+static u32 higher_bits(u32 flags)
+{
+ return ~((1 << fls(flags)) - 1);
+}
+
+/* return the mask with only the highest bit set */
+static u32 highest_bit(u32 flags)
+{
+ int bit = fls(flags);
+ return bit ? 1 << (bit - 1) : 0;
+}
+
+/*
+ * scx_kf_mask enforcement. Some kfuncs can only be called from specific SCX
+ * ops. When invoking SCX ops, SCX_CALL_OP[_RET]() should be used to indicate
+ * the allowed kfuncs and those kfuncs should use scx_kf_allowed() to check
+ * whether it's running from an allowed context.
+ *
+ * @mask is constant, always inline to cull the mask calculations.
+ */
+static __always_inline void scx_kf_allow(u32 mask)
+{
+ /* nesting is allowed only in increasing scx_kf_mask order */
+ WARN_ONCE((mask | higher_bits(mask)) & current->scx.kf_mask,
+ "invalid nesting current->scx.kf_mask=0x%x mask=0x%x\n",
+ current->scx.kf_mask, mask);
+ current->scx.kf_mask |= mask;
+ barrier();
+}
+
+static void scx_kf_disallow(u32 mask)
+{
+ barrier();
+ current->scx.kf_mask &= ~mask;
+}
+
+#define SCX_CALL_OP(mask, op, args...) \
+do { \
+ if (mask) { \
+ scx_kf_allow(mask); \
+ scx_ops.op(args); \
+ scx_kf_disallow(mask); \
+ } else { \
+ scx_ops.op(args); \
+ } \
+} while (0)
+
+#define SCX_CALL_OP_RET(mask, op, args...) \
+({ \
+ __typeof__(scx_ops.op(args)) __ret; \
+ if (mask) { \
+ scx_kf_allow(mask); \
+ __ret = scx_ops.op(args); \
+ scx_kf_disallow(mask); \
+ } else { \
+ __ret = scx_ops.op(args); \
+ } \
+ __ret; \
+})
+
+/* @mask is constant, always inline to cull unnecessary branches */
+static __always_inline bool scx_kf_allowed(u32 mask)
+{
+ if (unlikely(!(current->scx.kf_mask & mask))) {
+ scx_ops_error("kfunc with mask 0x%x called from an operation only allowing 0x%x",
+ mask, current->scx.kf_mask);
+ return false;
+ }
+
+ if (unlikely((mask & SCX_KF_SLEEPABLE) && in_interrupt())) {
+ scx_ops_error("sleepable kfunc called from non-sleepable context");
+ return false;
+ }
+
+ /*
+ * Enforce nesting boundaries. e.g. A kfunc which can be called from
+ * DISPATCH must not be called if we're running DEQUEUE which is nested
+ * inside ops.dispatch(). We don't need to check the SCX_KF_SLEEPABLE
+ * boundary thanks to the above in_interrupt() check.
+ */
+ if (unlikely(highest_bit(mask) == SCX_KF_DISPATCH &&
+ (current->scx.kf_mask & higher_bits(SCX_KF_DISPATCH)))) {
+ scx_ops_error("dispatch kfunc called from a nested operation");
+ return false;
+ }
+
+ return true;
+}
+
+
+/*
+ * SCX task iterator.
+ */
+struct scx_task_iter {
+ struct sched_ext_entity cursor;
+ struct task_struct *locked;
+ struct rq *rq;
+ struct rq_flags rf;
+};
+
+/**
+ * scx_task_iter_init - Initialize a task iterator
+ * @iter: iterator to init
+ *
+ * Initialize @iter. Must be called with scx_tasks_lock held. Once initialized,
+ * @iter must eventually be exited with scx_task_iter_exit().
+ *
+ * scx_tasks_lock may be released between this and the first next() call or
+ * between any two next() calls. If scx_tasks_lock is released between two
+ * next() calls, the caller is responsible for ensuring that the task being
+ * iterated remains accessible either through RCU read lock or obtaining a
+ * reference count.
+ *
+ * All tasks which existed when the iteration started are guaranteed to be
+ * visited as long as they still exist.
+ */
+static void scx_task_iter_init(struct scx_task_iter *iter)
+{
+ lockdep_assert_held(&scx_tasks_lock);
+
+ iter->cursor = (struct sched_ext_entity){ .flags = SCX_TASK_CURSOR };
+ list_add(&iter->cursor.tasks_node, &scx_tasks);
+ iter->locked = NULL;
+}
+
+/**
+ * scx_task_iter_exit - Exit a task iterator
+ * @iter: iterator to exit
+ *
+ * Exit a previously initialized @iter. Must be called with scx_tasks_lock held.
+ * If the iterator holds a task's rq lock, that rq lock is released. See
+ * scx_task_iter_init() for details.
+ */
+static void scx_task_iter_exit(struct scx_task_iter *iter)
+{
+ struct list_head *cursor = &iter->cursor.tasks_node;
+
+ lockdep_assert_held(&scx_tasks_lock);
+
+ if (iter->locked) {
+ task_rq_unlock(iter->rq, iter->locked, &iter->rf);
+ iter->locked = NULL;
+ }
+
+ if (list_empty(cursor))
+ return;
+
+ list_del_init(cursor);
+}
+
+/**
+ * scx_task_iter_next - Next task
+ * @iter: iterator to walk
+ *
+ * Visit the next task. See scx_task_iter_init() for details.
+ */
+static struct task_struct *scx_task_iter_next(struct scx_task_iter *iter)
+{
+ struct list_head *cursor = &iter->cursor.tasks_node;
+ struct sched_ext_entity *pos;
+
+ lockdep_assert_held(&scx_tasks_lock);
+
+ list_for_each_entry(pos, cursor, tasks_node) {
+ if (&pos->tasks_node == &scx_tasks)
+ return NULL;
+ if (!(pos->flags & SCX_TASK_CURSOR)) {
+ list_move(cursor, &pos->tasks_node);
+ return container_of(pos, struct task_struct, scx);
+ }
+ }
+
+ /* can't happen, should always terminate at scx_tasks above */
+ BUG();
+}
+
+/**
+ * scx_task_iter_next_filtered - Next non-idle task
+ * @iter: iterator to walk
+ *
+ * Visit the next non-idle task. See scx_task_iter_init() for details.
+ */
+static struct task_struct *
+scx_task_iter_next_filtered(struct scx_task_iter *iter)
+{
+ struct task_struct *p;
+
+ while ((p = scx_task_iter_next(iter))) {
+ /*
+ * is_idle_task() tests %PF_IDLE which may not be set for CPUs
+ * which haven't yet been onlined. Test sched_class directly.
+ */
+ if (p->sched_class != &idle_sched_class)
+ return p;
+ }
+ return NULL;
+}
+
+/**
+ * scx_task_iter_next_filtered_locked - Next non-idle task with its rq locked
+ * @iter: iterator to walk
+ *
+ * Visit the next non-idle task with its rq lock held. See scx_task_iter_init()
+ * for details.
+ */
+static struct task_struct *
+scx_task_iter_next_filtered_locked(struct scx_task_iter *iter)
+{
+ struct task_struct *p;
+
+ if (iter->locked) {
+ task_rq_unlock(iter->rq, iter->locked, &iter->rf);
+ iter->locked = NULL;
+ }
+
+ p = scx_task_iter_next_filtered(iter);
+ if (!p)
+ return NULL;
+
+ iter->rq = task_rq_lock(p, &iter->rf);
+ iter->locked = p;
+ return p;
+}
+
+static enum scx_ops_enable_state scx_ops_enable_state(void)
+{
+ return atomic_read(&scx_ops_enable_state_var);
+}
+
+static enum scx_ops_enable_state
+scx_ops_set_enable_state(enum scx_ops_enable_state to)
+{
+ return atomic_xchg(&scx_ops_enable_state_var, to);
+}
+
+static bool scx_ops_tryset_enable_state(enum scx_ops_enable_state to,
+ enum scx_ops_enable_state from)
+{
+ int from_v = from;
+
+ return atomic_try_cmpxchg(&scx_ops_enable_state_var, &from_v, to);
+}
+
+static bool scx_ops_bypassing(void)
+{
+ return unlikely(atomic_read(&scx_ops_bypass_depth));
+}
+
+/**
+ * wait_ops_state - Busy-wait the specified ops state to end
+ * @p: target task
+ * @opss: state to wait the end of
+ *
+ * Busy-wait for @p to transition out of @opss. This can only be used when the
+ * state part of @opss is %SCX_QUEUEING or %SCX_DISPATCHING. This function also
+ * has load_acquire semantics to ensure that the caller can see the updates made
+ * in the enqueueing and dispatching paths.
+ */
+static void wait_ops_state(struct task_struct *p, unsigned long opss)
+{
+ do {
+ cpu_relax();
+ } while (atomic_long_read_acquire(&p->scx.ops_state) == opss);
+}
+
+/**
+ * ops_cpu_valid - Verify a cpu number
+ * @cpu: cpu number which came from a BPF ops
+ * @where: extra information reported on error
+ *
+ * @cpu is a cpu number which came from the BPF scheduler and can be any value.
+ * Verify that it is in range and one of the possible cpus. If invalid, trigger
+ * an ops error.
+ */
+static bool ops_cpu_valid(s32 cpu, const char *where)
+{
+ if (likely(cpu >= 0 && cpu < nr_cpu_ids && cpu_possible(cpu))) {
+ return true;
+ } else {
+ scx_ops_error("invalid CPU %d%s%s", cpu,
+ where ? " " : "", where ?: "");
+ return false;
+ }
+}
+
+/**
+ * ops_sanitize_err - Sanitize a -errno value
+ * @ops_name: operation to blame on failure
+ * @err: -errno value to sanitize
+ *
+ * Verify @err is a valid -errno. If not, trigger scx_ops_error() and return
+ * -%EPROTO. This is necessary because returning a rogue -errno up the chain can
+ * cause misbehaviors. For example, a large negative return from
+ * ops.init_task() triggers an oops when passed up the call chain because the
+ * value fails IS_ERR() test after being encoded with ERR_PTR() and then is
+ * handled as a pointer.
+ */
+static int ops_sanitize_err(const char *ops_name, s32 err)
+{
+ if (err < 0 && err >= -MAX_ERRNO)
+ return err;
+
+ scx_ops_error("ops.%s() returned an invalid errno %d", ops_name, err);
+ return -EPROTO;
+}
+
+static void update_curr_scx(struct rq *rq)
+{
+ struct task_struct *curr = rq->curr;
+ u64 now = rq_clock_task(rq);
+ u64 delta_exec;
+
+ if (time_before_eq64(now, curr->se.exec_start))
+ return;
+
+ delta_exec = now - curr->se.exec_start;
+ curr->se.exec_start = now;
+ curr->se.sum_exec_runtime += delta_exec;
+ account_group_exec_runtime(curr, delta_exec);
+ cgroup_account_cputime(curr, delta_exec);
+
+ curr->scx.slice -= min(curr->scx.slice, delta_exec);
+}
+
+static void dsq_mod_nr(struct scx_dispatch_q *dsq, s32 delta)
+{
+ /* scx_bpf_dsq_nr_queued() reads ->nr without locking, use WRITE_ONCE() */
+ WRITE_ONCE(dsq->nr, dsq->nr + delta);
+}
+
+static void dispatch_enqueue(struct scx_dispatch_q *dsq, struct task_struct *p,
+ u64 enq_flags)
+{
+ bool is_local = dsq->id == SCX_DSQ_LOCAL;
+
+ WARN_ON_ONCE(p->scx.dsq || !list_empty(&p->scx.dsq_node));
+
+ if (!is_local) {
+ raw_spin_lock(&dsq->lock);
+ if (unlikely(dsq->id == SCX_DSQ_INVALID)) {
+ scx_ops_error("attempting to dispatch to a destroyed dsq");
+ /* fall back to the global dsq */
+ raw_spin_unlock(&dsq->lock);
+ dsq = &scx_dsq_global;
+ raw_spin_lock(&dsq->lock);
+ }
+ }
+
+ if (enq_flags & SCX_ENQ_HEAD)
+ list_add(&p->scx.dsq_node, &dsq->list);
+ else
+ list_add_tail(&p->scx.dsq_node, &dsq->list);
+
+ dsq_mod_nr(dsq, 1);
+ p->scx.dsq = dsq;
+
+ /*
+ * scx.ddsp_dsq_id and scx.ddsp_enq_flags are only relevant on the
+ * direct dispatch path, but we clear them here because the direct
+ * dispatch verdict may be overridden on the enqueue path during e.g.
+ * bypass.
+ */
+ p->scx.ddsp_dsq_id = SCX_DSQ_INVALID;
+ p->scx.ddsp_enq_flags = 0;
+
+ /*
+ * We're transitioning out of QUEUEING or DISPATCHING. store_release to
+ * match waiters' load_acquire.
+ */
+ if (enq_flags & SCX_ENQ_CLEAR_OPSS)
+ atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_NONE);
+
+ if (is_local) {
+ struct rq *rq = container_of(dsq, struct rq, scx.local_dsq);
+
+ if (sched_class_above(&ext_sched_class, rq->curr->sched_class))
+ resched_curr(rq);
+ } else {
+ raw_spin_unlock(&dsq->lock);
+ }
+}
+
+static void dispatch_dequeue(struct scx_rq *scx_rq, struct task_struct *p)
+{
+ struct scx_dispatch_q *dsq = p->scx.dsq;
+ bool is_local = dsq == &scx_rq->local_dsq;
+
+ if (!dsq) {
+ WARN_ON_ONCE(!list_empty(&p->scx.dsq_node));
+ /*
+ * When dispatching directly from the BPF scheduler to a local
+ * DSQ, the task isn't associated with any DSQ but
+ * @p->scx.holding_cpu may be set under the protection of
+ * %SCX_OPSS_DISPATCHING.
+ */
+ if (p->scx.holding_cpu >= 0)
+ p->scx.holding_cpu = -1;
+ return;
+ }
+
+ if (!is_local)
+ raw_spin_lock(&dsq->lock);
+
+ /*
+ * Now that we hold @dsq->lock, @p->scx.holding_cpu and @p->scx.dsq_node
+ * can't change underneath us.
+ */
+ if (p->scx.holding_cpu < 0) {
+ /* @p must still be on @dsq, dequeue */
+ WARN_ON_ONCE(list_empty(&p->scx.dsq_node));
+ list_del_init(&p->scx.dsq_node);
+ dsq_mod_nr(dsq, -1);
+ } else {
+ /*
+ * We're racing against dispatch_to_local_dsq() which already
+ * removed @p from @dsq and set @p->scx.holding_cpu. Clear the
+ * holding_cpu which tells dispatch_to_local_dsq() that it lost
+ * the race.
+ */
+ WARN_ON_ONCE(!list_empty(&p->scx.dsq_node));
+ p->scx.holding_cpu = -1;
+ }
+ p->scx.dsq = NULL;
+
+ if (!is_local)
+ raw_spin_unlock(&dsq->lock);
+}
+
+static struct scx_dispatch_q *find_user_dsq(u64 dsq_id)
+{
+ return rhashtable_lookup_fast(&dsq_hash, &dsq_id, dsq_hash_params);
+}
+
+static struct scx_dispatch_q *find_non_local_dsq(u64 dsq_id)
+{
+ lockdep_assert(rcu_read_lock_any_held());
+
+ if (dsq_id == SCX_DSQ_GLOBAL)
+ return &scx_dsq_global;
+ else
+ return find_user_dsq(dsq_id);
+}
+
+static struct scx_dispatch_q *find_dsq_for_dispatch(struct rq *rq, u64 dsq_id,
+ struct task_struct *p)
+{
+ struct scx_dispatch_q *dsq;
+
+ if (dsq_id == SCX_DSQ_LOCAL)
+ return &rq->scx.local_dsq;
+
+ dsq = find_non_local_dsq(dsq_id);
+ if (unlikely(!dsq)) {
+ scx_ops_error("non-existent DSQ 0x%llx for %s[%d]",
+ dsq_id, p->comm, p->pid);
+ return &scx_dsq_global;
+ }
+
+ return dsq;
+}
+
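+/*
+ * Record a direct dispatch verdict from ops.select_cpu() or ops.enqueue() for
+ * @p. The recorded verdict is acted upon later by direct_dispatch() on the
+ * enqueue path.
+ */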
+static void mark_direct_dispatch(struct task_struct *ddsp_task,
+ struct task_struct *p, u64 dsq_id,
+ u64 enq_flags)
+{
+ /*
+ * Mark that dispatch already happened from ops.select_cpu() or
+ * ops.enqueue() by spoiling direct_dispatch_task with a non-NULL value
+ * which can never match a valid task pointer.
+ */
+ __this_cpu_write(direct_dispatch_task, ERR_PTR(-ESRCH));
+
+ /* @p must match the task on the enqueue path */
+ if (unlikely(p != ddsp_task)) {
+ if (IS_ERR(ddsp_task))
+ scx_ops_error("%s[%d] already direct-dispatched",
+ p->comm, p->pid);
+ else
+ scx_ops_error("scheduling for %s[%d] but trying to direct-dispatch %s[%d]",
+ ddsp_task->comm, ddsp_task->pid,
+ p->comm, p->pid);
+ return;
+ }
+
+ /*
+ * %SCX_DSQ_LOCAL_ON is not supported during direct dispatch because
+ * dispatching to the local DSQ of a different CPU requires unlocking
+ * the current rq which isn't allowed in the enqueue path. Use
+ * ops.select_cpu() to be on the target CPU and then %SCX_DSQ_LOCAL.
+ */
+ if (unlikely((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON)) {
+ scx_ops_error("SCX_DSQ_LOCAL_ON can't be used for direct-dispatch");
+ return;
+ }
+
+ WARN_ON_ONCE(p->scx.ddsp_dsq_id != SCX_DSQ_INVALID);
+ WARN_ON_ONCE(p->scx.ddsp_enq_flags);
+
+ p->scx.ddsp_dsq_id = dsq_id;
+ p->scx.ddsp_enq_flags = enq_flags;
+}
+
+static void direct_dispatch(struct task_struct *p, u64 enq_flags)
+{
+ struct scx_dispatch_q *dsq;
+
+ enq_flags |= (p->scx.ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS);
+ dsq = find_dsq_for_dispatch(task_rq(p), p->scx.ddsp_dsq_id, p);
+ dispatch_enqueue(dsq, p, enq_flags);
+}
+
+static bool test_rq_online(struct rq *rq)
+{
+#ifdef CONFIG_SMP
+ return rq->online;
+#else
+ return true;
+#endif
+}
+
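+/*
+ * Enqueue @p on the appropriate DSQ. Direct dispatch verdicts, bypass mode and
+ * the cases where the BPF scheduler shouldn't be bothered short-circuit to the
+ * local or global DSQ; otherwise, @p is handed to ops.enqueue().
+ */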
+static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
+ int sticky_cpu)
+{
+ struct task_struct **ddsp_taskp;
+ unsigned long qseq;
+
+ WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_QUEUED));
+
+ /* rq migration */
+ if (sticky_cpu == cpu_of(rq))
+ goto local_norefill;
+
+ /*
+ * If !rq->online, we already told the BPF scheduler that the CPU is
+ * offline. We're just trying to on/offline the CPU. Don't bother the
+ * BPF scheduler.
+ */
+ if (unlikely(!test_rq_online(rq)))
+ goto local;
+
+ if (scx_ops_bypassing()) {
+ if (enq_flags & SCX_ENQ_LAST)
+ goto local;
+ else
+ goto global;
+ }
+
+ if (p->scx.ddsp_dsq_id != SCX_DSQ_INVALID)
+ goto direct;
+
+ /* see %SCX_OPS_ENQ_EXITING */
+ if (!static_branch_unlikely(&scx_ops_enq_exiting) &&
+ unlikely(p->flags & PF_EXITING))
+ goto local;
+
+ /* see %SCX_OPS_ENQ_LAST */
+ if (!static_branch_unlikely(&scx_ops_enq_last) &&
+ (enq_flags & SCX_ENQ_LAST))
+ goto local;
+
+ if (!SCX_HAS_OP(enqueue))
+ goto global;
+
+ /* DSQ bypass didn't trigger, enqueue on the BPF scheduler */
+ qseq = rq->scx.ops_qseq++ << SCX_OPSS_QSEQ_SHIFT;
+
+ WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE);
+ atomic_long_set(&p->scx.ops_state, SCX_OPSS_QUEUEING | qseq);
+
+ ddsp_taskp = this_cpu_ptr(&direct_dispatch_task);
+ WARN_ON_ONCE(*ddsp_taskp);
+ *ddsp_taskp = p;
+
+ SCX_CALL_OP(SCX_KF_ENQUEUE, enqueue, p, enq_flags);
+
+ *ddsp_taskp = NULL;
+ if (p->scx.ddsp_dsq_id != SCX_DSQ_INVALID)
+ goto direct;
+
+ /*
+ * If not directly dispatched, QUEUEING isn't clear yet and dispatch or
+ * dequeue may be waiting. The store_release matches their load_acquire.
+ */
+ atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_QUEUED | qseq);
+ return;
+
+direct:
+ direct_dispatch(p, enq_flags);
+ return;
+
+local:
+ p->scx.slice = SCX_SLICE_DFL;
+local_norefill:
+ dispatch_enqueue(&rq->scx.local_dsq, p, enq_flags);
+ return;
+
+global:
+ p->scx.slice = SCX_SLICE_DFL;
+ dispatch_enqueue(&scx_dsq_global, p, enq_flags);
+}
+
+static bool task_runnable(const struct task_struct *p)
+{
+ return !list_empty(&p->scx.runnable_node);
+}
+
+static void set_task_runnable(struct rq *rq, struct task_struct *p)
+{
+ lockdep_assert_rq_held(rq);
+
+ /*
+ * list_add_tail() must be used. scx_ops_bypass() depends on tasks being
+ * appended to the runnable_list.
+ */
+ list_add_tail(&p->scx.runnable_node, &rq->scx.runnable_list);
+}
+
+static void clr_task_runnable(struct task_struct *p)
+{
+ list_del_init(&p->scx.runnable_node);
+}
+
+static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags)
+{
+ int sticky_cpu = p->scx.sticky_cpu;
+
+ enq_flags |= rq->scx.extra_enq_flags;
+
+ if (sticky_cpu >= 0)
+ p->scx.sticky_cpu = -1;
+
+ /*
+ * Restoring a running task will be immediately followed by
+ * set_next_task_scx() which expects the task to not be on the BPF
+ * scheduler as tasks can only start running through local DSQs. Force
+ * direct-dispatch into the local DSQ by setting the sticky_cpu.
+ */
+ if (unlikely(enq_flags & ENQUEUE_RESTORE) && task_current(rq, p))
+ sticky_cpu = cpu_of(rq);
+
+ if (p->scx.flags & SCX_TASK_QUEUED) {
+ WARN_ON_ONCE(!task_runnable(p));
+ return;
+ }
+
+ set_task_runnable(rq, p);
+ p->scx.flags |= SCX_TASK_QUEUED;
+ rq->scx.nr_running++;
+ add_nr_running(rq, 1);
+
+ do_enqueue_task(rq, p, enq_flags, sticky_cpu);
+}
+
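+/*
+ * Notify the BPF scheduler that @p is being dequeued and, if @p is currently
+ * being dispatched, wait for the transfer to complete so that @p can't land on
+ * a DSQ afterwards.
+ */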
+static void ops_dequeue(struct task_struct *p, u64 deq_flags)
+{
+ unsigned long opss;
+
+ clr_task_runnable(p);
+
+ /* acquire ensures that we see the preceding updates on QUEUED */
+ opss = atomic_long_read_acquire(&p->scx.ops_state);
+
+ switch (opss & SCX_OPSS_STATE_MASK) {
+ case SCX_OPSS_NONE:
+ break;
+ case SCX_OPSS_QUEUEING:
+ /*
+ * QUEUEING is started and finished while holding @p's rq lock.
+ * As we're holding the rq lock now, we shouldn't see QUEUEING.
+ */
+ BUG();
+ case SCX_OPSS_QUEUED:
+ if (SCX_HAS_OP(dequeue))
+ SCX_CALL_OP(SCX_KF_REST, dequeue, p, deq_flags);
+
+ if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss,
+ SCX_OPSS_NONE))
+ break;
+ fallthrough;
+ case SCX_OPSS_DISPATCHING:
+ /*
+ * If @p is being dispatched from the BPF scheduler to a DSQ,
+ * wait for the transfer to complete so that @p doesn't get
+ * added to its DSQ after dequeueing is complete.
+ *
+ * As we're waiting on DISPATCHING with the rq locked, the
+ * dispatching side shouldn't try to lock the rq while
+ * DISPATCHING is set. See dispatch_to_local_dsq().
+ *
+ * DISPATCHING shouldn't have qseq set and control can reach
+ * here with NONE @opss from the above QUEUED case block.
+ * Explicitly wait on %SCX_OPSS_DISPATCHING instead of @opss.
+ */
+ wait_ops_state(p, SCX_OPSS_DISPATCHING);
+ BUG_ON(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE);
+ break;
+ }
+}
+
+static void dequeue_task_scx(struct rq *rq, struct task_struct *p, int deq_flags)
+{
+ struct scx_rq *scx_rq = &rq->scx;
+
+ if (!(p->scx.flags & SCX_TASK_QUEUED)) {
+ WARN_ON_ONCE(task_runnable(p));
+ return;
+ }
+
+ ops_dequeue(p, deq_flags);
+
+ if (deq_flags & SCX_DEQ_SLEEP)
+ p->scx.flags |= SCX_TASK_DEQD_FOR_SLEEP;
+ else
+ p->scx.flags &= ~SCX_TASK_DEQD_FOR_SLEEP;
+
+ p->scx.flags &= ~SCX_TASK_QUEUED;
+ scx_rq->nr_running--;
+ sub_nr_running(rq, 1);
+
+ dispatch_dequeue(scx_rq, p);
+}
+
+static void yield_task_scx(struct rq *rq)
+{
+ struct task_struct *p = rq->curr;
+
+ if (SCX_HAS_OP(yield))
+ SCX_CALL_OP_RET(SCX_KF_REST, yield, p, NULL);
+ else
+ p->scx.slice = 0;
+}
+
+static bool yield_to_task_scx(struct rq *rq, struct task_struct *to)
+{
+ struct task_struct *from = rq->curr;
+
+ if (SCX_HAS_OP(yield))
+ return SCX_CALL_OP_RET(SCX_KF_REST, yield, from, to);
+ else
+ return false;
+}
+
+#ifdef CONFIG_SMP
+/**
+ * move_task_to_local_dsq - Move a task from a different rq to a local DSQ
+ * @rq: rq to move the task into, currently locked
+ * @p: task to move
+ * @enq_flags: %SCX_ENQ_*
+ *
+ * Move @p which is currently on a different rq to @rq's local DSQ. The caller
+ * must:
+ *
+ * 1. Start with exclusive access to @p either through its DSQ lock or
+ * %SCX_OPSS_DISPATCHING flag.
+ *
+ * 2. Set @p->scx.holding_cpu to raw_smp_processor_id().
+ *
+ * 3. Remember task_rq(@p). Release the exclusive access so that we don't
+ * deadlock with dequeue.
+ *
+ * 4. Lock @rq and the task_rq from #3.
+ *
+ * 5. Call this function.
+ *
+ * Returns %true if @p was successfully moved. %false after racing dequeue and
+ * losing.
+ */
+static bool move_task_to_local_dsq(struct rq *rq, struct task_struct *p,
+ u64 enq_flags)
+{
+ struct rq *task_rq;
+
+ lockdep_assert_rq_held(rq);
+
+ /*
+ * If dequeue got to @p while we were trying to lock both rq's, it'd
+ * have cleared @p->scx.holding_cpu to -1. While other cpus may have
+ * updated it to different values afterwards, as this operation can neither be
+ * preempted nor recurse, @p->scx.holding_cpu can never become
+ * raw_smp_processor_id() again before we're done. Thus, we can tell
+ * whether we lost to dequeue by testing whether @p->scx.holding_cpu is
+ * still raw_smp_processor_id().
+ *
+ * See dispatch_dequeue() for the counterpart.
+ */
+ if (unlikely(p->scx.holding_cpu != raw_smp_processor_id()))
+ return false;
+
+ /* @p->rq couldn't have changed if we're still the holding cpu */
+ task_rq = task_rq(p);
+ lockdep_assert_rq_held(task_rq);
+
+ WARN_ON_ONCE(!cpumask_test_cpu(cpu_of(rq), p->cpus_ptr));
+ deactivate_task(task_rq, p, 0);
+ set_task_cpu(p, cpu_of(rq));
+ p->scx.sticky_cpu = cpu_of(rq);
+
+ /*
+ * We want to pass scx-specific enq_flags but activate_task() will
+ * truncate the upper 32 bit. As we own @rq, we can pass them through
+ * @rq->scx.extra_enq_flags instead.
+ */
+ WARN_ON_ONCE(rq->scx.extra_enq_flags);
+ rq->scx.extra_enq_flags = enq_flags;
+ activate_task(rq, p, 0);
+ rq->scx.extra_enq_flags = 0;
+
+ return true;
+}
+
+/**
+ * dispatch_to_local_dsq_lock - Ensure source and destination rq's are locked
+ * @rq: current rq which is locked
+ * @rf: rq_flags to use when unlocking @rq
+ * @src_rq: rq to move task from
+ * @dst_rq: rq to move task to
+ *
+ * We're holding @rq lock and trying to dispatch a task from @src_rq to
+ * @dst_rq's local DSQ and thus need to lock both @src_rq and @dst_rq. Whether
+ * @rq stays locked isn't important as long as the state is restored after
+ * dispatch_to_local_dsq_unlock().
+ */
+static void dispatch_to_local_dsq_lock(struct rq *rq, struct rq_flags *rf,
+ struct rq *src_rq, struct rq *dst_rq)
+{
+ rq_unpin_lock(rq, rf);
+
+ if (src_rq == dst_rq) {
+ raw_spin_rq_unlock(rq);
+ raw_spin_rq_lock(dst_rq);
+ } else if (rq == src_rq) {
+ double_lock_balance(rq, dst_rq);
+ rq_repin_lock(rq, rf);
+ } else if (rq == dst_rq) {
+ double_lock_balance(rq, src_rq);
+ rq_repin_lock(rq, rf);
+ } else {
+ raw_spin_rq_unlock(rq);
+ double_rq_lock(src_rq, dst_rq);
+ }
+}
+
+/**
+ * dispatch_to_local_dsq_unlock - Undo dispatch_to_local_dsq_lock()
+ * @rq: current rq which is locked
+ * @rf: rq_flags to use when unlocking @rq
+ * @src_rq: rq to move task from
+ * @dst_rq: rq to move task to
+ *
+ * Unlock @src_rq and @dst_rq and ensure that @rq is locked on return.
+ */
+static void dispatch_to_local_dsq_unlock(struct rq *rq, struct rq_flags *rf,
+ struct rq *src_rq, struct rq *dst_rq)
+{
+ if (src_rq == dst_rq) {
+ raw_spin_rq_unlock(dst_rq);
+ raw_spin_rq_lock(rq);
+ rq_repin_lock(rq, rf);
+ } else if (rq == src_rq) {
+ double_unlock_balance(rq, dst_rq);
+ } else if (rq == dst_rq) {
+ double_unlock_balance(rq, src_rq);
+ } else {
+ double_rq_unlock(src_rq, dst_rq);
+ raw_spin_rq_lock(rq);
+ rq_repin_lock(rq, rf);
+ }
+}
+#endif /* CONFIG_SMP */
+
+static void consume_local_task(struct rq *rq, struct scx_dispatch_q *dsq,
+ struct task_struct *p)
+{
+ struct scx_rq *scx_rq = &rq->scx;
+
+ lockdep_assert_held(&dsq->lock); /* released on return */
+
+ /* @dsq is locked and @p is on this rq */
+ WARN_ON_ONCE(p->scx.holding_cpu >= 0);
+ list_move_tail(&p->scx.dsq_node, &scx_rq->local_dsq.list);
+ dsq_mod_nr(dsq, -1);
+ dsq_mod_nr(&scx_rq->local_dsq, 1);
+ p->scx.dsq = &scx_rq->local_dsq;
+ raw_spin_unlock(&dsq->lock);
+}
+
+#ifdef CONFIG_SMP
+/*
+ * Similar to kernel/sched/core.c::is_cpu_allowed() but we're testing whether @p
+ * can be pulled to @rq.
+ */
+static bool task_can_run_on_remote_rq(struct task_struct *p, struct rq *rq)
+{
+ int cpu = cpu_of(rq);
+
+ if (!cpumask_test_cpu(cpu, p->cpus_ptr))
+ return false;
+ if (unlikely(is_migration_disabled(p)))
+ return false;
+ if (!(p->flags & PF_KTHREAD) && unlikely(!task_cpu_possible(cpu, p)))
+ return false;
+ if (unlikely(!test_rq_online(rq)))
+ return false;
+ return true;
+}
+
+static bool consume_remote_task(struct rq *rq, struct rq_flags *rf,
+ struct scx_dispatch_q *dsq,
+ struct task_struct *p, struct rq *task_rq)
+{
+ bool moved = false;
+
+ lockdep_assert_held(&dsq->lock); /* released on return */
+
+ /*
+ * @dsq is locked and @p is on a remote rq. @p is currently protected by
+ * @dsq->lock. We want to pull @p to @rq but may deadlock if we grab
+ * @task_rq while holding @dsq and @rq locks. As dequeue can't drop the
+ * rq lock or fail, do a little dancing from our side. See
+ * move_task_to_local_dsq().
+ */
+ WARN_ON_ONCE(p->scx.holding_cpu >= 0);
+ list_del_init(&p->scx.dsq_node);
+ dsq_mod_nr(dsq, -1);
+ p->scx.holding_cpu = raw_smp_processor_id();
+ raw_spin_unlock(&dsq->lock);
+
+ rq_unpin_lock(rq, rf);
+ double_lock_balance(rq, task_rq);
+ rq_repin_lock(rq, rf);
+
+ moved = move_task_to_local_dsq(rq, p, 0);
+
+ double_unlock_balance(rq, task_rq);
+
+ return moved;
+}
+#else /* CONFIG_SMP */
+static bool task_can_run_on_remote_rq(struct task_struct *p, struct rq *rq) { return false; }
+static bool consume_remote_task(struct rq *rq, struct rq_flags *rf,
+ struct scx_dispatch_q *dsq,
+ struct task_struct *p, struct rq *task_rq) { return false; }
+#endif /* CONFIG_SMP */
+
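+/*
+ * Try to consume a task from @dsq into @rq's local DSQ, skipping tasks which
+ * can't run on @rq. Returns %true if a task was consumed.
+ */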
+static bool consume_dispatch_q(struct rq *rq, struct rq_flags *rf,
+ struct scx_dispatch_q *dsq)
+{
+ struct task_struct *p;
+retry:
+ if (list_empty(&dsq->list))
+ return false;
+
+ raw_spin_lock(&dsq->lock);
+
+ list_for_each_entry(p, &dsq->list, scx.dsq_node) {
+ struct rq *task_rq = task_rq(p);
+
+ if (rq == task_rq) {
+ consume_local_task(rq, dsq, p);
+ return true;
+ }
+
+ if (task_can_run_on_remote_rq(p, rq)) {
+ if (likely(consume_remote_task(rq, rf, dsq, p, task_rq)))
+ return true;
+ goto retry;
+ }
+ }
+
+ raw_spin_unlock(&dsq->lock);
+ return false;
+}
+
+enum dispatch_to_local_dsq_ret {
+ DTL_DISPATCHED, /* successfully dispatched */
+ DTL_LOST, /* lost race to dequeue */
+ DTL_NOT_LOCAL, /* destination is not a local DSQ */
+ DTL_INVALID, /* invalid local dsq_id */
+};
+
+/**
+ * dispatch_to_local_dsq - Dispatch a task to a local dsq
+ * @rq: current rq which is locked
+ * @rf: rq_flags to use when unlocking @rq
+ * @dsq_id: destination dsq ID
+ * @p: task to dispatch
+ * @enq_flags: %SCX_ENQ_*
+ *
+ * We're holding @rq lock and want to dispatch @p to the local DSQ identified by
+ * @dsq_id. This function performs all the synchronization dancing needed
+ * because local DSQs are protected with rq locks.
+ *
+ * The caller must have exclusive ownership of @p (e.g. through
+ * %SCX_OPSS_DISPATCHING).
+ */
+static enum dispatch_to_local_dsq_ret
+dispatch_to_local_dsq(struct rq *rq, struct rq_flags *rf, u64 dsq_id,
+ struct task_struct *p, u64 enq_flags)
+{
+ struct rq *src_rq = task_rq(p);
+ struct rq *dst_rq;
+
+ /*
+ * We're synchronized against dequeue through DISPATCHING. As @p can't
+ * be dequeued, its task_rq and cpus_allowed are stable too.
+ */
+ if (dsq_id == SCX_DSQ_LOCAL) {
+ dst_rq = rq;
+ } else if ((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON) {
+ s32 cpu = dsq_id & SCX_DSQ_LOCAL_CPU_MASK;
+
+ if (!ops_cpu_valid(cpu, "in SCX_DSQ_LOCAL_ON dispatch verdict"))
+ return DTL_INVALID;
+ dst_rq = cpu_rq(cpu);
+ } else {
+ return DTL_NOT_LOCAL;
+ }
+
+ /* if dispatching to @rq that @p is already on, no lock dancing needed */
+ if (rq == src_rq && rq == dst_rq) {
+ dispatch_enqueue(&dst_rq->scx.local_dsq, p,
+ enq_flags | SCX_ENQ_CLEAR_OPSS);
+ return DTL_DISPATCHED;
+ }
+
+#ifdef CONFIG_SMP
+ if (cpumask_test_cpu(cpu_of(dst_rq), p->cpus_ptr)) {
+ struct rq *locked_dst_rq = dst_rq;
+ bool dsp;
+
+ /*
+ * @p is on a possibly remote @src_rq which we need to lock to
+ * move the task. If dequeue is in progress, it'd be locking
+ * @src_rq and waiting on DISPATCHING, so we can't grab @src_rq
+ * lock while holding DISPATCHING.
+ *
+ * As DISPATCHING guarantees that @p is wholly ours, we can
+ * pretend that we're moving from a DSQ and use the same
+ * mechanism - mark the task under transfer with holding_cpu,
+ * release DISPATCHING and then follow the same protocol.
+ */
+ p->scx.holding_cpu = raw_smp_processor_id();
+
+ /* store_release ensures that dequeue sees the above */
+ atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_NONE);
+
+ dispatch_to_local_dsq_lock(rq, rf, src_rq, locked_dst_rq);
+
+ /*
+ * We don't require the BPF scheduler to avoid dispatching to
+ * offline CPUs mostly for convenience but also because CPUs can
+ * go offline between scx_bpf_dispatch() calls and here. If @p
+ * is destined to an offline CPU, queue it on its current CPU
+ * instead, which should always be safe. As this is an allowed
+ * behavior, don't trigger an ops error.
+ */
+ if (unlikely(!test_rq_online(dst_rq)))
+ dst_rq = src_rq;
+
+ if (src_rq == dst_rq) {
+ /*
+ * As @p is staying on the same rq, there's no need to
+ * go through the full deactivate/activate cycle.
+ * Optimize by abbreviating the operations in
+ * move_task_to_local_dsq().
+ */
+ dsp = p->scx.holding_cpu == raw_smp_processor_id();
+ if (likely(dsp)) {
+ p->scx.holding_cpu = -1;
+ dispatch_enqueue(&dst_rq->scx.local_dsq, p,
+ enq_flags);
+ }
+ } else {
+ dsp = move_task_to_local_dsq(dst_rq, p, enq_flags);
+ }
+
+ /* if the destination CPU is idle, wake it up */
+ if (dsp && p->sched_class > dst_rq->curr->sched_class)
+ resched_curr(dst_rq);
+
+ dispatch_to_local_dsq_unlock(rq, rf, src_rq, locked_dst_rq);
+
+ return dsp ? DTL_DISPATCHED : DTL_LOST;
+ }
+#endif /* CONFIG_SMP */
+
+ scx_ops_error("SCX_DSQ_LOCAL[_ON] verdict target cpu %d not allowed for %s[%d]",
+ cpu_of(dst_rq), p->comm, p->pid);
+ return DTL_INVALID;
+}
+
+/**
+ * finish_dispatch - Asynchronously finish dispatching a task
+ * @rq: current rq which is locked
+ * @rf: rq_flags to use when unlocking @rq
+ * @p: task to finish dispatching
+ * @qseq_at_dispatch: qseq when @p started getting dispatched
+ * @dsq_id: destination DSQ ID
+ * @enq_flags: %SCX_ENQ_*
+ *
+ * Dispatching to local DSQs may need to wait for queueing to complete or
+ * require rq lock dancing. As we don't want to do either while inside
+ * ops.dispatch() to avoid locking order inversion, we split dispatching into
+ * two parts. scx_bpf_dispatch() which is called by ops.dispatch() records the
+ * task and its qseq. Once ops.dispatch() returns, this function is called to
+ * finish up.
+ *
+ * There is no guarantee that @p is still valid for dispatching or even that it
+ * was valid in the first place. Make sure that the task is still owned by the
+ * BPF scheduler and claim the ownership before dispatching.
+ */
+static void finish_dispatch(struct rq *rq, struct rq_flags *rf,
+ struct task_struct *p,
+ unsigned long qseq_at_dispatch,
+ u64 dsq_id, u64 enq_flags)
+{
+ struct scx_dispatch_q *dsq;
+ unsigned long opss;
+
+retry:
+ /*
+ * No need for _acquire here. @p is accessed only after a successful
+ * try_cmpxchg to DISPATCHING.
+ */
+ opss = atomic_long_read(&p->scx.ops_state);
+
+ switch (opss & SCX_OPSS_STATE_MASK) {
+ case SCX_OPSS_DISPATCHING:
+ case SCX_OPSS_NONE:
+ /* someone else already got to it */
+ return;
+ case SCX_OPSS_QUEUED:
+ /*
+ * If qseq doesn't match, @p has gone through at least one
+ * dispatch/dequeue and re-enqueue cycle between
+ * scx_bpf_dispatch() and here and we have no claim on it.
+ */
+ if ((opss & SCX_OPSS_QSEQ_MASK) != qseq_at_dispatch)
+ return;
+
+ /*
+ * While we know @p is accessible, we don't yet have a claim on
+ * it - the BPF scheduler is allowed to dispatch tasks
+ * spuriously and there can be a racing dequeue attempt. Let's
+ * claim @p by atomically transitioning it from QUEUED to
+ * DISPATCHING.
+ */
+ if (likely(atomic_long_try_cmpxchg(&p->scx.ops_state, &opss,
+ SCX_OPSS_DISPATCHING)))
+ break;
+ goto retry;
+ case SCX_OPSS_QUEUEING:
+ /*
+ * do_enqueue_task() is in the process of transferring the task
+ * to the BPF scheduler while holding @p's rq lock. As we aren't
+ * holding any kernel or BPF resource that the enqueue path may
+ * depend upon, it's safe to wait.
+ */
+ wait_ops_state(p, opss);
+ goto retry;
+ }
+
+ BUG_ON(!(p->scx.flags & SCX_TASK_QUEUED));
+
+ switch (dispatch_to_local_dsq(rq, rf, dsq_id, p, enq_flags)) {
+ case DTL_DISPATCHED:
+ break;
+ case DTL_LOST:
+ break;
+ case DTL_INVALID:
+ dsq_id = SCX_DSQ_GLOBAL;
+ fallthrough;
+ case DTL_NOT_LOCAL:
+ dsq = find_dsq_for_dispatch(cpu_rq(raw_smp_processor_id()),
+ dsq_id, p);
+ dispatch_enqueue(dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
+ break;
+ }
+}
+
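+/*
+ * Finish the dispatches buffered on this CPU by scx_bpf_dispatch() and update
+ * the per-CPU dispatch counters.
+ */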
+static void flush_dispatch_buf(struct rq *rq, struct rq_flags *rf)
+{
+ struct scx_dsp_ctx *dspc = this_cpu_ptr(&scx_dsp_ctx);
+ u32 u;
+
+ for (u = 0; u < dspc->buf_cursor; u++) {
+ struct scx_dsp_buf_ent *ent = &this_cpu_ptr(scx_dsp_buf)[u];
+
+ finish_dispatch(rq, rf, ent->task, ent->qseq, ent->dsq_id,
+ ent->enq_flags);
+ }
+
+ dspc->nr_tasks += dspc->buf_cursor;
+ dspc->buf_cursor = 0;
+}
+
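+/*
+ * Ensure @rq has something to run. If @prev is a runnable SCX task with slice
+ * left, keep it. Otherwise, try the local and global DSQs and keep calling
+ * ops.dispatch() until a task becomes available or the BPF scheduler stops
+ * producing tasks. Returns non-zero if there's a task to run.
+ */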
+static int balance_scx(struct rq *rq, struct task_struct *prev,
+ struct rq_flags *rf)
+{
+ struct scx_rq *scx_rq = &rq->scx;
+ struct scx_dsp_ctx *dspc = this_cpu_ptr(&scx_dsp_ctx);
+ bool prev_on_scx = prev->sched_class == &ext_sched_class;
+
+ lockdep_assert_rq_held(rq);
+
+ if (prev_on_scx) {
+ WARN_ON_ONCE(prev->scx.flags & SCX_TASK_BAL_KEEP);
+ update_curr_scx(rq);
+
+ /*
+ * If @prev is runnable & has slice left, it has priority and
+ * fetching more just increases latency for the fetched tasks.
+ * Tell put_prev_task_scx() to put @prev on local_dsq.
+ *
+ * See scx_ops_disable_workfn() for the explanation on the
+ * bypassing test.
+ */
+ if ((prev->scx.flags & SCX_TASK_QUEUED) &&
+ prev->scx.slice && !scx_ops_bypassing()) {
+ prev->scx.flags |= SCX_TASK_BAL_KEEP;
+ return 1;
+ }
+ }
+
+ /* if there already are tasks to run, nothing to do */
+ if (scx_rq->local_dsq.nr)
+ return 1;
+
+ if (consume_dispatch_q(rq, rf, &scx_dsq_global))
+ return 1;
+
+ if (!SCX_HAS_OP(dispatch) || scx_ops_bypassing())
+ return 0;
+
+ dspc->rq = rq;
+ dspc->rf = rf;
+
+ /*
+ * The dispatch loop. Because flush_dispatch_buf() may drop the rq lock,
+ * the local DSQ might still end up empty after a successful
+ * ops.dispatch(). If the local DSQ is empty even after ops.dispatch()
+ * produced some tasks, retry. The BPF scheduler may depend on this
+ * looping behavior to simplify its implementation.
+ */
+ do {
+ dspc->nr_tasks = 0;
+
+ SCX_CALL_OP(SCX_KF_DISPATCH, dispatch, cpu_of(rq),
+ prev_on_scx ? prev : NULL);
+
+ flush_dispatch_buf(rq, rf);
+
+ if (scx_rq->local_dsq.nr)
+ return 1;
+ if (consume_dispatch_q(rq, rf, &scx_dsq_global))
+ return 1;
+ } while (dspc->nr_tasks);
+
+ return 0;
+}
+
+static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first)
+{
+ if (p->scx.flags & SCX_TASK_QUEUED) {
+ WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE);
+ dispatch_dequeue(&rq->scx, p);
+ }
+
+ p->se.exec_start = rq_clock_task(rq);
+
+ clr_task_runnable(p);
+}
+
+static void put_prev_task_scx(struct rq *rq, struct task_struct *p)
+{
+#ifndef CONFIG_SMP
+ /*
+ * UP workaround.
+ *
+ * Because SCX may transfer tasks across CPUs during dispatch, dispatch
+ * is performed from its balance operation which isn't called in UP.
+ * Let's work around by calling it from the operations which come right
+ * after.
+ *
+ * 1. If the prev task is on SCX, pick_next_task() calls
+ * .put_prev_task() right after. As .put_prev_task() is also called
+ * from other places, we need to distinguish the calls which can be
+ * done by looking at the previous task's state - if still queued or
+ * dequeued with %SCX_DEQ_SLEEP, the caller must be pick_next_task().
+ * This case is handled here.
+ *
+ * 2. If the prev task is not on SCX, the first following call into SCX
+ * will be .pick_next_task(), which is covered by calling
+ * balance_scx() from pick_next_task_scx().
+ *
+ * Note that we can't merge the first case into the second as
+ * balance_scx() must be called before the previous SCX task goes
+ * through put_prev_task_scx().
+ *
+ * As UP doesn't transfer tasks around, balance_scx() doesn't need @rf.
+ * Pass in %NULL.
+ */
+ if (p->scx.flags & (SCX_TASK_QUEUED | SCX_TASK_DEQD_FOR_SLEEP))
+ balance_scx(rq, p, NULL);
+#endif
+
+ update_curr_scx(rq);
+
+ /*
+ * If we're being called from put_prev_task_balance(), balance_scx() may
+ * have decided that @p should keep running.
+ */
+ if (p->scx.flags & SCX_TASK_BAL_KEEP) {
+ p->scx.flags &= ~SCX_TASK_BAL_KEEP;
+ set_task_runnable(rq, p);
+ dispatch_enqueue(&rq->scx.local_dsq, p, SCX_ENQ_HEAD);
+ return;
+ }
+
+ if (p->scx.flags & SCX_TASK_QUEUED) {
+ set_task_runnable(rq, p);
+
+ /*
+ * If @p has slice left and balance_scx() didn't tag it for
+ * keeping, @p is getting preempted by a higher priority
+ * scheduler class. Leave it at the head of the local DSQ.
+ */
+ if (p->scx.slice && !scx_ops_bypassing()) {
+ dispatch_enqueue(&rq->scx.local_dsq, p, SCX_ENQ_HEAD);
+ return;
+ }
+
+ /*
+ * If we're in the pick_next_task path, balance_scx() should
+ * have already populated the local DSQ if there are any other
+ * available tasks. If empty, tell ops.enqueue() that @p is the
+ * only one available for this cpu. ops.enqueue() should put it
+ * on the local DSQ so that the subsequent pick_next_task_scx()
+ * can find the task unless it wants to trigger a separate
+ * follow-up scheduling event.
+ */
+ if (list_empty(&rq->scx.local_dsq.list))
+ do_enqueue_task(rq, p, SCX_ENQ_LAST, -1);
+ else
+ do_enqueue_task(rq, p, 0, -1);
+ }
+}
+
+static struct task_struct *first_local_task(struct rq *rq)
+{
+ return list_first_entry_or_null(&rq->scx.local_dsq.list,
+ struct task_struct, scx.dsq_node);
+}
+
+static struct task_struct *pick_next_task_scx(struct rq *rq)
+{
+ struct task_struct *p;
+
+#ifndef CONFIG_SMP
+ /* UP workaround - see the comment at the head of put_prev_task_scx() */
+ if (unlikely(rq->curr->sched_class != &ext_sched_class))
+ balance_scx(rq, rq->curr, NULL);
+#endif
+
+ p = first_local_task(rq);
+ if (!p)
+ return NULL;
+
+ set_next_task_scx(rq, p, true);
+
+ if (unlikely(!p->scx.slice)) {
+ if (!scx_ops_bypassing() && !scx_warned_zero_slice) {
+ printk_deferred(KERN_WARNING "sched_ext: %s[%d] has zero slice in pick_next_task_scx()\n",
+ p->comm, p->pid);
+ scx_warned_zero_slice = true;
+ }
+ p->scx.slice = SCX_SLICE_DFL;
+ }
+
+ return p;
+}
+
+#ifdef CONFIG_SMP
+
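+/*
+ * Try to claim @cpu from the idle mask. Returns %true if @cpu was idle and has
+ * now been claimed by the caller. The SMT sibling mask is cleared either way
+ * as the core is no longer wholly idle.
+ */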
+static bool test_and_clear_cpu_idle(int cpu)
+{
+#ifdef CONFIG_SCHED_SMT
+ /*
+ * SMT mask should be cleared whether we can claim @cpu or not. The SMT
+ * cluster is not wholly idle either way. This also prevents
+ * scx_pick_idle_cpu() from getting caught in an infinite loop.
+ */
+ if (sched_smt_active()) {
+ const struct cpumask *smt = cpu_smt_mask(cpu);
+
+ /*
+ * If offline, @cpu is not its own sibling and
+ * scx_pick_idle_cpu() can get caught in an infinite loop as
+ * @cpu is never cleared from idle_masks.smt. Ensure that @cpu
+ * is eventually cleared.
+ */
+ if (cpumask_intersects(smt, idle_masks.smt))
+ cpumask_andnot(idle_masks.smt, idle_masks.smt, smt);
+ else if (cpumask_test_cpu(cpu, idle_masks.smt))
+ __cpumask_clear_cpu(cpu, idle_masks.smt);
+ }
+#endif
+ return cpumask_test_and_clear_cpu(cpu, idle_masks.cpu);
+}
+
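+/*
+ * Pick and claim an idle CPU from @cpus_allowed, preferring wholly idle cores.
+ * With %SCX_PICK_IDLE_CORE, only wholly idle cores are considered. Returns the
+ * claimed CPU or -%EBUSY if none is available.
+ */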
+static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
+{
+ int cpu;
+
+retry:
+ if (sched_smt_active()) {
+ cpu = cpumask_any_and_distribute(idle_masks.smt, cpus_allowed);
+ if (cpu < nr_cpu_ids)
+ goto found;
+
+ if (flags & SCX_PICK_IDLE_CORE)
+ return -EBUSY;
+ }
+
+ cpu = cpumask_any_and_distribute(idle_masks.cpu, cpus_allowed);
+ if (cpu >= nr_cpu_ids)
+ return -EBUSY;
+
+found:
+ if (test_and_clear_cpu_idle(cpu))
+ return cpu;
+ else
+ goto retry;
+}
+
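+/*
+ * The default CPU selection logic used when ops.select_cpu() isn't
+ * implemented. Prefers, roughly in order: the waker's CPU on sync wakeups, an
+ * idle @prev_cpu, a wholly idle core and finally any idle CPU. *@found
+ * indicates whether an idle CPU was claimed.
+ */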
+static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
+ u64 wake_flags, bool *found)
+{
+ s32 cpu;
+
+ *found = false;
+
+ if (!static_branch_likely(&scx_builtin_idle_enabled)) {
+ scx_ops_error("built-in idle tracking is disabled");
+ return prev_cpu;
+ }
+
+ /*
+ * If WAKE_SYNC, the waker's local DSQ is empty, and the system is
+ * underutilized, wake up @p to the local DSQ of the waker. Checking
+ * only for an empty local DSQ is insufficient as it could give the
+ * wakee an unfair advantage when the system is oversaturated.
+ * Checking only for the presence of idle CPUs is also insufficient as
+ * the local DSQ of the waker could have tasks piled up on it even if
+ * there is an idle core elsewhere on the system.
+ */
+ cpu = smp_processor_id();
+ if ((wake_flags & SCX_WAKE_SYNC) && p->nr_cpus_allowed > 1 &&
+ !cpumask_empty(idle_masks.cpu) && !(current->flags & PF_EXITING) &&
+ cpu_rq(cpu)->scx.local_dsq.nr == 0) {
+ if (cpumask_test_cpu(cpu, p->cpus_ptr))
+ goto cpu_found;
+ }
+
+ if (p->nr_cpus_allowed == 1) {
+ if (test_and_clear_cpu_idle(prev_cpu)) {
+ cpu = prev_cpu;
+ goto cpu_found;
+ } else {
+ return prev_cpu;
+ }
+ }
+
+ /*
+ * If CPU has SMT, any wholly idle CPU is likely a better pick than
+ * partially idle @prev_cpu.
+ */
+ if (sched_smt_active()) {
+ if (cpumask_test_cpu(prev_cpu, idle_masks.smt) &&
+ test_and_clear_cpu_idle(prev_cpu)) {
+ cpu = prev_cpu;
+ goto cpu_found;
+ }
+
+ cpu = scx_pick_idle_cpu(p->cpus_ptr, SCX_PICK_IDLE_CORE);
+ if (cpu >= 0)
+ goto cpu_found;
+ }
+
+ if (test_and_clear_cpu_idle(prev_cpu)) {
+ cpu = prev_cpu;
+ goto cpu_found;
+ }
+
+ cpu = scx_pick_idle_cpu(p->cpus_ptr, 0);
+ if (cpu >= 0)
+ goto cpu_found;
+
+ return prev_cpu;
+
+cpu_found:
+ *found = true;
+ return cpu;
+}
+
+static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flags)
+{
+ /*
+ * sched_exec() calls with %WF_EXEC when @p is about to exec(2) as it
+ * can be a good migration opportunity with low cache and memory
+ * footprint. Returning a CPU different than @prev_cpu triggers
+ * immediate rq migration. However, for SCX, as the current rq
+ * association doesn't dictate where the task is going to run, this
+ * doesn't fit well. If necessary, we can later add a dedicated method
+ * which can decide to preempt self to force it through the regular
+ * scheduling path.
+ */
+ if (unlikely(wake_flags & WF_EXEC))
+ return prev_cpu;
+
+ if (SCX_HAS_OP(select_cpu)) {
+ s32 cpu;
+ struct task_struct **ddsp_taskp;
+
+ ddsp_taskp = this_cpu_ptr(&direct_dispatch_task);
+ WARN_ON_ONCE(*ddsp_taskp);
+ *ddsp_taskp = p;
+
+ cpu = SCX_CALL_OP_RET(SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU,
+ select_cpu, p, prev_cpu, wake_flags);
+ *ddsp_taskp = NULL;
+ if (ops_cpu_valid(cpu, "from ops.select_cpu()"))
+ return cpu;
+ else
+ return prev_cpu;
+ } else {
+ bool found;
+ s32 cpu;
+
+ cpu = scx_select_cpu_dfl(p, prev_cpu, wake_flags, &found);
+ if (found) {
+ p->scx.slice = SCX_SLICE_DFL;
+ p->scx.ddsp_dsq_id = SCX_DSQ_LOCAL;
+ }
+ return cpu;
+ }
+}
+
+static void set_cpus_allowed_scx(struct task_struct *p,
+ struct affinity_context *ac)
+{
+ set_cpus_allowed_common(p, ac);
+
+ /*
+ * The effective cpumask is stored in @p->cpus_ptr which may temporarily
+ * differ from the configured one in @p->cpus_mask. Always tell the bpf
+ * scheduler the effective one.
+ *
+ * Fine-grained memory write control is enforced by BPF making the const
+ * designation pointless. Cast it away when calling the operation.
+ */
+ if (SCX_HAS_OP(set_cpumask))
+ SCX_CALL_OP(SCX_KF_REST, set_cpumask, p,
+ (struct cpumask *)p->cpus_ptr);
+}
+
+static void reset_idle_masks(void)
+{
+ /*
+ * Consider all online cpus idle. Should converge to the actual state
+ * quickly.
+ */
+ cpumask_copy(idle_masks.cpu, cpu_online_mask);
+ cpumask_copy(idle_masks.smt, cpu_online_mask);
+}
+
+void __scx_update_idle(struct rq *rq, bool idle)
+{
+ int cpu = cpu_of(rq);
+
+ if (SCX_HAS_OP(update_idle)) {
+ SCX_CALL_OP(SCX_KF_REST, update_idle, cpu_of(rq), idle);
+ if (!static_branch_unlikely(&scx_builtin_idle_enabled))
+ return;
+ }
+
+ if (idle)
+ cpumask_set_cpu(cpu, idle_masks.cpu);
+ else
+ cpumask_clear_cpu(cpu, idle_masks.cpu);
+
+#ifdef CONFIG_SCHED_SMT
+ if (sched_smt_active()) {
+ const struct cpumask *smt = cpu_smt_mask(cpu);
+
+ if (idle) {
+ /*
+ * idle_masks.smt handling is racy but that's fine as
+ * it's only for optimization and self-correcting.
+ */
+ for_each_cpu(cpu, smt) {
+ if (!cpumask_test_cpu(cpu, idle_masks.cpu))
+ return;
+ }
+ cpumask_or(idle_masks.smt, idle_masks.smt, smt);
+ } else {
+ cpumask_andnot(idle_masks.smt, idle_masks.smt, smt);
+ }
+ }
+#endif
+}
+
+#else /* CONFIG_SMP */
+
+static bool test_and_clear_cpu_idle(int cpu) { return false; }
+static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags) { return -EBUSY; }
+static void reset_idle_masks(void) {}
+
+#endif /* CONFIG_SMP */
+
+static void task_tick_scx(struct rq *rq, struct task_struct *curr, int queued)
+{
+ update_other_load_avgs(rq);
+ update_curr_scx(rq);
+
+ /*
+ * While bypassing, always resched as we can't trust the slice
+ * management.
+ */
+ if (scx_ops_bypassing())
+ curr->scx.slice = 0;
+ else if (SCX_HAS_OP(tick))
+ SCX_CALL_OP(SCX_KF_REST, tick, curr);
+
+ if (!curr->scx.slice)
+ resched_curr(rq);
+}
+
+static enum scx_task_state scx_get_task_state(const struct task_struct *p)
+{
+ return (p->scx.flags & SCX_TASK_STATE_MASK) >> SCX_TASK_STATE_SHIFT;
+}
+
+static void scx_set_task_state(struct task_struct *p, enum scx_task_state state)
+{
+ enum scx_task_state prev_state = scx_get_task_state(p);
+ bool warn = false;
+
+ BUILD_BUG_ON(SCX_TASK_NR_STATES > (1 << SCX_TASK_STATE_BITS));
+
+ switch (state) {
+ case SCX_TASK_NONE:
+ break;
+ case SCX_TASK_INIT:
+ warn = prev_state != SCX_TASK_NONE;
+ break;
+ case SCX_TASK_READY:
+ warn = prev_state == SCX_TASK_NONE;
+ break;
+ case SCX_TASK_ENABLED:
+ warn = prev_state != SCX_TASK_READY;
+ break;
+ default:
+ WARN_ONCE(true, "sched_ext: Invalid task state transition %d -> %d for %s[%d]",
+ prev_state, state, p->comm, p->pid);
+ return;
+ }
+
+ WARN_ONCE(warn, "sched_ext: Invalid task state transition %d -> %d for %s[%d]",
+ prev_state, state, p->comm, p->pid);
+
+ p->scx.flags &= ~SCX_TASK_STATE_MASK;
+ p->scx.flags |= state << SCX_TASK_STATE_SHIFT;
+}
+
+static int scx_ops_init_task(struct task_struct *p, struct task_group *tg, bool fork)
+{
+ int ret;
+
+ if (SCX_HAS_OP(init_task)) {
+ struct scx_init_task_args args = {
+ .fork = fork,
+ };
+
+ ret = SCX_CALL_OP_RET(SCX_KF_SLEEPABLE, init_task, p, &args);
+ if (unlikely(ret)) {
+ ret = ops_sanitize_err("init_task", ret);
+ return ret;
+ }
+ }
+
+ scx_set_task_state(p, SCX_TASK_INIT);
+
+ return 0;
+}
+
+static void set_task_scx_weight(struct task_struct *p)
+{
+ u32 weight = sched_prio_to_weight[p->static_prio - MAX_RT_PRIO];
+
+ p->scx.weight = sched_weight_to_cgroup(weight);
+}
+
+static void scx_ops_enable_task(struct task_struct *p)
+{
+ lockdep_assert_rq_held(task_rq(p));
+
+ /*
+ * Set the weight before calling ops.enable() so that the scheduler
+ * doesn't see a stale value if they inspect the task struct.
+ */
+ set_task_scx_weight(p);
+ if (SCX_HAS_OP(enable))
+ SCX_CALL_OP(SCX_KF_REST, enable, p);
+ scx_set_task_state(p, SCX_TASK_ENABLED);
+
+ if (SCX_HAS_OP(set_weight))
+ SCX_CALL_OP(SCX_KF_REST, set_weight, p, p->scx.weight);
+}
+
+static void scx_ops_disable_task(struct task_struct *p)
+{
+ lockdep_assert_rq_held(task_rq(p));
+ WARN_ON_ONCE(scx_get_task_state(p) != SCX_TASK_ENABLED);
+
+ if (SCX_HAS_OP(disable))
+ SCX_CALL_OP(SCX_KF_REST, disable, p);
+ scx_set_task_state(p, SCX_TASK_READY);
+}
+
+static void scx_ops_exit_task(struct task_struct *p)
+{
+ struct scx_exit_task_args args = {
+ .cancelled = false,
+ };
+
+ lockdep_assert_rq_held(task_rq(p));
+
+ switch (scx_get_task_state(p)) {
+ case SCX_TASK_NONE:
+ return;
+ case SCX_TASK_INIT:
+ args.cancelled = true;
+ break;
+ case SCX_TASK_READY:
+ break;
+ case SCX_TASK_ENABLED:
+ scx_ops_disable_task(p);
+ break;
+ default:
+ WARN_ON_ONCE(true);
+ return;
+ }
+
+ if (SCX_HAS_OP(exit_task))
+ SCX_CALL_OP(SCX_KF_REST, exit_task, p, &args);
+ scx_set_task_state(p, SCX_TASK_NONE);
+}
+
+void init_scx_entity(struct sched_ext_entity *scx)
+{
+ /*
+ * init_idle() calls this function again after the fork sequence is
+ * complete. Don't touch ->tasks_node as it's already linked.
+ */
+ memset(scx, 0, offsetof(struct sched_ext_entity, tasks_node));
+
+ INIT_LIST_HEAD(&scx->dsq_node);
+ scx->sticky_cpu = -1;
+ scx->holding_cpu = -1;
+ INIT_LIST_HEAD(&scx->runnable_node);
+ scx->ddsp_dsq_id = SCX_DSQ_INVALID;
+ scx->slice = SCX_SLICE_DFL;
+}
+
+void scx_pre_fork(struct task_struct *p)
+{
+ /*
+ * BPF scheduler enable/disable paths want to be able to iterate and
+ * update all tasks which can become complex when racing forks. As
+ * enable/disable are very cold paths, let's use a percpu_rwsem to
+ * exclude forks.
+ */
+ percpu_down_read(&scx_fork_rwsem);
+}
+
+int scx_fork(struct task_struct *p)
+{
+ percpu_rwsem_assert_held(&scx_fork_rwsem);
+
+ if (scx_enabled())
+ return scx_ops_init_task(p, task_group(p), true);
+ else
+ return 0;
+}
+
+void scx_post_fork(struct task_struct *p)
+{
+ if (scx_enabled()) {
+ scx_set_task_state(p, SCX_TASK_READY);
+
+ /*
+ * Enable the task immediately if it's running on sched_ext.
+ * Otherwise, it'll be enabled in switching_to_scx() if and
+ * when it's ever configured to run with a SCHED_EXT policy.
+ */
+ if (p->sched_class == &ext_sched_class) {
+ struct rq_flags rf;
+ struct rq *rq;
+
+ rq = task_rq_lock(p, &rf);
+ scx_ops_enable_task(p);
+ task_rq_unlock(rq, p, &rf);
+ }
+ }
+
+ spin_lock_irq(&scx_tasks_lock);
+ list_add_tail(&p->scx.tasks_node, &scx_tasks);
+ spin_unlock_irq(&scx_tasks_lock);
+
+ percpu_up_read(&scx_fork_rwsem);
+}
+
+void scx_cancel_fork(struct task_struct *p)
+{
+ if (scx_enabled()) {
+ struct rq *rq;
+ struct rq_flags rf;
+
+ rq = task_rq_lock(p, &rf);
+ WARN_ON_ONCE(scx_get_task_state(p) >= SCX_TASK_READY);
+ scx_ops_exit_task(p);
+ task_rq_unlock(rq, p, &rf);
+ }
+
+ percpu_up_read(&scx_fork_rwsem);
+}
+
+void sched_ext_free(struct task_struct *p)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&scx_tasks_lock, flags);
+ list_del_init(&p->scx.tasks_node);
+ spin_unlock_irqrestore(&scx_tasks_lock, flags);
+
+ /*
+ * @p is off scx_tasks and wholly ours. scx_ops_enable()'s READY ->
+ * ENABLED transitions can't race us. Disable ops for @p.
+ */
+ if (scx_get_task_state(p) != SCX_TASK_NONE) {
+ struct rq_flags rf;
+ struct rq *rq;
+
+ rq = task_rq_lock(p, &rf);
+ scx_ops_exit_task(p);
+ task_rq_unlock(rq, p, &rf);
+ }
+}
+
+static void reweight_task_scx(struct rq *rq, struct task_struct *p, int newprio)
+{
+ lockdep_assert_rq_held(task_rq(p));
+
+ set_task_scx_weight(p);
+ if (SCX_HAS_OP(set_weight))
+ SCX_CALL_OP(SCX_KF_REST, set_weight, p, p->scx.weight);
+}
+
+static void prio_changed_scx(struct rq *rq, struct task_struct *p, int oldprio)
+{
+}
+
+static void switching_to_scx(struct rq *rq, struct task_struct *p)
+{
+ scx_ops_enable_task(p);
+
+ /*
+ * set_cpus_allowed_scx() is not called while @p is associated with a
+ * different scheduler class. Keep the BPF scheduler up-to-date.
+ */
+ if (SCX_HAS_OP(set_cpumask))
+ SCX_CALL_OP(SCX_KF_REST, set_cpumask, p,
+ (struct cpumask *)p->cpus_ptr);
+}
+
+static void switched_from_scx(struct rq *rq, struct task_struct *p)
+{
+ scx_ops_disable_task(p);
+}
+
+static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p, int wake_flags) {}
+static void switched_to_scx(struct rq *rq, struct task_struct *p) {}
+
+/*
+ * Omitted operations:
+ *
+ * - wakeup_preempt: NOOP as it isn't useful in the wakeup path because the task
+ * isn't tied to the CPU at that point.
+ *
+ * - migrate_task_rq: Unnecessary as the task to CPU mapping is transient.
+ *
+ * - task_fork/dead: We need fork/dead notifications for all tasks regardless of
+ * their current sched_class. Call them directly from sched core instead.
+ *
+ * - task_woken: Unnecessary.
+ */
+DEFINE_SCHED_CLASS(ext) = {
+ .enqueue_task = enqueue_task_scx,
+ .dequeue_task = dequeue_task_scx,
+ .yield_task = yield_task_scx,
+ .yield_to_task = yield_to_task_scx,
+
+ .wakeup_preempt = wakeup_preempt_scx,
+
+ .pick_next_task = pick_next_task_scx,
+
+ .put_prev_task = put_prev_task_scx,
+ .set_next_task = set_next_task_scx,
+
+#ifdef CONFIG_SMP
+ .balance = balance_scx,
+ .select_task_rq = select_task_rq_scx,
+ .set_cpus_allowed = set_cpus_allowed_scx,
+#endif
+
+ .task_tick = task_tick_scx,
+
+ .switching_to = switching_to_scx,
+ .switched_from = switched_from_scx,
+ .switched_to = switched_to_scx,
+ .reweight_task = reweight_task_scx,
+ .prio_changed = prio_changed_scx,
+
+ .update_curr = update_curr_scx,
+
+#ifdef CONFIG_UCLAMP_TASK
+ .uclamp_enabled = 0,
+#endif
+};
+
+static void init_dsq(struct scx_dispatch_q *dsq, u64 dsq_id)
+{
+ memset(dsq, 0, sizeof(*dsq));
+
+ raw_spin_lock_init(&dsq->lock);
+ INIT_LIST_HEAD(&dsq->list);
+ dsq->id = dsq_id;
+}
+
+static struct scx_dispatch_q *create_dsq(u64 dsq_id, int node)
+{
+ struct scx_dispatch_q *dsq;
+ int ret;
+
+ if (dsq_id & SCX_DSQ_FLAG_BUILTIN)
+ return ERR_PTR(-EINVAL);
+
+ dsq = kmalloc_node(sizeof(*dsq), GFP_KERNEL, node);
+ if (!dsq)
+ return ERR_PTR(-ENOMEM);
+
+ init_dsq(dsq, dsq_id);
+
+ ret = rhashtable_insert_fast(&dsq_hash, &dsq->hash_node,
+ dsq_hash_params);
+ if (ret) {
+ kfree(dsq);
+ return ERR_PTR(ret);
+ }
+ return dsq;
+}
+
+static void free_dsq_irq_workfn(struct irq_work *irq_work)
+{
+ struct llist_node *to_free = llist_del_all(&dsqs_to_free);
+ struct scx_dispatch_q *dsq, *tmp_dsq;
+
+ llist_for_each_entry_safe(dsq, tmp_dsq, to_free, free_node)
+ kfree_rcu(dsq, rcu);
+}
+
+static DEFINE_IRQ_WORK(free_dsq_irq_work, free_dsq_irq_workfn);
+
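+/*
+ * Destroy the user DSQ identified by @dsq_id. Triggers an ops error if the DSQ
+ * still has tasks queued. Freeing is deferred to an irq_work so that this can
+ * be called from any context.
+ */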
+static void destroy_dsq(u64 dsq_id)
+{
+ struct scx_dispatch_q *dsq;
+ unsigned long flags;
+
+ rcu_read_lock();
+
+ dsq = find_user_dsq(dsq_id);
+ if (!dsq)
+ goto out_unlock_rcu;
+
+ raw_spin_lock_irqsave(&dsq->lock, flags);
+
+ if (dsq->nr) {
+ scx_ops_error("attempting to destroy in-use dsq 0x%016llx (nr=%u)",
+ dsq->id, dsq->nr);
+ goto out_unlock_dsq;
+ }
+
+ if (rhashtable_remove_fast(&dsq_hash, &dsq->hash_node, dsq_hash_params))
+ goto out_unlock_dsq;
+
+ /*
+ * Mark dead by invalidating ->id to prevent dispatch_enqueue() from
+ * queueing more tasks. As this function can be called from anywhere,
+ * freeing is bounced through an irq work to avoid nesting RCU
+ * operations inside scheduler locks.
+ */
+ dsq->id = SCX_DSQ_INVALID;
+ llist_add(&dsq->free_node, &dsqs_to_free);
+ irq_work_queue(&free_dsq_irq_work);
+
+out_unlock_dsq:
+ raw_spin_unlock_irqrestore(&dsq->lock, flags);
+out_unlock_rcu:
+ rcu_read_unlock();
+}
+
+
+/********************************************************************************
+ * Sysfs interface and ops enable/disable.
+ */
+
+#define SCX_ATTR(_name) \
+ static struct kobj_attribute scx_attr_##_name = { \
+ .attr = { .name = __stringify(_name), .mode = 0444 }, \
+ .show = scx_attr_##_name##_show, \
+ }
+
+static ssize_t scx_attr_state_show(struct kobject *kobj,
+ struct kobj_attribute *ka, char *buf)
+{
+ return sysfs_emit(buf, "%s\n",
+ scx_ops_enable_state_str[scx_ops_enable_state()]);
+}
+SCX_ATTR(state);
+
+static ssize_t scx_attr_switch_all_show(struct kobject *kobj,
+ struct kobj_attribute *ka, char *buf)
+{
+ return sysfs_emit(buf, "%d\n", READ_ONCE(scx_switching_all));
+}
+SCX_ATTR(switch_all);
+
+static struct attribute *scx_global_attrs[] = {
+ &scx_attr_state.attr,
+ &scx_attr_switch_all.attr,
+ NULL,
+};
+
+static const struct attribute_group scx_global_attr_group = {
+ .attrs = scx_global_attrs,
+};
+
+static void scx_kobj_release(struct kobject *kobj)
+{
+ kfree(kobj);
+}
+
+static ssize_t scx_attr_ops_show(struct kobject *kobj,
+ struct kobj_attribute *ka, char *buf)
+{
+ return sysfs_emit(buf, "%s\n", scx_ops.name);
+}
+SCX_ATTR(ops);
+
+static struct attribute *scx_sched_attrs[] = {
+ &scx_attr_ops.attr,
+ NULL,
+};
+ATTRIBUTE_GROUPS(scx_sched);
+
+static const struct kobj_type scx_ktype = {
+ .release = scx_kobj_release,
+ .sysfs_ops = &kobj_sysfs_ops,
+ .default_groups = scx_sched_groups,
+};
+
+static int scx_uevent(const struct kobject *kobj, struct kobj_uevent_env *env)
+{
+ return add_uevent_var(env, "SCXOPS=%s", scx_ops.name);
+}
+
+static const struct kset_uevent_ops scx_uevent_ops = {
+ .uevent = scx_uevent,
+};
+
+/*
+ * Used by sched_fork() and __setscheduler_prio() to pick the matching
+ * sched_class. dl/rt are already handled.
+ */
+bool task_should_scx(struct task_struct *p)
+{
+ if (!scx_enabled() ||
+ unlikely(scx_ops_enable_state() == SCX_OPS_DISABLING))
+ return false;
+ if (READ_ONCE(scx_switching_all))
+ return true;
+ return p->policy == SCHED_EXT;
+}
+
+/**
+ * scx_ops_bypass - [Un]bypass scx_ops and guarantee forward progress
+ *
+ * Bypassing guarantees that all runnable tasks make forward progress without
+ * trusting the BPF scheduler. We can't grab any mutexes or rwsems as they might
+ * be held by tasks that the BPF scheduler is forgetting to run, which
+ * unfortunately also excludes toggling the static branches.
+ *
+ * Let's work around it by overriding a couple of ops and modifying behaviors based on
+ * the DISABLING state and then cycling the queued tasks through dequeue/enqueue
+ * to force global FIFO scheduling.
+ *
+ * a. ops.enqueue() is ignored and tasks are queued in simple global FIFO order.
+ *
+ * b. ops.dispatch() is ignored.
+ *
+ * c. balance_scx() never sets %SCX_TASK_BAL_KEEP as the slice value can't be
+ * trusted. Whenever a tick triggers, the running task is rotated to the tail
+ * of the queue.
+ *
+ * d. pick_next_task() suppresses zero slice warning.
+ */
+static void scx_ops_bypass(bool bypass)
+{
+ int depth, cpu;
+
+ if (bypass) {
+ depth = atomic_inc_return(&scx_ops_bypass_depth);
+ WARN_ON_ONCE(depth <= 0);
+ if (depth != 1)
+ return;
+ } else {
+ depth = atomic_dec_return(&scx_ops_bypass_depth);
+ WARN_ON_ONCE(depth < 0);
+ if (depth != 0)
+ return;
+ }
+
+ mutex_lock(&scx_ops_enable_mutex);
+ if (!scx_enabled())
+ goto out_unlock;
+
+ /*
+ * No task property is changing. We just need to make sure all currently
+ * queued tasks are re-queued according to the new scx_ops_bypassing()
+ * state. As an optimization, walk each rq's runnable_list instead of
+ * the scx_tasks list.
+ *
+ * This function can't trust the scheduler and thus can't use
+ * cpus_read_lock(). Walk all possible CPUs instead of online.
+ */
+ for_each_possible_cpu(cpu) {
+ struct rq *rq = cpu_rq(cpu);
+ struct rq_flags rf;
+ struct task_struct *p, *n;
+
+ rq_lock_irqsave(rq, &rf);
+
+ /*
+ * The use of list_for_each_entry_safe_reverse() is required
+ * because each task is going to be removed from and added back
+ * to the runnable_list during iteration. Because they're added
+ * to the tail of the list, safe reverse iteration can still
+ * visit all nodes.
+ */
+ list_for_each_entry_safe_reverse(p, n, &rq->scx.runnable_list,
+ scx.runnable_node) {
+ struct sched_enq_and_set_ctx ctx;
+
+ /* cycling deq/enq is enough, see the function comment */
+ sched_deq_and_put_task(p, DEQUEUE_SAVE | DEQUEUE_MOVE, &ctx);
+ sched_enq_and_set_task(&ctx);
+ }
+
+ rq_unlock_irqrestore(rq, &rf);
+ }
+
+out_unlock:
+ mutex_unlock(&scx_ops_enable_mutex);
+}
+
+static void free_exit_info(struct scx_exit_info *ei)
+{
+ kfree(ei->msg);
+ kfree(ei->bt);
+ kfree(ei);
+}
+
+static struct scx_exit_info *alloc_exit_info(void)
+{
+ struct scx_exit_info *ei;
+
+ ei = kzalloc(sizeof(*ei), GFP_KERNEL);
+ if (!ei)
+ return NULL;
+
+ ei->bt = kcalloc(SCX_EXIT_BT_LEN, sizeof(ei->bt[0]), GFP_KERNEL);
+ ei->msg = kzalloc(SCX_EXIT_MSG_LEN, GFP_KERNEL);
+
+ if (!ei->bt || !ei->msg) {
+ free_exit_info(ei);
+ return NULL;
+ }
+
+ return ei;
+}
+
+static const char *scx_exit_reason(enum scx_exit_kind kind)
+{
+ switch (kind) {
+ case SCX_EXIT_UNREG:
+ return "Scheduler unregistered from user space";
+ case SCX_EXIT_UNREG_BPF:
+ return "Scheduler unregistered from BPF";
+ case SCX_EXIT_UNREG_KERN:
+ return "Scheduler unregistered from the main kernel";
+ case SCX_EXIT_ERROR:
+ return "runtime error";
+ case SCX_EXIT_ERROR_BPF:
+ return "scx_bpf_error";
+ default:
+ return "<UNKNOWN>";
+ }
+}
+
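+/*
+ * The actual disable path executed from the helper kthread. Guarantees forward
+ * progress by bypassing scx_ops, switches every task back to its non-SCX
+ * sched class, turns off all the static branches and tears down the scheduler
+ * state.
+ */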
+static void scx_ops_disable_workfn(struct kthread_work *work)
+{
+ struct scx_exit_info *ei = scx_exit_info;
+ struct scx_task_iter sti;
+ struct task_struct *p;
+ struct rhashtable_iter rht_iter;
+ struct scx_dispatch_q *dsq;
+ int i, kind;
+
+ kind = atomic_read(&scx_exit_kind);
+ while (true) {
+ /*
+ * NONE indicates that a new scx_ops has been registered since
+ * disable was scheduled - don't kill the new ops. DONE
+ * indicates that the ops has already been disabled.
+ */
+ if (kind == SCX_EXIT_NONE || kind == SCX_EXIT_DONE)
+ return;
+ if (atomic_try_cmpxchg(&scx_exit_kind, &kind, SCX_EXIT_DONE))
+ break;
+ }
+ ei->kind = kind;
+ ei->reason = scx_exit_reason(ei->kind);
+
+ /* guarantee forward progress by bypassing scx_ops */
+ scx_ops_bypass(true);
+
+ switch (scx_ops_set_enable_state(SCX_OPS_DISABLING)) {
+ case SCX_OPS_DISABLING:
+ WARN_ONCE(true, "sched_ext: duplicate disabling instance?");
+ break;
+ case SCX_OPS_DISABLED:
+ pr_warn("sched_ext: ops error detected without ops (%s)\n",
+ scx_exit_info->msg);
+ WARN_ON_ONCE(scx_ops_set_enable_state(SCX_OPS_DISABLED) !=
+ SCX_OPS_DISABLING);
+ goto done;
+ default:
+ break;
+ }
+
+ /*
+ * Here, every runnable task is guaranteed to make forward progress and
+ * we can safely use blocking synchronization constructs. Actually
+ * disable ops.
+ */
+ mutex_lock(&scx_ops_enable_mutex);
+
+ static_branch_disable(&__scx_switched_all);
+ WRITE_ONCE(scx_switching_all, false);
+
+ /*
+ * Avoid racing against fork. See scx_ops_enable() for explanation on
+ * the locking order.
+ */
+ percpu_down_write(&scx_fork_rwsem);
+ cpus_read_lock();
+
+ spin_lock_irq(&scx_tasks_lock);
+ scx_task_iter_init(&sti);
+ while ((p = scx_task_iter_next_filtered_locked(&sti))) {
+ const struct sched_class *old_class = p->sched_class;
+ struct sched_enq_and_set_ctx ctx;
+
+ sched_deq_and_put_task(p, DEQUEUE_SAVE | DEQUEUE_MOVE, &ctx);
+
+ p->scx.slice = min_t(u64, p->scx.slice, SCX_SLICE_DFL);
+ __setscheduler_prio(p, p->prio);
+ check_class_changing(task_rq(p), p, old_class);
+
+ sched_enq_and_set_task(&ctx);
+
+ check_class_changed(task_rq(p), p, old_class, p->prio);
+ scx_ops_exit_task(p);
+ }
+ scx_task_iter_exit(&sti);
+ spin_unlock_irq(&scx_tasks_lock);
+
+ /* no task is on scx, turn off all the switches and flush in-progress calls */
+ static_branch_disable_cpuslocked(&__scx_ops_enabled);
+ for (i = SCX_OPI_BEGIN; i < SCX_OPI_END; i++)
+ static_branch_disable_cpuslocked(&scx_has_op[i]);
+ static_branch_disable_cpuslocked(&scx_ops_enq_last);
+ static_branch_disable_cpuslocked(&scx_ops_enq_exiting);
+ static_branch_disable_cpuslocked(&scx_builtin_idle_enabled);
+ synchronize_rcu();
+
+ cpus_read_unlock();
+ percpu_up_write(&scx_fork_rwsem);
+
+ if (ei->kind >= SCX_EXIT_ERROR) {
+ printk(KERN_ERR "sched_ext: BPF scheduler \"%s\" errored, disabling\n", scx_ops.name);
+
+ if (ei->msg[0] == '\0')
+ printk(KERN_ERR "sched_ext: %s\n", ei->reason);
+ else
+ printk(KERN_ERR "sched_ext: %s (%s)\n", ei->reason, ei->msg);
+
+ stack_trace_print(ei->bt, ei->bt_len, 2);
+ }
+
+ if (scx_ops.exit)
+ SCX_CALL_OP(SCX_KF_UNLOCKED, exit, ei);
+
+ /*
+ * Delete the kobject from the hierarchy eagerly in addition to just
+ * dropping a reference. Otherwise, if the object is deleted
+ * asynchronously, sysfs could observe an object of the same name still
+ * in the hierarchy when another scheduler is loaded.
+ */
+ kobject_del(scx_root_kobj);
+ kobject_put(scx_root_kobj);
+ scx_root_kobj = NULL;
+
+ memset(&scx_ops, 0, sizeof(scx_ops));
+
+ rhashtable_walk_enter(&dsq_hash, &rht_iter);
+ do {
+ rhashtable_walk_start(&rht_iter);
+
+ while ((dsq = rhashtable_walk_next(&rht_iter)) && !IS_ERR(dsq))
+ destroy_dsq(dsq->id);
+
+ rhashtable_walk_stop(&rht_iter);
+ } while (dsq == ERR_PTR(-EAGAIN));
+ rhashtable_walk_exit(&rht_iter);
+
+ free_percpu(scx_dsp_buf);
+ scx_dsp_buf = NULL;
+ scx_dsp_max_batch = 0;
+
+ free_exit_info(scx_exit_info);
+ scx_exit_info = NULL;
+
+ mutex_unlock(&scx_ops_enable_mutex);
+
+ WARN_ON_ONCE(scx_ops_set_enable_state(SCX_OPS_DISABLED) !=
+ SCX_OPS_DISABLING);
+done:
+ scx_ops_bypass(false);
+}
+
+static DEFINE_KTHREAD_WORK(scx_ops_disable_work, scx_ops_disable_workfn);
+
+static void schedule_scx_ops_disable_work(void)
+{
+ struct kthread_worker *helper = READ_ONCE(scx_ops_helper);
+
+ /*
+ * We may be called spuriously before the first bpf_sched_ext_reg(). If
+ * scx_ops_helper isn't set up yet, there's nothing to do.
+ */
+ if (helper)
+ kthread_queue_work(helper, &scx_ops_disable_work);
+}
+
+static void scx_ops_disable(enum scx_exit_kind kind)
+{
+ int none = SCX_EXIT_NONE;
+
+ if (WARN_ON_ONCE(kind == SCX_EXIT_NONE || kind == SCX_EXIT_DONE))
+ kind = SCX_EXIT_ERROR;
+
+ atomic_try_cmpxchg(&scx_exit_kind, &none, kind);
+
+ schedule_scx_ops_disable_work();
+}
+
+static void scx_ops_error_irq_workfn(struct irq_work *irq_work)
+{
+ schedule_scx_ops_disable_work();
+}
+
+static DEFINE_IRQ_WORK(scx_ops_error_irq_work, scx_ops_error_irq_workfn);
+
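+/*
+ * Record an exit request of @kind with a formatted message and kick the
+ * irq_work which schedules the disable path. Only the first exit request since
+ * the last enable is recorded.
+ */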
+static __printf(3, 4) void scx_ops_exit_kind(enum scx_exit_kind kind,
+ s64 exit_code,
+ const char *fmt, ...)
+{
+ struct scx_exit_info *ei = scx_exit_info;
+ int none = SCX_EXIT_NONE;
+ va_list args;
+
+ if (!atomic_try_cmpxchg(&scx_exit_kind, &none, kind))
+ return;
+
+ ei->exit_code = exit_code;
+
+ if (kind >= SCX_EXIT_ERROR)
+ ei->bt_len = stack_trace_save(ei->bt, SCX_EXIT_BT_LEN, 1);
+
+ va_start(args, fmt);
+ vscnprintf(ei->msg, SCX_EXIT_MSG_LEN, fmt, args);
+ va_end(args);
+
+ irq_work_queue(&scx_ops_error_irq_work);
+}
+
+static struct kthread_worker *scx_create_rt_helper(const char *name)
+{
+ struct kthread_worker *helper;
+
+ helper = kthread_create_worker(0, name);
+ if (helper)
+ sched_set_fifo(helper->task);
+ return helper;
+}
+
+static int validate_ops(const struct sched_ext_ops *ops)
+{
+ /*
+ * It doesn't make sense to specify the SCX_OPS_ENQ_LAST flag if the
+ * ops.enqueue() callback isn't implemented.
+ */
+ if ((ops->flags & SCX_OPS_ENQ_LAST) && !ops->enqueue) {
+ scx_ops_error("SCX_OPS_ENQ_LAST requires ops.enqueue() to be implemented");
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
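+/*
+ * Enable @ops as the system-wide BPF scheduler: set up the helper kthread and
+ * sysfs objects, call ops.init(), initialize and enable every existing task
+ * and start switching tasks over. Failures after the PREPPING transition are
+ * handled by the disable path.
+ */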
+static int scx_ops_enable(struct sched_ext_ops *ops)
+{
+ struct scx_task_iter sti;
+ struct task_struct *p;
+ int i, ret;
+
+ mutex_lock(&scx_ops_enable_mutex);
+
+ if (!scx_ops_helper) {
+ WRITE_ONCE(scx_ops_helper,
+ scx_create_rt_helper("sched_ext_ops_helper"));
+ if (!scx_ops_helper) {
+ ret = -ENOMEM;
+ goto err_unlock;
+ }
+ }
+
+ if (scx_ops_enable_state() != SCX_OPS_DISABLED) {
+ ret = -EBUSY;
+ goto err_unlock;
+ }
+
+ scx_root_kobj = kzalloc(sizeof(*scx_root_kobj), GFP_KERNEL);
+ if (!scx_root_kobj) {
+ ret = -ENOMEM;
+ goto err_unlock;
+ }
+
+ scx_root_kobj->kset = scx_kset;
+ ret = kobject_init_and_add(scx_root_kobj, &scx_ktype, NULL, "root");
+ if (ret < 0)
+ goto err;
+
+ scx_exit_info = alloc_exit_info();
+ if (!scx_exit_info) {
+ ret = -ENOMEM;
+ goto err_del;
+ }
+
+ /*
+ * Set scx_ops, transition to PREPPING and clear exit info to arm the
+ * disable path. Failure triggers full disabling from here on.
+ */
+ scx_ops = *ops;
+
+ WARN_ON_ONCE(scx_ops_set_enable_state(SCX_OPS_PREPPING) !=
+ SCX_OPS_DISABLED);
+
+ atomic_set(&scx_exit_kind, SCX_EXIT_NONE);
+ scx_warned_zero_slice = false;
+
+ /*
+ * Keep CPUs stable during enable so that the BPF scheduler can track
+ * online CPUs by watching ->on/offline_cpu() after ->init().
+ */
+ cpus_read_lock();
+
+ if (scx_ops.init) {
+ ret = SCX_CALL_OP_RET(SCX_KF_SLEEPABLE, init);
+ if (ret) {
+ ret = ops_sanitize_err("init", ret);
+ goto err_disable_unlock_cpus;
+ }
+ }
+
+ cpus_read_unlock();
+
+ ret = validate_ops(ops);
+ if (ret)
+ goto err_disable;
+
+ WARN_ON_ONCE(scx_dsp_buf);
+ scx_dsp_max_batch = ops->dispatch_max_batch ?: SCX_DSP_DFL_MAX_BATCH;
+ scx_dsp_buf = __alloc_percpu(sizeof(scx_dsp_buf[0]) * scx_dsp_max_batch,
+ __alignof__(scx_dsp_buf[0]));
+ if (!scx_dsp_buf) {
+ ret = -ENOMEM;
+ goto err_disable;
+ }
+
+ /*
+ * Lock out forks before opening the floodgate so that they don't wander
+ * into the operations prematurely.
+ *
+ * We don't need to keep the CPUs stable but grab cpus_read_lock() to
+ * ease future locking changes for cgroup support.
+ *
+ * Note that cpu_hotplug_lock must nest inside scx_fork_rwsem due to the
+ * following dependency chain:
+ *
+ * scx_fork_rwsem --> pernet_ops_rwsem --> cpu_hotplug_lock
+ */
+ percpu_down_write(&scx_fork_rwsem);
+ cpus_read_lock();
+
+ for (i = SCX_OPI_NORMAL_BEGIN; i < SCX_OPI_NORMAL_END; i++)
+ if (((void (**)(void))ops)[i])
+ static_branch_enable_cpuslocked(&scx_has_op[i]);
+
+ if (ops->flags & SCX_OPS_ENQ_LAST)
+ static_branch_enable_cpuslocked(&scx_ops_enq_last);
+
+ if (ops->flags & SCX_OPS_ENQ_EXITING)
+ static_branch_enable_cpuslocked(&scx_ops_enq_exiting);
+
+ if (!ops->update_idle || (ops->flags & SCX_OPS_KEEP_BUILTIN_IDLE)) {
+ reset_idle_masks();
+ static_branch_enable_cpuslocked(&scx_builtin_idle_enabled);
+ } else {
+ static_branch_disable_cpuslocked(&scx_builtin_idle_enabled);
+ }
+
+ static_branch_enable_cpuslocked(&__scx_ops_enabled);
+
+ /*
+ * Enable ops for every task. Fork is excluded by scx_fork_rwsem
+ * preventing new tasks from being added. No need to exclude tasks
+ * leaving as sched_ext_free() can handle both prepped and enabled
+ * tasks. Prep all tasks first and then enable them with preemption
+ * disabled.
+ */
+ spin_lock_irq(&scx_tasks_lock);
+
+ scx_task_iter_init(&sti);
+ while ((p = scx_task_iter_next_filtered(&sti))) {
+ get_task_struct(p);
+ spin_unlock_irq(&scx_tasks_lock);
+
+ ret = scx_ops_init_task(p, task_group(p), false);
+ if (ret) {
+ put_task_struct(p);
+ spin_lock_irq(&scx_tasks_lock);
+ scx_task_iter_exit(&sti);
+ spin_unlock_irq(&scx_tasks_lock);
+ pr_err("sched_ext: ops.init_task() failed (%d) for %s[%d] while loading\n",
+ ret, p->comm, p->pid);
+ goto err_disable_unlock_all;
+ }
+
+ put_task_struct(p);
+ spin_lock_irq(&scx_tasks_lock);
+ }
+ scx_task_iter_exit(&sti);
+
+ /*
+ * All tasks are prepped but are still ops-disabled. Ensure that
+ * %current can't be scheduled out and switch everyone.
+ * preempt_disable() is necessary because we can't guarantee that
+ * %current won't be starved if scheduled out while switching.
+ */
+ preempt_disable();
+
+ /*
+ * From here on, the disable path must assume that tasks have ops
+ * enabled and need to be recovered.
+ *
+ * Transition to ENABLING fails iff the BPF scheduler has already
+ * triggered scx_bpf_error(). Returning an error code here would lose
+ * the recorded error information. Exit indicating success so that the
+ * error is notified through ops.exit() with all the details.
+ */
+ if (!scx_ops_tryset_enable_state(SCX_OPS_ENABLING, SCX_OPS_PREPPING)) {
+ preempt_enable();
+ spin_unlock_irq(&scx_tasks_lock);
+ WARN_ON_ONCE(atomic_read(&scx_exit_kind) == SCX_EXIT_NONE);
+ ret = 0;
+ goto err_disable_unlock_all;
+ }
+
+ /*
+ * We're fully committed and can't fail. The PREPPED -> ENABLED
+ * transitions here are synchronized against sched_ext_free() through
+ * scx_tasks_lock.
+ */
+ WRITE_ONCE(scx_switching_all, !(ops->flags & SCX_OPS_SWITCH_PARTIAL));
+
+ scx_task_iter_init(&sti);
+ while ((p = scx_task_iter_next_filtered_locked(&sti))) {
+ const struct sched_class *old_class = p->sched_class;
+ struct sched_enq_and_set_ctx ctx;
+
+ sched_deq_and_put_task(p, DEQUEUE_SAVE | DEQUEUE_MOVE, &ctx);
+
+ scx_set_task_state(p, SCX_TASK_READY);
+ __setscheduler_prio(p, p->prio);
+ check_class_changing(task_rq(p), p, old_class);
+
+ sched_enq_and_set_task(&ctx);
+
+ check_class_changed(task_rq(p), p, old_class, p->prio);
+ }
+ scx_task_iter_exit(&sti);
+
+ spin_unlock_irq(&scx_tasks_lock);
+ preempt_enable();
+ cpus_read_unlock();
+ percpu_up_write(&scx_fork_rwsem);
+
+ /* see above ENABLING transition for the explanation on exiting with 0 */
+ if (!scx_ops_tryset_enable_state(SCX_OPS_ENABLED, SCX_OPS_ENABLING)) {
+ WARN_ON_ONCE(atomic_read(&scx_exit_kind) == SCX_EXIT_NONE);
+ ret = 0;
+ goto err_disable;
+ }
+
+ if (!(ops->flags & SCX_OPS_SWITCH_PARTIAL))
+ static_branch_enable(&__scx_switched_all);
+
+ kobject_uevent(scx_root_kobj, KOBJ_ADD);
+ mutex_unlock(&scx_ops_enable_mutex);
+
+ return 0;
+
+err_del:
+ kobject_del(scx_root_kobj);
+err:
+ kobject_put(scx_root_kobj);
+ scx_root_kobj = NULL;
+ if (scx_exit_info) {
+ free_exit_info(scx_exit_info);
+ scx_exit_info = NULL;
+ }
+err_unlock:
+ mutex_unlock(&scx_ops_enable_mutex);
+ return ret;
+
+err_disable_unlock_all:
+ percpu_up_write(&scx_fork_rwsem);
+err_disable_unlock_cpus:
+ cpus_read_unlock();
+err_disable:
+ mutex_unlock(&scx_ops_enable_mutex);
+ /* must be fully disabled before returning */
+ scx_ops_disable(SCX_EXIT_ERROR);
+ kthread_flush_work(&scx_ops_disable_work);
+ return ret;
+}
+
+
+/********************************************************************************
+ * bpf_struct_ops plumbing.
+ */
+#include <linux/bpf_verifier.h>
+#include <linux/bpf.h>
+#include <linux/btf.h>
+
+extern struct btf *btf_vmlinux;
+static const struct btf_type *task_struct_type;
+static u32 task_struct_type_id;
+
+/* Make the 2nd argument of .dispatch a pointer that can be NULL. */
+static bool promote_dispatch_2nd_arg(int off, int size,
+ enum bpf_access_type type,
+ const struct bpf_prog *prog,
+ struct bpf_insn_access_aux *info)
+{
+ struct btf *btf = bpf_get_btf_vmlinux();
+ const struct bpf_struct_ops_desc *st_ops_desc;
+ const struct btf_member *member;
+ const struct btf_type *t;
+ u32 btf_id, member_idx;
+ const char *mname;
+
+ /* btf_id should be the type id of struct sched_ext_ops */
+ btf_id = prog->aux->attach_btf_id;
+ st_ops_desc = bpf_struct_ops_find(btf, btf_id);
+ if (!st_ops_desc)
+ return false;
+
+ /* BTF type of struct sched_ext_ops */
+ t = st_ops_desc->type;
+
+ member_idx = prog->expected_attach_type;
+ if (member_idx >= btf_type_vlen(t))
+ return false;
+
+ /*
+ * Get the member name of this struct_ops program, which corresponds to
+ * a field in struct sched_ext_ops. For example, the member name of the
+ * dispatch struct_ops program (callback) is "dispatch".
+ */
+ member = &btf_type_member(t)[member_idx];
+ mname = btf_name_by_offset(btf_vmlinux, member->name_off);
+
+ /*
+ * Check if it is the second argument of the function pointer at
+ * "dispatch" in struct sched_ext_ops. The arguments of struct_ops
+ * operators are sequential and 64-bit, so the second argument is at
+ * offset sizeof(__u64).
+ */
+ if (strcmp(mname, "dispatch") == 0 &&
+ off == sizeof(__u64)) {
+ /*
+ * The value is a pointer to a type (struct task_struct) given
+ * by a BTF ID (PTR_TO_BTF_ID). It is trusted (PTR_TRUSTED),
+ * however, can be a NULL (PTR_MAYBE_NULL). The BPF program
+ * should check the pointer to make sure it is not NULL before
+ * using it, or the verifier will reject the program.
+ *
+ * Longer term, this is something that should be addressed by
+ * BTF, and be fully contained within the verifier.
+ */
+ info->reg_type = PTR_MAYBE_NULL | PTR_TO_BTF_ID | PTR_TRUSTED;
+ info->btf = btf_vmlinux;
+ info->btf_id = task_struct_type_id;
+
+ return true;
+ }
+
+ return false;
+}
+
+static bool bpf_scx_is_valid_access(int off, int size,
+ enum bpf_access_type type,
+ const struct bpf_prog *prog,
+ struct bpf_insn_access_aux *info)
+{
+ if (type != BPF_READ)
+ return false;
+ if (promote_dispatch_2nd_arg(off, size, type, prog, info))
+ return true;
+ if (off < 0 || off >= sizeof(__u64) * MAX_BPF_FUNC_ARGS)
+ return false;
+ if (off % size != 0)
+ return false;
+
+ return btf_ctx_access(off, size, type, prog, info);
+}
+
+static int bpf_scx_btf_struct_access(struct bpf_verifier_log *log,
+ const struct bpf_reg_state *reg, int off,
+ int size)
+{
+ const struct btf_type *t;
+
+ t = btf_type_by_id(reg->btf, reg->btf_id);
+ if (t == task_struct_type) {
+ if (off >= offsetof(struct task_struct, scx.slice) &&
+ off + size <= offsetofend(struct task_struct, scx.slice))
+ return SCALAR_VALUE;
+ }
+
+ return -EACCES;
+}
+
+static const struct bpf_func_proto *
+bpf_scx_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+ switch (func_id) {
+ case BPF_FUNC_task_storage_get:
+ return &bpf_task_storage_get_proto;
+ case BPF_FUNC_task_storage_delete:
+ return &bpf_task_storage_delete_proto;
+ default:
+ return bpf_base_func_proto(func_id, prog);
+ }
+}
+
+static const struct bpf_verifier_ops bpf_scx_verifier_ops = {
+ .get_func_proto = bpf_scx_get_func_proto,
+ .is_valid_access = bpf_scx_is_valid_access,
+ .btf_struct_access = bpf_scx_btf_struct_access,
+};
+
+static int bpf_scx_init_member(const struct btf_type *t,
+ const struct btf_member *member,
+ void *kdata, const void *udata)
+{
+ const struct sched_ext_ops *uops = udata;
+ struct sched_ext_ops *ops = kdata;
+ u32 moff = __btf_member_bit_offset(t, member) / 8;
+ int ret;
+
+ switch (moff) {
+ case offsetof(struct sched_ext_ops, dispatch_max_batch):
+ if (*(u32 *)(udata + moff) > INT_MAX)
+ return -E2BIG;
+ ops->dispatch_max_batch = *(u32 *)(udata + moff);
+ return 1;
+ case offsetof(struct sched_ext_ops, flags):
+ if (*(u64 *)(udata + moff) & ~SCX_OPS_ALL_FLAGS)
+ return -EINVAL;
+ ops->flags = *(u64 *)(udata + moff);
+ return 1;
+ case offsetof(struct sched_ext_ops, name):
+ ret = bpf_obj_name_cpy(ops->name, uops->name,
+ sizeof(ops->name));
+ if (ret < 0)
+ return ret;
+ if (ret == 0)
+ return -EINVAL;
+ return 1;
+ }
+
+ return 0;
+}
+
+static int bpf_scx_check_member(const struct btf_type *t,
+ const struct btf_member *member,
+ const struct bpf_prog *prog)
+{
+ u32 moff = __btf_member_bit_offset(t, member) / 8;
+
+ switch (moff) {
+ case offsetof(struct sched_ext_ops, init_task):
+ case offsetof(struct sched_ext_ops, init):
+ case offsetof(struct sched_ext_ops, exit):
+ break;
+ default:
+ if (prog->sleepable)
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static int bpf_scx_reg(void *kdata)
+{
+ return scx_ops_enable(kdata);
+}
+
+static void bpf_scx_unreg(void *kdata)
+{
+ scx_ops_disable(SCX_EXIT_UNREG);
+ kthread_flush_work(&scx_ops_disable_work);
+}
+
+static int bpf_scx_init(struct btf *btf)
+{
+ s32 type_id;
+
+ type_id = btf_find_by_name_kind(btf, "task_struct", BTF_KIND_STRUCT);
+ if (type_id < 0)
+ return -EINVAL;
+ task_struct_type = btf_type_by_id(btf, type_id);
+ task_struct_type_id = type_id;
+
+ return 0;
+}
+
+static int bpf_scx_update(void *kdata, void *old_kdata)
+{
+ /*
+ * sched_ext does not support updating the actively-loaded BPF
+ * scheduler, as registering a BPF scheduler can always fail if the
+ * scheduler returns an error code for e.g. ops.init(), ops.init_task(),
+ * etc. Similarly, we can always race with unregistration happening
+ * elsewhere, such as with sysrq.
+ */
+ return -EOPNOTSUPP;
+}
+
+static int bpf_scx_validate(void *kdata)
+{
+ return 0;
+}
+
+static s32 select_cpu_stub(struct task_struct *p, s32 prev_cpu, u64 wake_flags) { return -EINVAL; }
+static void enqueue_stub(struct task_struct *p, u64 enq_flags) {}
+static void dequeue_stub(struct task_struct *p, u64 enq_flags) {}
+static void dispatch_stub(s32 prev_cpu, struct task_struct *p) {}
+static bool yield_stub(struct task_struct *from, struct task_struct *to) { return false; }
+static void set_weight_stub(struct task_struct *p, u32 weight) {}
+static void set_cpumask_stub(struct task_struct *p, const struct cpumask *mask) {}
+static void update_idle_stub(s32 cpu, bool idle) {}
+static s32 init_task_stub(struct task_struct *p, struct scx_init_task_args *args) { return -EINVAL; }
+static void exit_task_stub(struct task_struct *p, struct scx_exit_task_args *args) {}
+static void enable_stub(struct task_struct *p) {}
+static void disable_stub(struct task_struct *p) {}
+static s32 init_stub(void) { return -EINVAL; }
+static void exit_stub(struct scx_exit_info *info) {}
+
+static struct sched_ext_ops __bpf_ops_sched_ext_ops = {
+ .select_cpu = select_cpu_stub,
+ .enqueue = enqueue_stub,
+ .dequeue = dequeue_stub,
+ .dispatch = dispatch_stub,
+ .yield = yield_stub,
+ .set_weight = set_weight_stub,
+ .set_cpumask = set_cpumask_stub,
+ .update_idle = update_idle_stub,
+ .init_task = init_task_stub,
+ .exit_task = exit_task_stub,
+ .enable = enable_stub,
+ .disable = disable_stub,
+ .init = init_stub,
+ .exit = exit_stub,
+};
+
+static struct bpf_struct_ops bpf_sched_ext_ops = {
+ .verifier_ops = &bpf_scx_verifier_ops,
+ .reg = bpf_scx_reg,
+ .unreg = bpf_scx_unreg,
+ .check_member = bpf_scx_check_member,
+ .init_member = bpf_scx_init_member,
+ .init = bpf_scx_init,
+ .update = bpf_scx_update,
+ .validate = bpf_scx_validate,
+ .name = "sched_ext_ops",
+ .owner = THIS_MODULE,
+ .cfi_stubs = &__bpf_ops_sched_ext_ops
+};
+
+
+/********************************************************************************
+ * System integration and init.
+ */
+
+void __init init_sched_ext_class(void)
+{
+ s32 cpu, v;
+
+ /*
+ * The following is to prevent the compiler from optimizing out the enum
+ * definitions so that BPF scheduler implementations can use them
+ * through the generated vmlinux.h.
+ */
+ WRITE_ONCE(v, SCX_ENQ_WAKEUP | SCX_DEQ_SLEEP);
+
+ BUG_ON(rhashtable_init(&dsq_hash, &dsq_hash_params));
+ init_dsq(&scx_dsq_global, SCX_DSQ_GLOBAL);
+#ifdef CONFIG_SMP
+ BUG_ON(!alloc_cpumask_var(&idle_masks.cpu, GFP_KERNEL));
+ BUG_ON(!alloc_cpumask_var(&idle_masks.smt, GFP_KERNEL));
+#endif
+ for_each_possible_cpu(cpu) {
+ struct rq *rq = cpu_rq(cpu);
+
+ init_dsq(&rq->scx.local_dsq, SCX_DSQ_LOCAL);
+ INIT_LIST_HEAD(&rq->scx.runnable_list);
+ }
+}
+
+
+/********************************************************************************
+ * Helpers that can be called from the BPF scheduler.
+ */
+#include <linux/btf_ids.h>
+
+__bpf_kfunc_start_defs();
+
+/**
+ * scx_bpf_create_dsq - Create a custom DSQ
+ * @dsq_id: DSQ to create
+ * @node: NUMA node to allocate from
+ *
+ * Create a custom DSQ identified by @dsq_id. Can be called from ops.init() and
+ * ops.init_task().
+ */
+__bpf_kfunc s32 scx_bpf_create_dsq(u64 dsq_id, s32 node)
+{
+ if (!scx_kf_allowed(SCX_KF_SLEEPABLE))
+ return -EINVAL;
+
+ if (unlikely(node >= (int)nr_node_ids ||
+ (node < 0 && node != NUMA_NO_NODE)))
+ return -EINVAL;
+ return PTR_ERR_OR_ZERO(create_dsq(dsq_id, node));
+}
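+
+/*
+ * Usage sketch (illustrative only, not part of the kernel interface): a BPF
+ * scheduler typically creates its DSQs from a sleepable ops.init(), e.g.
+ * using the BPF_STRUCT_OPS_SLEEPABLE() helper from tools/sched_ext.
+ * my_sched_init and MY_DSQ_ID below are placeholder names.
+ *
+ *	s32 BPF_STRUCT_OPS_SLEEPABLE(my_sched_init)
+ *	{
+ *		return scx_bpf_create_dsq(MY_DSQ_ID, -1);
+ *	}
+ */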
+
+__bpf_kfunc_end_defs();
+
+BTF_KFUNCS_START(scx_kfunc_ids_sleepable)
+BTF_ID_FLAGS(func, scx_bpf_create_dsq, KF_SLEEPABLE)
+BTF_KFUNCS_END(scx_kfunc_ids_sleepable)
+
+static const struct btf_kfunc_id_set scx_kfunc_set_sleepable = {
+ .owner = THIS_MODULE,
+ .set = &scx_kfunc_ids_sleepable,
+};
+
+__bpf_kfunc_start_defs();
+
+/**
+ * scx_bpf_select_cpu_dfl - The default implementation of ops.select_cpu()
+ * @p: task_struct to select a CPU for
+ * @prev_cpu: CPU @p was on previously
+ * @wake_flags: %SCX_WAKE_* flags
+ * @is_idle: out parameter indicating whether the returned CPU is idle
+ *
+ * Can only be called from ops.select_cpu() if the built-in CPU selection is
+ * enabled - ops.update_idle() is missing or %SCX_OPS_KEEP_BUILTIN_IDLE is set.
+ * @p, @prev_cpu and @wake_flags match ops.select_cpu().
+ *
+ * Returns the picked CPU with *@is_idle indicating whether the picked CPU is
+ * currently idle and thus a good candidate for direct dispatching.
+ */
+__bpf_kfunc s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
+ u64 wake_flags, bool *is_idle)
+{
+ if (!scx_kf_allowed(SCX_KF_SELECT_CPU)) {
+ *is_idle = false;
+ return prev_cpu;
+ }
+#ifdef CONFIG_SMP
+ return scx_select_cpu_dfl(p, prev_cpu, wake_flags, is_idle);
+#else
+ *is_idle = false;
+ return prev_cpu;
+#endif
+}
+
+__bpf_kfunc_end_defs();
+
+BTF_KFUNCS_START(scx_kfunc_ids_select_cpu)
+BTF_ID_FLAGS(func, scx_bpf_select_cpu_dfl, KF_RCU)
+BTF_KFUNCS_END(scx_kfunc_ids_select_cpu)
+
+static const struct btf_kfunc_id_set scx_kfunc_set_select_cpu = {
+ .owner = THIS_MODULE,
+ .set = &scx_kfunc_ids_select_cpu,
+};
+
+static bool scx_dispatch_preamble(struct task_struct *p, u64 enq_flags)
+{
+ if (!scx_kf_allowed(SCX_KF_ENQUEUE | SCX_KF_DISPATCH))
+ return false;
+
+ lockdep_assert_irqs_disabled();
+
+ if (unlikely(!p)) {
+ scx_ops_error("called with NULL task");
+ return false;
+ }
+
+ if (unlikely(enq_flags & __SCX_ENQ_INTERNAL_MASK)) {
+ scx_ops_error("invalid enq_flags 0x%llx", enq_flags);
+ return false;
+ }
+
+ return true;
+}
+
+static void scx_dispatch_commit(struct task_struct *p, u64 dsq_id, u64 enq_flags)
+{
+ struct task_struct *ddsp_task;
+ int idx;
+
+ ddsp_task = __this_cpu_read(direct_dispatch_task);
+ if (ddsp_task) {
+ mark_direct_dispatch(ddsp_task, p, dsq_id, enq_flags);
+ return;
+ }
+
+ idx = __this_cpu_read(scx_dsp_ctx.buf_cursor);
+ if (unlikely(idx >= scx_dsp_max_batch)) {
+ scx_ops_error("dispatch buffer overflow");
+ return;
+ }
+
+ this_cpu_ptr(scx_dsp_buf)[idx] = (struct scx_dsp_buf_ent){
+ .task = p,
+ .qseq = atomic_long_read(&p->scx.ops_state) & SCX_OPSS_QSEQ_MASK,
+ .dsq_id = dsq_id,
+ .enq_flags = enq_flags,
+ };
+ __this_cpu_inc(scx_dsp_ctx.buf_cursor);
+}
+
+__bpf_kfunc_start_defs();
+
+/**
+ * scx_bpf_dispatch - Dispatch a task into the FIFO queue of a DSQ
+ * @p: task_struct to dispatch
+ * @dsq_id: DSQ to dispatch to
+ * @slice: duration @p can run for in nsecs
+ * @enq_flags: SCX_ENQ_*
+ *
+ * Dispatch @p into the FIFO queue of the DSQ identified by @dsq_id. It is safe
+ * to call this function spuriously. Can be called from ops.enqueue(),
+ * ops.select_cpu(), and ops.dispatch().
+ *
+ * When called from ops.select_cpu() or ops.enqueue(), it's for direct dispatch
+ * and @p must match the task being enqueued. Also, %SCX_DSQ_LOCAL_ON can't be
+ * used to target the local DSQ of a CPU other than the enqueueing one. Use
+ * ops.select_cpu() to be on the target CPU in the first place.
+ *
+ * When called from ops.select_cpu(), @enq_flags and @dsq_id are stored, and @p
+ * will be directly dispatched to the corresponding dispatch queue after
+ * ops.select_cpu() returns. If @p is dispatched to SCX_DSQ_LOCAL, it will be
+ * dispatched to the local DSQ of the CPU returned by ops.select_cpu().
+ * @enq_flags are OR'd with the enqueue flags on the enqueue path before the
+ * task is dispatched.
+ *
+ * When called from ops.dispatch(), there are no restrictions on @p or @dsq_id
+ * and this function can be called up to ops.dispatch_max_batch times to dispatch
+ * multiple tasks. scx_bpf_dispatch_nr_slots() returns the number of the
+ * remaining slots. scx_bpf_consume() flushes the batch and resets the counter.
+ *
+ * This function doesn't have any locking restrictions and may be called under
+ * BPF locks (in the future when BPF introduces more flexible locking).
+ *
+ * @p is allowed to run for @slice. The scheduling path is triggered on slice
+ * exhaustion. If zero, the current residual slice is maintained.
+ */
+__bpf_kfunc void scx_bpf_dispatch(struct task_struct *p, u64 dsq_id, u64 slice,
+ u64 enq_flags)
+{
+ if (!scx_dispatch_preamble(p, enq_flags))
+ return;
+
+ if (slice)
+ p->scx.slice = slice;
+ else
+ p->scx.slice = p->scx.slice ?: 1;
+
+ scx_dispatch_commit(p, dsq_id, enq_flags);
+}
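+
+/*
+ * Usage sketch (illustrative only): direct dispatch from ops.select_cpu()
+ * when an idle CPU was claimed, written with the BPF_STRUCT_OPS() helper from
+ * tools/sched_ext. my_sched_select_cpu is a placeholder name.
+ *
+ *	s32 BPF_STRUCT_OPS(my_sched_select_cpu, struct task_struct *p,
+ *			   s32 prev_cpu, u64 wake_flags)
+ *	{
+ *		bool is_idle = false;
+ *		s32 cpu;
+ *
+ *		cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);
+ *		if (is_idle)
+ *			scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
+ *		return cpu;
+ *	}
+ */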
+
+__bpf_kfunc_end_defs();
+
+BTF_KFUNCS_START(scx_kfunc_ids_enqueue_dispatch)
+BTF_ID_FLAGS(func, scx_bpf_dispatch, KF_RCU)
+BTF_KFUNCS_END(scx_kfunc_ids_enqueue_dispatch)
+
+static const struct btf_kfunc_id_set scx_kfunc_set_enqueue_dispatch = {
+ .owner = THIS_MODULE,
+ .set = &scx_kfunc_ids_enqueue_dispatch,
+};
+
+__bpf_kfunc_start_defs();
+
+/**
+ * scx_bpf_dispatch_nr_slots - Return the number of remaining dispatch slots
+ *
+ * Can only be called from ops.dispatch().
+ */
+__bpf_kfunc u32 scx_bpf_dispatch_nr_slots(void)
+{
+ if (!scx_kf_allowed(SCX_KF_DISPATCH))
+ return 0;
+
+ return scx_dsp_max_batch - __this_cpu_read(scx_dsp_ctx.buf_cursor);
+}
+
+/**
+ * scx_bpf_dispatch_cancel - Cancel the latest dispatch
+ *
+ * Cancel the latest dispatch. Can be called multiple times to cancel further
+ * dispatches. Can only be called from ops.dispatch().
+ */
+__bpf_kfunc void scx_bpf_dispatch_cancel(void)
+{
+ struct scx_dsp_ctx *dspc = this_cpu_ptr(&scx_dsp_ctx);
+
+ if (!scx_kf_allowed(SCX_KF_DISPATCH))
+ return;
+
+ if (dspc->buf_cursor > 0)
+ dspc->buf_cursor--;
+ else
+ scx_ops_error("dispatch buffer underflow");
+}
+
+/**
+ * scx_bpf_consume - Transfer a task from a DSQ to the current CPU's local DSQ
+ * @dsq_id: DSQ to consume
+ *
+ * Consume a task from the non-local DSQ identified by @dsq_id and transfer it
+ * to the current CPU's local DSQ for execution. Can only be called from
+ * ops.dispatch().
+ *
+ * This function flushes the in-flight dispatches from scx_bpf_dispatch() before
+ * trying to consume the specified DSQ. It may also grab rq locks and thus can't
+ * be called under any BPF locks.
+ *
+ * Returns %true if a task has been consumed, %false if there isn't any task to
+ * consume.
+ */
+__bpf_kfunc bool scx_bpf_consume(u64 dsq_id)
+{
+ struct scx_dsp_ctx *dspc = this_cpu_ptr(&scx_dsp_ctx);
+ struct scx_dispatch_q *dsq;
+
+ if (!scx_kf_allowed(SCX_KF_DISPATCH))
+ return false;
+
+ flush_dispatch_buf(dspc->rq, dspc->rf);
+
+ dsq = find_non_local_dsq(dsq_id);
+ if (unlikely(!dsq)) {
+ scx_ops_error("invalid DSQ ID 0x%016llx", dsq_id);
+ return false;
+ }
+
+ if (consume_dispatch_q(dspc->rq, dspc->rf, dsq)) {
+ /*
+ * A successfully consumed task can be dequeued before it starts
+ * running while the CPU is trying to migrate other dispatched
+ * tasks. Bump nr_tasks to tell balance_scx() to retry on empty
+ * local DSQ.
+ */
+ dspc->nr_tasks++;
+ return true;
+ } else {
+ return false;
+ }
+}
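+
+/*
+ * Usage sketch (illustrative only): a minimal ops.dispatch() which makes the
+ * CPU run whatever is queued on a shared DSQ created by the scheduler.
+ * my_sched_dispatch and MY_DSQ_ID are placeholder names.
+ *
+ *	void BPF_STRUCT_OPS(my_sched_dispatch, s32 cpu, struct task_struct *prev)
+ *	{
+ *		scx_bpf_consume(MY_DSQ_ID);
+ *	}
+ */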
+
+__bpf_kfunc_end_defs();
+
+BTF_KFUNCS_START(scx_kfunc_ids_dispatch)
+BTF_ID_FLAGS(func, scx_bpf_dispatch_nr_slots)
+BTF_ID_FLAGS(func, scx_bpf_dispatch_cancel)
+BTF_ID_FLAGS(func, scx_bpf_consume)
+BTF_KFUNCS_END(scx_kfunc_ids_dispatch)
+
+static const struct btf_kfunc_id_set scx_kfunc_set_dispatch = {
+ .owner = THIS_MODULE,
+ .set = &scx_kfunc_ids_dispatch,
+};
+
+__bpf_kfunc_start_defs();
+
+/**
+ * scx_bpf_dsq_nr_queued - Return the number of queued tasks
+ * @dsq_id: id of the DSQ
+ *
+ * Return the number of tasks in the DSQ matching @dsq_id. If not found,
+ * -%ENOENT is returned.
+ */
+__bpf_kfunc s32 scx_bpf_dsq_nr_queued(u64 dsq_id)
+{
+ struct scx_dispatch_q *dsq;
+ s32 ret;
+
+ preempt_disable();
+
+ if (dsq_id == SCX_DSQ_LOCAL) {
+ ret = READ_ONCE(this_rq()->scx.local_dsq.nr);
+ goto out;
+ } else if ((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON) {
+ s32 cpu = dsq_id & SCX_DSQ_LOCAL_CPU_MASK;
+
+ if (ops_cpu_valid(cpu, NULL)) {
+ ret = READ_ONCE(cpu_rq(cpu)->scx.local_dsq.nr);
+ goto out;
+ }
+ } else {
+ dsq = find_non_local_dsq(dsq_id);
+ if (dsq) {
+ ret = READ_ONCE(dsq->nr);
+ goto out;
+ }
+ }
+ ret = -ENOENT;
+out:
+ preempt_enable();
+ return ret;
+}
+
+/**
+ * scx_bpf_destroy_dsq - Destroy a custom DSQ
+ * @dsq_id: DSQ to destroy
+ *
+ * Destroy the custom DSQ identified by @dsq_id. Only DSQs created with
+ * scx_bpf_create_dsq() can be destroyed. The caller must ensure that the DSQ is
+ * empty and no further tasks are dispatched to it. Ignored if called on a DSQ
+ * which doesn't exist. Can be called from any online scx_ops operations.
+ */
+__bpf_kfunc void scx_bpf_destroy_dsq(u64 dsq_id)
+{
+ destroy_dsq(dsq_id);
+}
+
+__bpf_kfunc_end_defs();
+
+struct scx_bpf_error_bstr_bufs {
+ u64 data[MAX_BPRINTF_VARARGS];
+ char msg[SCX_EXIT_MSG_LEN];
+};
+
+static DEFINE_PER_CPU(struct scx_bpf_error_bstr_bufs, scx_bpf_error_bstr_bufs);
+
+static void bpf_exit_bstr_common(enum scx_exit_kind kind, s64 exit_code,
+ char *fmt, unsigned long long *data,
+ u32 data__sz)
+{
+ struct bpf_bprintf_data bprintf_data = { .get_bin_args = true };
+ struct scx_bpf_error_bstr_bufs *bufs;
+ unsigned long flags;
+ int ret;
+
+ local_irq_save(flags);
+ bufs = this_cpu_ptr(&scx_bpf_error_bstr_bufs);
+
+ if (data__sz % 8 || data__sz > MAX_BPRINTF_VARARGS * 8 ||
+ (data__sz && !data)) {
+ scx_ops_error("invalid data=%p and data__sz=%u",
+ (void *)data, data__sz);
+ goto out_restore;
+ }
+
+ ret = copy_from_kernel_nofault(bufs->data, data, data__sz);
+ if (ret) {
+ scx_ops_error("failed to read data fields (%d)", ret);
+ goto out_restore;
+ }
+
+ ret = bpf_bprintf_prepare(fmt, UINT_MAX, bufs->data, data__sz / 8,
+ &bprintf_data);
+ if (ret < 0) {
+ scx_ops_error("failed to format prepration (%d)", ret);
+ goto out_restore;
+ }
+
+ ret = bstr_printf(bufs->msg, sizeof(bufs->msg), fmt,
+ bprintf_data.bin_args);
+ bpf_bprintf_cleanup(&bprintf_data);
+ if (ret < 0) {
+ scx_ops_error("scx_ops_error(\"%s\", %p, %u) failed to format",
+ fmt, data, data__sz);
+ goto out_restore;
+ }
+
+ scx_ops_exit_kind(kind, exit_code, "%s", bufs->msg);
+out_restore:
+ local_irq_restore(flags);
+}
+
+__bpf_kfunc_start_defs();
+
+/**
+ * scx_bpf_exit_bstr - Gracefully exit the BPF scheduler.
+ * @exit_code: Exit value to pass to user space via struct scx_exit_info.
+ * @fmt: error message format string
+ * @data: format string parameters packaged using ___bpf_fill() macro
+ * @data__sz: @data len, must end in '__sz' for the verifier
+ *
+ * Indicate that the BPF scheduler wants to exit gracefully, and initiate ops
+ * disabling.
+ */
+__bpf_kfunc void scx_bpf_exit_bstr(s64 exit_code, char *fmt,
+ unsigned long long *data, u32 data__sz)
+{
+ bpf_exit_bstr_common(SCX_EXIT_UNREG_BPF, exit_code, fmt, data,
+ data__sz);
+}
+
+/**
+ * scx_bpf_error_bstr - Indicate fatal error
+ * @fmt: error message format string
+ * @data: format string parameters packaged using ___bpf_fill() macro
+ * @data__sz: @data len, must end in '__sz' for the verifier
+ *
+ * Indicate that the BPF scheduler encountered a fatal error and initiate ops
+ * disabling.
+ */
+__bpf_kfunc void scx_bpf_error_bstr(char *fmt, unsigned long long *data,
+ u32 data__sz)
+{
+ bpf_exit_bstr_common(SCX_EXIT_ERROR_BPF, 0, fmt, data,
+ data__sz);
+}
+
+/**
+ * scx_bpf_nr_cpu_ids - Return the number of possible CPU IDs
+ *
+ * All valid CPU IDs in the system are smaller than the returned value.
+ */
+__bpf_kfunc u32 scx_bpf_nr_cpu_ids(void)
+{
+ return nr_cpu_ids;
+}
+
+/**
+ * scx_bpf_get_possible_cpumask - Get a referenced kptr to cpu_possible_mask
+ */
+__bpf_kfunc const struct cpumask *scx_bpf_get_possible_cpumask(void)
+{
+ return cpu_possible_mask;
+}
+
+/**
+ * scx_bpf_get_online_cpumask - Get a referenced kptr to cpu_online_mask
+ */
+__bpf_kfunc const struct cpumask *scx_bpf_get_online_cpumask(void)
+{
+ return cpu_online_mask;
+}
+
+/**
+ * scx_bpf_put_cpumask - Release a possible/online cpumask
+ * @cpumask: cpumask to release
+ */
+__bpf_kfunc void scx_bpf_put_cpumask(const struct cpumask *cpumask)
+{
+ /*
+ * Empty function body because we aren't actually acquiring or releasing
+ * a reference to a global cpumask, which is read-only in the caller and
+ * is never released. The acquire / release semantics here are just used
+ * to make the cpumask a trusted pointer in the caller.
+ */
+}
+
+/**
+ * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking
+ * per-CPU cpumask.
+ *
+ * Returns an empty cpumask if idle tracking is not enabled, or running on a
+ * UP kernel.
+ */
+__bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void)
+{
+ if (!static_branch_likely(&scx_builtin_idle_enabled)) {
+ scx_ops_error("built-in idle tracking is disabled");
+ return cpu_none_mask;
+ }
+
+#ifdef CONFIG_SMP
+ return idle_masks.cpu;
+#else
+ return cpu_none_mask;
+#endif
+}
+
+/**
+ * scx_bpf_get_idle_smtmask - Get a referenced kptr to the idle-tracking,
+ * per-physical-core cpumask. Can be used to determine if an entire physical
+ * core is free.
+ *
+ * Returns an empty cpumask if idle tracking is not enabled, or running on a
+ * UP kernel.
+ */
+__bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask(void)
+{
+ if (!static_branch_likely(&scx_builtin_idle_enabled)) {
+ scx_ops_error("built-in idle tracking is disabled");
+ return cpu_none_mask;
+ }
+
+#ifdef CONFIG_SMP
+ if (sched_smt_active())
+ return idle_masks.smt;
+ else
+ return idle_masks.cpu;
+#else
+ return cpu_none_mask;
+#endif
+}
+
+/**
+ * scx_bpf_put_idle_cpumask - Release a previously acquired referenced kptr to
+ * either the percpu, or SMT idle-tracking cpumask.
+ */
+__bpf_kfunc void scx_bpf_put_idle_cpumask(const struct cpumask *idle_mask)
+{
+ /*
+ * Empty function body because we aren't actually acquiring or releasing
+ * a reference to a global idle cpumask, which is read-only in the
+ * caller and is never released. The acquire / release semantics here
+ * are just used to make the cpumask a trusted pointer in the caller.
+ */
+}
+
+/**
+ * scx_bpf_test_and_clear_cpu_idle - Test and clear @cpu's idle state
+ * @cpu: cpu to test and clear idle for
+ *
+ * Returns %true if @cpu was idle and its idle state was successfully cleared.
+ * %false otherwise.
+ *
+ * Unavailable if ops.update_idle() is implemented and
+ * %SCX_OPS_KEEP_BUILTIN_IDLE is not set.
+ */
+__bpf_kfunc bool scx_bpf_test_and_clear_cpu_idle(s32 cpu)
+{
+ if (!static_branch_likely(&scx_builtin_idle_enabled)) {
+ scx_ops_error("built-in idle tracking is disabled");
+ return false;
+ }
+
+ if (ops_cpu_valid(cpu, NULL))
+ return test_and_clear_cpu_idle(cpu);
+ else
+ return false;
+}
+
+/**
+ * scx_bpf_pick_idle_cpu - Pick and claim an idle cpu
+ * @cpus_allowed: Allowed cpumask
+ * @flags: %SCX_PICK_IDLE_CPU_* flags
+ *
+ * Pick and claim an idle cpu in @cpus_allowed. Returns the picked idle cpu
+ * number on success. -%EBUSY if no matching cpu was found.
+ *
+ * Idle CPU tracking may race against CPU scheduling state transitions. For
+ * example, this function may return -%EBUSY as CPUs are transitioning into the
+ * idle state. If the caller then assumes that there will be dispatch events on
+ * the CPUs as they were all busy, the scheduler may end up stalling with CPUs
+ * idling while there are pending tasks. Use scx_bpf_pick_any_cpu() and
+ * scx_bpf_kick_cpu() to guarantee that there will be at least one dispatch
+ * event in the near future.
+ *
+ * Unavailable if ops.update_idle() is implemented and
+ * %SCX_OPS_KEEP_BUILTIN_IDLE is not set.
+ */
+__bpf_kfunc s32 scx_bpf_pick_idle_cpu(const struct cpumask *cpus_allowed,
+ u64 flags)
+{
+ if (!static_branch_likely(&scx_builtin_idle_enabled)) {
+ scx_ops_error("built-in idle tracking is disabled");
+ return -EBUSY;
+ }
+
+ return scx_pick_idle_cpu(cpus_allowed, flags);
+}
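+
+/*
+ * Usage sketch (illustrative only) of the fallback pattern described above:
+ * when no idle CPU can be claimed, pick any allowed CPU and kick it so that a
+ * dispatch event is guaranteed. scx_bpf_kick_cpu() is provided elsewhere in
+ * this series.
+ *
+ *	cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
+ *	if (cpu < 0) {
+ *		cpu = scx_bpf_pick_any_cpu(p->cpus_ptr, 0);
+ *		if (cpu >= 0)
+ *			scx_bpf_kick_cpu(cpu, 0);
+ *	}
+ */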
+
+/**
+ * scx_bpf_pick_any_cpu - Pick and claim an idle cpu if available or pick any CPU
+ * @cpus_allowed: Allowed cpumask
+ * @flags: %SCX_PICK_IDLE_CPU_* flags
+ *
+ * Pick and claim an idle cpu in @cpus_allowed. If none is available, pick any
+ * CPU in @cpus_allowed. Guaranteed to succeed and returns the picked idle cpu
+ * number if @cpus_allowed is not empty. -%EBUSY is returned if @cpus_allowed is
+ * empty.
+ *
+ * If ops.update_idle() is implemented and %SCX_OPS_KEEP_BUILTIN_IDLE is not
+ * set, this function can't tell which CPUs are idle and will always pick any
+ * CPU.
+ */
+__bpf_kfunc s32 scx_bpf_pick_any_cpu(const struct cpumask *cpus_allowed,
+ u64 flags)
+{
+ s32 cpu;
+
+ if (static_branch_likely(&scx_builtin_idle_enabled)) {
+ cpu = scx_pick_idle_cpu(cpus_allowed, flags);
+ if (cpu >= 0)
+ return cpu;
+ }
+
+ cpu = cpumask_any_distribute(cpus_allowed);
+ if (cpu < nr_cpu_ids)
+ return cpu;
+ else
+ return -EBUSY;
+}
+
+/**
+ * scx_bpf_task_running - Is task currently running?
+ * @p: task of interest
+ */
+__bpf_kfunc bool scx_bpf_task_running(const struct task_struct *p)
+{
+ return task_rq(p)->curr == p;
+}
+
+/**
+ * scx_bpf_task_cpu - CPU a task is currently associated with
+ * @p: task of interest
+ */
+__bpf_kfunc s32 scx_bpf_task_cpu(const struct task_struct *p)
+{
+ return task_cpu(p);
+}
+
+__bpf_kfunc_end_defs();
+
+BTF_KFUNCS_START(scx_kfunc_ids_any)
+BTF_ID_FLAGS(func, scx_bpf_dsq_nr_queued)
+BTF_ID_FLAGS(func, scx_bpf_destroy_dsq)
+BTF_ID_FLAGS(func, scx_bpf_exit_bstr, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, scx_bpf_error_bstr, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, scx_bpf_nr_cpu_ids)
+BTF_ID_FLAGS(func, scx_bpf_get_possible_cpumask, KF_ACQUIRE)
+BTF_ID_FLAGS(func, scx_bpf_get_online_cpumask, KF_ACQUIRE)
+BTF_ID_FLAGS(func, scx_bpf_put_cpumask, KF_RELEASE)
+BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask, KF_ACQUIRE)
+BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask, KF_ACQUIRE)
+BTF_ID_FLAGS(func, scx_bpf_put_idle_cpumask, KF_RELEASE)
+BTF_ID_FLAGS(func, scx_bpf_test_and_clear_cpu_idle)
+BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu, KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu, KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_task_running, KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_task_cpu, KF_RCU)
+BTF_KFUNCS_END(scx_kfunc_ids_any)
+
+static const struct btf_kfunc_id_set scx_kfunc_set_any = {
+ .owner = THIS_MODULE,
+ .set = &scx_kfunc_ids_any,
+};
+
+static int __init scx_init(void)
+{
+ int ret;
+
+ /*
+ * kfunc registration can't be done from init_sched_ext_class() as
+ * register_btf_kfunc_id_set() needs most of the system to be up.
+ *
+ * Some kfuncs are context-sensitive and can only be called from
+ * specific SCX ops. They are grouped into BTF sets accordingly.
+ * Unfortunately, BPF currently doesn't have a way of enforcing such
+ * restrictions. Eventually, the verifier should be able to enforce
+ * them. For now, register them the same and make each kfunc explicitly
+ * check using scx_kf_allowed().
+ */
+ if ((ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
+ &scx_kfunc_set_sleepable)) ||
+ (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
+ &scx_kfunc_set_select_cpu)) ||
+ (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
+ &scx_kfunc_set_enqueue_dispatch)) ||
+ (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
+ &scx_kfunc_set_dispatch)) ||
+ (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
+ &scx_kfunc_set_any)) ||
+ (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING,
+ &scx_kfunc_set_any)) ||
+ (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL,
+ &scx_kfunc_set_any))) {
+ pr_err("sched_ext: Failed to register kfunc sets (%d)\n", ret);
+ return ret;
+ }
+
+ ret = register_bpf_struct_ops(&bpf_sched_ext_ops, sched_ext_ops);
+ if (ret) {
+ pr_err("sched_ext: Failed to register struct_ops (%d)\n", ret);
+ return ret;
+ }
+
+ scx_kset = kset_create_and_add("sched_ext", &scx_uevent_ops, kernel_kobj);
+ if (!scx_kset) {
+ pr_err("sched_ext: Failed to create /sys/kernel/sched_ext\n");
+ return -ENOMEM;
+ }
+
+ ret = sysfs_create_group(&scx_kset->kobj, &scx_global_attr_group);
+ if (ret < 0) {
+ pr_err("sched_ext: Failed to add global attributes\n");
+ return ret;
+ }
+
+ return 0;
+}
+__initcall(scx_init);
diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h
index 42bd1e47b304..328abf5445b1 100644
--- a/kernel/sched/ext.h
+++ b/kernel/sched/ext.h
@@ -1,15 +1,85 @@
/* SPDX-License-Identifier: GPL-2.0 */
-
+/*
+ * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2022 Tejun Heo <[email protected]>
+ * Copyright (c) 2022 David Vernet <[email protected]>
+ */
#ifdef CONFIG_SCHED_CLASS_EXT
-#error "NOT IMPLEMENTED YET"
+
+struct sched_enq_and_set_ctx {
+ struct task_struct *p;
+ int queue_flags;
+ bool queued;
+ bool running;
+};
+
+void sched_deq_and_put_task(struct task_struct *p, int queue_flags,
+ struct sched_enq_and_set_ctx *ctx);
+void sched_enq_and_set_task(struct sched_enq_and_set_ctx *ctx);
+
+extern const struct sched_class ext_sched_class;
+
+DECLARE_STATIC_KEY_FALSE(__scx_ops_enabled);
+DECLARE_STATIC_KEY_FALSE(__scx_switched_all);
+#define scx_enabled() static_branch_unlikely(&__scx_ops_enabled)
+#define scx_switched_all() static_branch_unlikely(&__scx_switched_all)
+
+static inline bool task_on_scx(const struct task_struct *p)
+{
+ return scx_enabled() && p->sched_class == &ext_sched_class;
+}
+
+void init_scx_entity(struct sched_ext_entity *scx);
+void scx_pre_fork(struct task_struct *p);
+int scx_fork(struct task_struct *p);
+void scx_post_fork(struct task_struct *p);
+void scx_cancel_fork(struct task_struct *p);
+bool task_should_scx(struct task_struct *p);
+void init_sched_ext_class(void);
+
+static inline u32 scx_cpuperf_target(s32 cpu)
+{
+ /* for now, peg cpus at max performance while enabled */
+ if (scx_enabled())
+ return SCHED_CAPACITY_SCALE;
+ else
+ return 0;
+}
+
+static inline const struct sched_class *next_active_class(const struct sched_class *class)
+{
+ class++;
+ if (scx_switched_all() && class == &fair_sched_class)
+ class++;
+ if (!scx_enabled() && class == &ext_sched_class)
+ class++;
+ return class;
+}
+
+#define for_active_class_range(class, _from, _to) \
+ for (class = (_from); class != (_to); class = next_active_class(class))
+
+#define for_each_active_class(class) \
+ for_active_class_range(class, __sched_class_highest, __sched_class_lowest)
+
+/*
+ * SCX requires a balance() call before every pick_next_task() call including
+ * when waking up from idle.
+ */
+#define for_balance_class_range(class, prev_class, end_class) \
+ for_active_class_range(class, (prev_class) > &ext_sched_class ? \
+ &ext_sched_class : (prev_class), (end_class))
+
#else /* CONFIG_SCHED_CLASS_EXT */

#define scx_enabled() false
+#define scx_switched_all() false

static inline void scx_pre_fork(struct task_struct *p) {}
static inline int scx_fork(struct task_struct *p) { return 0; }
static inline void scx_post_fork(struct task_struct *p) {}
static inline void scx_cancel_fork(struct task_struct *p) {}
+static inline bool task_on_scx(const struct task_struct *p) { return false; }
static inline void init_sched_ext_class(void) {}
static inline u32 scx_cpuperf_target(s32 cpu) { return 0; }

@@ -19,7 +89,13 @@ static inline u32 scx_cpuperf_target(s32 cpu) { return 0; }
#endif /* CONFIG_SCHED_CLASS_EXT */

#if defined(CONFIG_SCHED_CLASS_EXT) && defined(CONFIG_SMP)
-#error "NOT IMPLEMENTED YET"
+void __scx_update_idle(struct rq *rq, bool idle);
+
+static inline void scx_update_idle(struct rq *rq, bool idle)
+{
+ if (scx_enabled())
+ __scx_update_idle(rq, idle);
+}
#else
static inline void scx_update_idle(struct rq *rq, bool idle) {}
#endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index fbd1e9ea8b18..4a55a31250ab 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -174,6 +174,10 @@ static inline int idle_policy(int policy)

static inline int normal_policy(int policy)
{
+#ifdef CONFIG_SCHED_CLASS_EXT
+ if (policy == SCHED_EXT)
+ return true;
+#endif
return policy == SCHED_NORMAL;
}

@@ -704,6 +708,16 @@ struct cfs_rq {
#endif /* CONFIG_FAIR_GROUP_SCHED */
};

+#ifdef CONFIG_SCHED_CLASS_EXT
+struct scx_rq {
+ struct scx_dispatch_q local_dsq;
+ struct list_head runnable_list; /* runnable tasks on this rq */
+ unsigned long ops_qseq;
+ u64 extra_enq_flags; /* see move_task_to_local_dsq() */
+ u32 nr_running;
+};
+#endif /* CONFIG_SCHED_CLASS_EXT */
+
static inline int rt_bandwidth_enabled(void)
{
return sysctl_sched_rt_runtime >= 0;
@@ -1044,6 +1058,9 @@ struct rq {
struct cfs_rq cfs;
struct rt_rq rt;
struct dl_rq dl;
+#ifdef CONFIG_SCHED_CLASS_EXT
+ struct scx_rq scx;
+#endif

#ifdef CONFIG_FAIR_GROUP_SCHED
/* list of leaf cfs_rq on this CPU: */
--
2.44.0


2024-05-01 15:19:29

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 25/39] sched_ext: Add task state tracking operations

Being able to track the task runnable and running state transitions is
useful for a variety of purposes including latency tracking and load factor
calculation.

Currently, BPF schedulers don't have a good way of tracking these
transitions. Becoming runnable can be determined from ops.enqueue() but
becoming quiescent can only be inferred from the lack of subsequent enqueue.
Also, as the local dsq can have multiple tasks and some events are handled
in the sched_ext core, it's difficult to determine when a given task starts
and stops executing.

This patch adds sched_ext_ops.runnable(), .running(), .stopping() and
.quiescent() operations to track the task runnable and running state
transitions. They're mostly self-explanatory; however, we want to ensure
that running <-> stopping transitions are always contained within runnable
<-> quiescent transitions which is a bit different from how the scheduler
core behaves. This adds a bit of complication. See the comment in
dequeue_task_scx().
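
As a rough illustration (not part of this patch), a scheduler could use the
new callbacks to measure per-task queueing latency, written with the
BPF_STRUCT_OPS() helper from tools/sched_ext. task_ctx, lookup_task_ctx() and
record_latency() below are hypothetical helpers backed by a BPF task storage
map:

    void BPF_STRUCT_OPS(my_sched_runnable, struct task_struct *p, u64 enq_flags)
    {
            struct task_ctx *tctx = lookup_task_ctx(p);

            if (tctx)
                    tctx->runnable_at = bpf_ktime_get_ns();
    }

    void BPF_STRUCT_OPS(my_sched_running, struct task_struct *p)
    {
            struct task_ctx *tctx = lookup_task_ctx(p);

            if (tctx && tctx->runnable_at) {
                    record_latency(bpf_ktime_get_ns() - tctx->runnable_at);
                    tctx->runnable_at = 0;
            }
    }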

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
kernel/sched/ext.c | 104 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 104 insertions(+)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 495210cd12f9..8f10fb228a45 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -203,6 +203,71 @@ struct sched_ext_ops {
*/
void (*tick)(struct task_struct *p);

+ /**
+ * runnable - A task is becoming runnable on its associated CPU
+ * @p: task becoming runnable
+ * @enq_flags: %SCX_ENQ_*
+ *
+ * This and the following three functions can be used to track a task's
+ * execution state transitions. A task becomes ->runnable() on a CPU,
+ * and then goes through one or more ->running() and ->stopping() pairs
+ * as it runs on the CPU, and eventually becomes ->quiescent() when it's
+ * done running on the CPU.
+ *
+ * @p is becoming runnable on the CPU because it's
+ *
+ * - waking up (%SCX_ENQ_WAKEUP)
+ * - being moved from another CPU
+ * - being restored after temporarily taken off the queue for an
+ * attribute change.
+ *
+ * This and ->enqueue() are related but not coupled. This operation
+ * notifies @p's state transition and may not be followed by ->enqueue()
+ * e.g. when @p is being dispatched to a remote CPU. Likewise, a task
+ * may be ->enqueue()'d without being preceded by this operation e.g.
+ * after exhausting its slice.
+ */
+ void (*runnable)(struct task_struct *p, u64 enq_flags);
+
+ /**
+ * running - A task is starting to run on its associated CPU
+ * @p: task starting to run
+ *
+ * See ->runnable() for explanation on the task state notifiers.
+ */
+ void (*running)(struct task_struct *p);
+
+ /**
+ * stopping - A task is stopping execution
+ * @p: task stopping to run
+ * @runnable: is task @p still runnable?
+ *
+ * See ->runnable() for explanation on the task state notifiers. If
+ * !@runnable, ->quiescent() will be invoked after this operation
+ * returns.
+ */
+ void (*stopping)(struct task_struct *p, bool runnable);
+
+ /**
+ * quiescent - A task is becoming not runnable on its associated CPU
+ * @p: task becoming not runnable
+ * @deq_flags: %SCX_DEQ_*
+ *
+ * See ->runnable() for explanation on the task state notifiers.
+ *
+ * @p is becoming quiescent on the CPU because it's
+ *
+ * - sleeping (%SCX_DEQ_SLEEP)
+ * - being moved to another CPU
+ * - being temporarily taken off the queue for an attribute change
+ * (%SCX_DEQ_SAVE)
+ *
+ * This and ->dequeue() are related but not coupled. This operation
+ * notifies @p's state transition and may not be preceded by ->dequeue()
+ * e.g. when @p is being dispatched to a remote CPU.
+ */
+ void (*quiescent)(struct task_struct *p, u64 deq_flags);
+
/**
* yield - Yield CPU
* @from: yielding task
@@ -1289,6 +1354,9 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags
rq->scx.nr_running++;
add_nr_running(rq, 1);

+ if (SCX_HAS_OP(runnable))
+ SCX_CALL_OP(SCX_KF_REST, runnable, p, enq_flags);
+
do_enqueue_task(rq, p, enq_flags, sticky_cpu);
}

@@ -1350,6 +1418,26 @@ static void dequeue_task_scx(struct rq *rq, struct task_struct *p, int deq_flags

ops_dequeue(p, deq_flags);

+ /*
+ * A currently running task which is going off @rq first gets dequeued
+ * and then stops running. As we want running <-> stopping transitions
+ * to be contained within runnable <-> quiescent transitions, trigger
+ * ->stopping() early here instead of in put_prev_task_scx().
+ *
+ * @p may go through multiple stopping <-> running transitions between
+ * here and put_prev_task_scx() if task attribute changes occur while
+ * balance_scx() leaves @rq unlocked. However, they don't contain any
+ * information meaningful to the BPF scheduler and can be suppressed by
+ * skipping the callbacks if the task is !QUEUED.
+ */
+ if (SCX_HAS_OP(stopping) && task_current(rq, p)) {
+ update_curr_scx(rq);
+ SCX_CALL_OP(SCX_KF_REST, stopping, p, false);
+ }
+
+ if (SCX_HAS_OP(quiescent))
+ SCX_CALL_OP(SCX_KF_REST, quiescent, p, deq_flags);
+
if (deq_flags & SCX_DEQ_SLEEP)
p->scx.flags |= SCX_TASK_DEQD_FOR_SLEEP;
else
@@ -1933,6 +2021,10 @@ static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first)

p->se.exec_start = rq_clock_task(rq);

+ /* see dequeue_task_scx() on why we skip when !QUEUED */
+ if (SCX_HAS_OP(running) && (p->scx.flags & SCX_TASK_QUEUED))
+ SCX_CALL_OP(SCX_KF_REST, running, p);
+
clr_task_runnable(p, true);
}

@@ -1971,6 +2063,10 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p)

update_curr_scx(rq);

+ /* see dequeue_task_scx() on why we skip when !QUEUED */
+ if (SCX_HAS_OP(stopping) && (p->scx.flags & SCX_TASK_QUEUED))
+ SCX_CALL_OP(SCX_KF_REST, stopping, p, true);
+
/*
* If we're being called from put_prev_task_balance(), balance_scx() may
* have decided that @p should keep running.
@@ -3830,6 +3926,10 @@ static s32 select_cpu_stub(struct task_struct *p, s32 prev_cpu, u64 wake_flags)
static void enqueue_stub(struct task_struct *p, u64 enq_flags) {}
static void dequeue_stub(struct task_struct *p, u64 enq_flags) {}
static void dispatch_stub(s32 prev_cpu, struct task_struct *p) {}
+static void runnable_stub(struct task_struct *p, u64 enq_flags) {}
+static void running_stub(struct task_struct *p) {}
+static void stopping_stub(struct task_struct *p, bool runnable) {}
+static void quiescent_stub(struct task_struct *p, u64 deq_flags) {}
static bool yield_stub(struct task_struct *from, struct task_struct *to) { return false; }
static void set_weight_stub(struct task_struct *p, u32 weight) {}
static void set_cpumask_stub(struct task_struct *p, const struct cpumask *mask) {}
@@ -3846,6 +3946,10 @@ static struct sched_ext_ops __bpf_ops_sched_ext_ops = {
.enqueue = enqueue_stub,
.dequeue = dequeue_stub,
.dispatch = dispatch_stub,
+ .runnable = runnable_stub,
+ .running = running_stub,
+ .stopping = stopping_stub,
+ .quiescent = quiescent_stub,
.yield = yield_stub,
.set_weight = set_weight_stub,
.set_cpumask = set_cpumask_stub,
--
2.44.0


2024-05-01 15:20:03

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 26/39] sched_ext: Implement tickless support

Allow BPF schedulers to indicate tickless operation by setting p->scx.slice
to SCX_SLICE_INF. A CPU whose current task has an infinite slice goes into
tickless operation.
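
For example (illustrative only), a scheduler that wants a CPU to keep running
a task without tick-driven preemption would dispatch it along the lines of:

    scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_INF, enq_flags);

and then trigger rescheduling explicitly, e.g. via scx_bpf_kick_cpu() from a
BPF timer as scx_central does below.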

scx_central is updated to use tickless operations for all tasks and
instead use a BPF timer to expire slices. This also uses the SCX_ENQ_PREEMPT
and task state tracking added by the previous patches.

The timer is pinned to the central CPU with BPF_F_TIMER_CPU_PIN where
available; on kernels without it, the timer may end up on one of the worker
CPUs. Outside of that, the worker CPUs can go tickless both while running
sched_ext tasks and idling.

With schbench running, scx_central shows:

root@test ~# grep ^LOC /proc/interrupts; sleep 10; grep ^LOC /proc/interrupts
LOC: 142024 656 664 449 Local timer interrupts
LOC: 161663 663 665 449 Local timer interrupts

Without it:

root@test ~ [SIGINT]# grep ^LOC /proc/interrupts; sleep 10; grep ^LOC /proc/interrupts
LOC: 188778 3142 3793 3993 Local timer interrupts
LOC: 198993 5314 6323 6438 Local timer interrupts

While scx_central itself is too barebones to be useful as a
production scheduler, a more featureful central scheduler can be built using
the same approach. Google's experience shows that such an approach can have
significant benefits for certain applications such as VM hosting.

v4: Allow operation even if BPF_F_TIMER_CPU_PIN is not available.

v3: Pin the central scheduler's timer on the central_cpu using
BPF_F_TIMER_CPU_PIN.

v2: Convert to BPF inline iterators.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
include/linux/sched/ext.h | 1 +
kernel/sched/core.c | 9 +-
kernel/sched/ext.c | 52 +++++++++-
kernel/sched/ext.h | 2 +
kernel/sched/sched.h | 1 +
tools/sched_ext/scx_central.bpf.c | 159 ++++++++++++++++++++++++++++--
tools/sched_ext/scx_central.c | 29 +++++-
7 files changed, 241 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index 4be270d02b98..218bba9dcf34 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -16,6 +16,7 @@ enum scx_public_consts {
SCX_OPS_NAME_LEN = 128,

SCX_SLICE_DFL = 20 * NSEC_PER_MSEC,
+ SCX_SLICE_INF = U64_MAX, /* infinite, implies nohz */
};

/*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 42fe654bf946..667527603bea 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1255,13 +1255,16 @@ bool sched_can_stop_tick(struct rq *rq)
return true;

/*
- * If there are no DL,RR/FIFO tasks, there must only be CFS tasks left;
- * if there's more than one we need the tick for involuntary
- * preemption.
+ * If there are no DL,RR/FIFO tasks, there must only be CFS or SCX tasks
+ * left. For CFS, if there's more than one we need the tick for
+ * involuntary preemption. For SCX, ask.
*/
if (!scx_switched_all() && rq->nr_running > 1)
return false;

+ if (scx_enabled() && !scx_can_stop_tick(rq))
+ return false;
+
/*
* If there is one task and it has CFS runtime bandwidth constraints
* and it's on the cpu now we don't want to stop the tick.
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 8f10fb228a45..68b364c1f613 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1010,7 +1010,8 @@ static void update_curr_scx(struct rq *rq)
account_group_exec_runtime(curr, delta_exec);
cgroup_account_cputime(curr, delta_exec);

- curr->scx.slice -= min(curr->scx.slice, delta_exec);
+ if (curr->scx.slice != SCX_SLICE_INF)
+ curr->scx.slice -= min(curr->scx.slice, delta_exec);
}

static void dsq_mod_nr(struct scx_dispatch_q *dsq, s32 delta)
@@ -2026,6 +2027,28 @@ static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first)
SCX_CALL_OP(SCX_KF_REST, running, p);

clr_task_runnable(p, true);
+
+ /*
+ * @p is getting newly scheduled or got kicked after someone updated its
+ * slice. Refresh whether tick can be stopped. See scx_can_stop_tick().
+ */
+ if ((p->scx.slice == SCX_SLICE_INF) !=
+ (bool)(rq->scx.flags & SCX_RQ_CAN_STOP_TICK)) {
+ if (p->scx.slice == SCX_SLICE_INF)
+ rq->scx.flags |= SCX_RQ_CAN_STOP_TICK;
+ else
+ rq->scx.flags &= ~SCX_RQ_CAN_STOP_TICK;
+
+ sched_update_tick_dependency(rq);
+
+ /*
+ * For now, let's refresh the load_avgs just when transitioning
+ * in and out of nohz. In the future, we might want to add a
+ * mechanism which calls the following periodically on
+ * tick-stopped CPUs.
+ */
+ update_other_load_avgs(rq);
+ }
}

static void put_prev_task_scx(struct rq *rq, struct task_struct *p)
@@ -2751,6 +2774,26 @@ int scx_check_setscheduler(struct task_struct *p, int policy)
return 0;
}

+#ifdef CONFIG_NO_HZ_FULL
+bool scx_can_stop_tick(struct rq *rq)
+{
+ struct task_struct *p = rq->curr;
+
+ if (scx_ops_bypassing())
+ return false;
+
+ if (p->sched_class != &ext_sched_class)
+ return true;
+
+ /*
+ * @rq can dispatch from different DSQs, so we can't tell whether it
+ * needs the tick or not by looking at nr_running. Allow stopping ticks
+ * iff the BPF scheduler indicated so. See set_next_task_scx().
+ */
+ return rq->scx.flags & SCX_RQ_CAN_STOP_TICK;
+}
+#endif
+
/*
* Omitted operations:
*
@@ -3049,6 +3092,9 @@ static void scx_ops_bypass(bool bypass)
}

rq_unlock_irqrestore(rq, &rf);
+
+ /* kick to restore ticks */
+ resched_cpu(cpu);
}

out_unlock:
@@ -4310,7 +4356,9 @@ __bpf_kfunc_start_defs();
* BPF locks (in the future when BPF introduces more flexible locking).
*
* @p is allowed to run for @slice. The scheduling path is triggered on slice
- * exhaustion. If zero, the current residual slice is maintained.
+ * exhaustion. If zero, the current residual slice is maintained. If
+ * %SCX_SLICE_INF, @p never expires and the BPF scheduler must kick the CPU with
+ * scx_bpf_kick_cpu() to trigger scheduling.
*/
__bpf_kfunc void scx_bpf_dispatch(struct task_struct *p, u64 dsq_id, u64 slice,
u64 enq_flags)
diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h
index 2ea6c19d2462..954ae4c2b53d 100644
--- a/kernel/sched/ext.h
+++ b/kernel/sched/ext.h
@@ -36,6 +36,7 @@ int scx_fork(struct task_struct *p);
void scx_post_fork(struct task_struct *p);
void scx_cancel_fork(struct task_struct *p);
int scx_check_setscheduler(struct task_struct *p, int policy);
+bool scx_can_stop_tick(struct rq *rq);
bool task_should_scx(struct task_struct *p);
void init_sched_ext_class(void);

@@ -83,6 +84,7 @@ static inline int scx_fork(struct task_struct *p) { return 0; }
static inline void scx_post_fork(struct task_struct *p) {}
static inline void scx_cancel_fork(struct task_struct *p) {}
static inline int scx_check_setscheduler(struct task_struct *p, int policy) { return 0; }
+static inline bool scx_can_stop_tick(struct rq *rq) { return true; }
static inline bool task_on_scx(const struct task_struct *p) { return false; }
static inline void init_sched_ext_class(void) {}
static inline u32 scx_cpuperf_target(s32 cpu) { return 0; }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2ce8cd64fa65..c6c9b46eeacc 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -712,6 +712,7 @@ struct cfs_rq {
/* scx_rq->flags, protected by the rq lock */
enum scx_rq_flags {
SCX_RQ_BALANCING = 1 << 0,
+ SCX_RQ_CAN_STOP_TICK = 1 << 1,
};

struct scx_rq {
diff --git a/tools/sched_ext/scx_central.bpf.c b/tools/sched_ext/scx_central.bpf.c
index 3d980375a058..1ab8a42edbe7 100644
--- a/tools/sched_ext/scx_central.bpf.c
+++ b/tools/sched_ext/scx_central.bpf.c
@@ -13,7 +13,26 @@
* through per-CPU BPF queues. The current design is chosen to maximally
* utilize and verify various SCX mechanisms such as LOCAL_ON dispatching.
*
- * b. Preemption
+ * b. Tickless operation
+ *
+ * All tasks are dispatched with the infinite slice which allows stopping the
+ * ticks on CONFIG_NO_HZ_FULL kernels running with the proper nohz_full
+ * parameter. The tickless operation can be observed through
+ * /proc/interrupts.
+ *
+ * Periodic switching is enforced by a periodic timer checking all CPUs and
+ * preempting them as necessary. The timer is pinned to the central CPU with
+ * BPF_F_TIMER_CPU_PIN when available; on kernels without that flag, it may
+ * run on a non-central CPU.
+ *
+ * c. Preemption
+ *
+ * Kthreads are unconditionally queued to the head of a matching local dsq
+ * and dispatched with SCX_ENQ_PREEMPT. This ensures that a kthread is always
+ * prioritized over user threads, which is required for ensuring forward
+ * progress as e.g. the periodic timer may run on a ksoftirqd and if the
+ * ksoftirqd gets starved by a user thread, there may not be anything else to
+ * vacate that user thread.
*
* SCX_KICK_PREEMPT is used to trigger scheduling and CPUs to move to the
* next tasks.
@@ -32,14 +51,17 @@ char _license[] SEC("license") = "GPL";

enum {
FALLBACK_DSQ_ID = 0,
+ MS_TO_NS = 1000LLU * 1000,
+ TIMER_INTERVAL_NS = 1 * MS_TO_NS,
};

const volatile s32 central_cpu;
const volatile u32 nr_cpu_ids = 1; /* !0 for veristat, set during init */
const volatile u64 slice_ns = SCX_SLICE_DFL;

+bool timer_pinned = true;
u64 nr_total, nr_locals, nr_queued, nr_lost_pids;
-u64 nr_dispatches, nr_mismatches, nr_retries;
+u64 nr_timers, nr_dispatches, nr_mismatches, nr_retries;
u64 nr_overflows;

UEI_DEFINE(uei);
@@ -52,6 +74,23 @@ struct {

/* can't use percpu map due to bad lookups */
bool RESIZABLE_ARRAY(data, cpu_gimme_task);
+u64 RESIZABLE_ARRAY(data, cpu_started_at);
+
+struct central_timer {
+ struct bpf_timer timer;
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __uint(max_entries, 1);
+ __type(key, u32);
+ __type(value, struct central_timer);
+} central_timer SEC(".maps");
+
+static bool vtime_before(u64 a, u64 b)
+{
+ return (s64)(a - b) < 0;
+}

s32 BPF_STRUCT_OPS(central_select_cpu, struct task_struct *p,
s32 prev_cpu, u64 wake_flags)
@@ -71,9 +110,22 @@ void BPF_STRUCT_OPS(central_enqueue, struct task_struct *p, u64 enq_flags)

__sync_fetch_and_add(&nr_total, 1);

+ /*
+ * Push per-cpu kthreads at the head of local dsq's and preempt the
+ * corresponding CPU. This ensures that e.g. ksoftirqd isn't blocked
+ * behind other threads which is necessary for forward progress
+ * guarantee as we depend on the BPF timer which may run from ksoftirqd.
+ */
+ if ((p->flags & PF_KTHREAD) && p->nr_cpus_allowed == 1) {
+ __sync_fetch_and_add(&nr_locals, 1);
+ scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_INF,
+ enq_flags | SCX_ENQ_PREEMPT);
+ return;
+ }
+
if (bpf_map_push_elem(&central_q, &pid, 0)) {
__sync_fetch_and_add(&nr_overflows, 1);
- scx_bpf_dispatch(p, FALLBACK_DSQ_ID, SCX_SLICE_DFL, enq_flags);
+ scx_bpf_dispatch(p, FALLBACK_DSQ_ID, SCX_SLICE_INF, enq_flags);
return;
}

@@ -106,7 +158,7 @@ static bool dispatch_to_cpu(s32 cpu)
*/
if (!bpf_cpumask_test_cpu(cpu, p->cpus_ptr)) {
__sync_fetch_and_add(&nr_mismatches, 1);
- scx_bpf_dispatch(p, FALLBACK_DSQ_ID, SCX_SLICE_DFL, 0);
+ scx_bpf_dispatch(p, FALLBACK_DSQ_ID, SCX_SLICE_INF, 0);
bpf_task_release(p);
/*
* We might run out of dispatch buffer slots if we continue dispatching
@@ -120,7 +172,7 @@ static bool dispatch_to_cpu(s32 cpu)
}

/* dispatch to local and mark that @cpu doesn't need more */
- scx_bpf_dispatch(p, SCX_DSQ_LOCAL_ON | cpu, SCX_SLICE_DFL, 0);
+ scx_bpf_dispatch(p, SCX_DSQ_LOCAL_ON | cpu, SCX_SLICE_INF, 0);

if (cpu != central_cpu)
scx_bpf_kick_cpu(cpu, __COMPAT_SCX_KICK_IDLE);
@@ -188,9 +240,102 @@ void BPF_STRUCT_OPS(central_dispatch, s32 cpu, struct task_struct *prev)
}
}

+void BPF_STRUCT_OPS(central_running, struct task_struct *p)
+{
+ s32 cpu = scx_bpf_task_cpu(p);
+ u64 *started_at = ARRAY_ELEM_PTR(cpu_started_at, cpu, nr_cpu_ids);
+ if (started_at)
+ *started_at = bpf_ktime_get_ns() ?: 1; /* 0 indicates idle */
+}
+
+void BPF_STRUCT_OPS(central_stopping, struct task_struct *p, bool runnable)
+{
+ s32 cpu = scx_bpf_task_cpu(p);
+ u64 *started_at = ARRAY_ELEM_PTR(cpu_started_at, cpu, nr_cpu_ids);
+ if (started_at)
+ *started_at = 0;
+}
+
+static int central_timerfn(void *map, int *key, struct bpf_timer *timer)
+{
+ u64 now = bpf_ktime_get_ns();
+ u64 nr_to_kick = nr_queued;
+ s32 i, curr_cpu;
+
+ curr_cpu = bpf_get_smp_processor_id();
+ if (timer_pinned && (curr_cpu != central_cpu)) {
+ scx_bpf_error("Central timer ran on CPU %d, not central CPU %d",
+ curr_cpu, central_cpu);
+ return 0;
+ }
+
+ bpf_for(i, 0, nr_cpu_ids) {
+ s32 cpu = (nr_timers + i) % nr_cpu_ids;
+ u64 *started_at;
+
+ if (cpu == central_cpu)
+ continue;
+
+ /* kick iff the current one exhausted its slice */
+ started_at = ARRAY_ELEM_PTR(cpu_started_at, cpu, nr_cpu_ids);
+ if (started_at && *started_at &&
+ vtime_before(now, *started_at + slice_ns))
+ continue;
+
+ /* and there's something pending */
+ if (scx_bpf_dsq_nr_queued(FALLBACK_DSQ_ID) ||
+ scx_bpf_dsq_nr_queued(SCX_DSQ_LOCAL_ON | cpu))
+ ;
+ else if (nr_to_kick)
+ nr_to_kick--;
+ else
+ continue;
+
+ scx_bpf_kick_cpu(cpu, SCX_KICK_PREEMPT);
+ }
+
+ bpf_timer_start(timer, TIMER_INTERVAL_NS, BPF_F_TIMER_CPU_PIN);
+ __sync_fetch_and_add(&nr_timers, 1);
+ return 0;
+}
+
int BPF_STRUCT_OPS_SLEEPABLE(central_init)
{
- return scx_bpf_create_dsq(FALLBACK_DSQ_ID, -1);
+ u32 key = 0;
+ struct bpf_timer *timer;
+ int ret;
+
+ ret = scx_bpf_create_dsq(FALLBACK_DSQ_ID, -1);
+ if (ret)
+ return ret;
+
+ timer = bpf_map_lookup_elem(&central_timer, &key);
+ if (!timer)
+ return -ESRCH;
+
+ if (bpf_get_smp_processor_id() != central_cpu) {
+ scx_bpf_error("init from non-central CPU");
+ return -EINVAL;
+ }
+
+ bpf_timer_init(timer, &central_timer, CLOCK_MONOTONIC);
+ bpf_timer_set_callback(timer, central_timerfn);
+
+ ret = bpf_timer_start(timer, TIMER_INTERVAL_NS, BPF_F_TIMER_CPU_PIN);
+ /*
+ * BPF_F_TIMER_CPU_PIN is pretty new (>=6.7). If we're running in a
+ * kernel which doesn't have it, bpf_timer_start() will return -EINVAL.
+ * Retry without the PIN. This would be the perfect use case for
+ * bpf_core_enum_value_exists() but the enum type doesn't have a name
+ * and can't be used with bpf_core_enum_value_exists(). Oh well...
+ */
+ if (ret == -EINVAL) {
+ timer_pinned = false;
+ ret = bpf_timer_start(timer, TIMER_INTERVAL_NS, 0);
+ }
+ if (ret)
+ scx_bpf_error("bpf_timer_start failed (%d)", ret);
+ return ret;
}

void BPF_STRUCT_OPS(central_exit, struct scx_exit_info *ei)
@@ -209,6 +354,8 @@ SCX_OPS_DEFINE(central_ops,
.select_cpu = (void *)central_select_cpu,
.enqueue = (void *)central_enqueue,
.dispatch = (void *)central_dispatch,
+ .running = (void *)central_running,
+ .stopping = (void *)central_stopping,
.init = (void *)central_init,
.exit = (void *)central_exit,
.name = "central");
diff --git a/tools/sched_ext/scx_central.c b/tools/sched_ext/scx_central.c
index 02cd983a5287..2908add16880 100644
--- a/tools/sched_ext/scx_central.c
+++ b/tools/sched_ext/scx_central.c
@@ -48,6 +48,7 @@ int main(int argc, char **argv)
struct bpf_link *link;
__u64 seq = 0;
__s32 opt;
+ cpu_set_t *cpuset;

libbpf_set_print(libbpf_print_fn);
signal(SIGINT, sigint_handler);
@@ -77,10 +78,35 @@ int main(int argc, char **argv)

/* Resize arrays so their element count is equal to cpu count. */
RESIZE_ARRAY(data, cpu_gimme_task, skel->rodata->nr_cpu_ids);
+ RESIZE_ARRAY(data, cpu_started_at, skel->rodata->nr_cpu_ids);

SCX_OPS_LOAD(skel, central_ops, scx_central, uei);
+
+ /*
+ * Affinitize the loading thread to the central CPU, as:
+ * - That's where the BPF timer is first invoked in the BPF program.
+ * - We probably don't want this user space component to take up a core
+ * from a task that would benefit from avoiding preemption on one of
+ * the tickless cores.
+ *
+ * If the kernel doesn't support BPF_F_TIMER_CPU_PIN, it's not
+ * guaranteed that the timer will always be invoked on the central
+ * CPU. In practice, this suffices the majority of the time.
+ */
+ cpuset = CPU_ALLOC(skel->rodata->nr_cpu_ids);
+ SCX_BUG_ON(!cpuset, "Failed to allocate cpuset");
+ CPU_ZERO(cpuset);
+ CPU_SET(skel->rodata->central_cpu, cpuset);
+ SCX_BUG_ON(sched_setaffinity(0, sizeof(cpuset), cpuset),
+ "Failed to affinitize to central CPU %d (max %d)",
+ skel->rodata->central_cpu, skel->rodata->nr_cpu_ids - 1);
+ CPU_FREE(cpuset);
+
link = SCX_OPS_ATTACH(skel, central_ops);

+ if (!skel->data->timer_pinned)
+ printf("WARNING : BPF_F_TIMER_CPU_PIN not available, timer not pinned to central\n");
+
while (!exit_req && !UEI_EXITED(skel, uei)) {
printf("[SEQ %llu]\n", seq++);
printf("total :%10" PRIu64 " local:%10" PRIu64 " queued:%10" PRIu64 " lost:%10" PRIu64 "\n",
@@ -88,7 +114,8 @@ int main(int argc, char **argv)
skel->bss->nr_locals,
skel->bss->nr_queued,
skel->bss->nr_lost_pids);
- printf(" dispatch:%10" PRIu64 " mismatch:%10" PRIu64 " retry:%10" PRIu64 "\n",
+ printf("timer :%10" PRIu64 " dispatch:%10" PRIu64 " mismatch:%10" PRIu64 " retry:%10" PRIu64 "\n",
+ skel->bss->nr_timers,
skel->bss->nr_dispatches,
skel->bss->nr_mismatches,
skel->bss->nr_retries);
--
2.44.0


2024-05-01 15:20:17

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 27/39] sched_ext: Track tasks that are subjects of the in-flight SCX operation

When some SCX operations are in flight, it is known that the subject task's
rq lock is held throughout, which makes it safe to access certain fields of
the task - e.g. its current task_group. We want to add SCX kfunc helpers
that can make use of this guarantee - e.g. to help determine the currently
associated CPU cgroup from the task's current task_group.

As it'd be dangerous to call such a helper on a task which isn't rq lock
protected, the helper should be able to verify the input task and reject
accordingly. This patch adds sched_ext_entity.kf_tasks[], which tracks the
tasks that are currently being operated on by a terminal SCX operation. The
new SCX_CALL_OP_[2]TASK[_RET]() macros can be used when invoking SCX
operations which take tasks as arguments, and scx_kf_allowed_on_arg_tasks()
can be used by kfunc helpers to verify the input task status.

Note that as sched_ext_entity.kf_tasks[] can't handle nesting, the tracking
is currently limited to terminal SCX operations. If needed in the future,
this restriction can be removed by moving the tracking to the task side with
a couple of per-task counters.
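
For illustration, a kfunc helper built on top of this would follow roughly
the pattern below. This is only a sketch - scx_bpf_example_task_node() and
its body are made up for this explanation; scx_bpf_task_cgroup(), added
later in the series, is the first real user of scx_kf_allowed_on_arg_tasks().

  /* hypothetical: only callable on tasks the in-flight SCX op is operating on */
  __bpf_kfunc s32 scx_bpf_example_task_node(struct task_struct *p)
  {
          /*
           * Reject tasks which aren't arguments of the current rq-locked
           * operation as their rq locks aren't guaranteed to be held.
           */
          if (!scx_kf_allowed_on_arg_tasks(__SCX_KF_RQ_LOCKED, p))
                  return -EPERM;

          /* @p's rq is locked - safe to read rq-protected fields */
          return cpu_to_node(task_cpu(p));
  }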

v2: Updated to reflect the addition of SCX_KF_SELECT_CPU.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
---
include/linux/sched/ext.h | 2 +
kernel/sched/ext.c | 91 +++++++++++++++++++++++++++++++--------
2 files changed, 76 insertions(+), 17 deletions(-)

diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index 218bba9dcf34..bfff0c6caa55 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -106,6 +106,7 @@ enum scx_kf_mask {

__SCX_KF_RQ_LOCKED = SCX_KF_DISPATCH |
SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU | SCX_KF_REST,
+ __SCX_KF_TERMINAL = SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU | SCX_KF_REST,
};

/*
@@ -120,6 +121,7 @@ struct sched_ext_entity {
s32 sticky_cpu;
s32 holding_cpu;
u32 kf_mask; /* see scx_kf_mask above */
+ struct task_struct *kf_tasks[2]; /* see SCX_CALL_OP_TASK() */
atomic_long_t ops_state;

struct list_head runnable_node; /* rq->scx.runnable_list */
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 68b364c1f613..98d977c71a4f 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -749,6 +749,47 @@ do { \
__ret; \
})

+/*
+ * Some kfuncs are allowed only on the tasks that are subjects of the
+ * in-progress scx_ops operation for, e.g., locking guarantees. To enforce such
+ * restrictions, the following SCX_CALL_OP_*() variants should be used when
+ * invoking scx_ops operations that take task arguments. These can only be used
+ * for non-nesting operations due to the way the tasks are tracked.
+ *
+ * kfuncs which can only operate on such tasks can in turn use
+ * scx_kf_allowed_on_arg_tasks() to test whether the invocation is allowed on
+ * the specific task.
+ */
+#define SCX_CALL_OP_TASK(mask, op, task, args...) \
+do { \
+ BUILD_BUG_ON((mask) & ~__SCX_KF_TERMINAL); \
+ current->scx.kf_tasks[0] = task; \
+ SCX_CALL_OP(mask, op, task, ##args); \
+ current->scx.kf_tasks[0] = NULL; \
+} while (0)
+
+#define SCX_CALL_OP_TASK_RET(mask, op, task, args...) \
+({ \
+ __typeof__(scx_ops.op(task, ##args)) __ret; \
+ BUILD_BUG_ON((mask) & ~__SCX_KF_TERMINAL); \
+ current->scx.kf_tasks[0] = task; \
+ __ret = SCX_CALL_OP_RET(mask, op, task, ##args); \
+ current->scx.kf_tasks[0] = NULL; \
+ __ret; \
+})
+
+#define SCX_CALL_OP_2TASKS_RET(mask, op, task0, task1, args...) \
+({ \
+ __typeof__(scx_ops.op(task0, task1, ##args)) __ret; \
+ BUILD_BUG_ON((mask) & ~__SCX_KF_TERMINAL); \
+ current->scx.kf_tasks[0] = task0; \
+ current->scx.kf_tasks[1] = task1; \
+ __ret = SCX_CALL_OP_RET(mask, op, task0, task1, ##args); \
+ current->scx.kf_tasks[0] = NULL; \
+ current->scx.kf_tasks[1] = NULL; \
+ __ret; \
+})
+
/* @mask is constant, always inline to cull unnecessary branches */
static __always_inline bool scx_kf_allowed(u32 mask)
{
@@ -778,6 +819,22 @@ static __always_inline bool scx_kf_allowed(u32 mask)
return true;
}

+/* see SCX_CALL_OP_TASK() */
+static __always_inline bool scx_kf_allowed_on_arg_tasks(u32 mask,
+ struct task_struct *p)
+{
+ if (!scx_kf_allowed(mask))
+ return false;
+
+ if (unlikely((p != current->scx.kf_tasks[0] &&
+ p != current->scx.kf_tasks[1]))) {
+ scx_ops_error("called on a task not being operated on");
+ return false;
+ }
+
+ return true;
+}
+

/*
* SCX task iterator.
@@ -1271,7 +1328,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
WARN_ON_ONCE(*ddsp_taskp);
*ddsp_taskp = p;

- SCX_CALL_OP(SCX_KF_ENQUEUE, enqueue, p, enq_flags);
+ SCX_CALL_OP_TASK(SCX_KF_ENQUEUE, enqueue, p, enq_flags);

*ddsp_taskp = NULL;
if (p->scx.ddsp_dsq_id != SCX_DSQ_INVALID)
@@ -1356,7 +1413,7 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags
add_nr_running(rq, 1);

if (SCX_HAS_OP(runnable))
- SCX_CALL_OP(SCX_KF_REST, runnable, p, enq_flags);
+ SCX_CALL_OP_TASK(SCX_KF_REST, runnable, p, enq_flags);

do_enqueue_task(rq, p, enq_flags, sticky_cpu);
}
@@ -1382,7 +1439,7 @@ static void ops_dequeue(struct task_struct *p, u64 deq_flags)
BUG();
case SCX_OPSS_QUEUED:
if (SCX_HAS_OP(dequeue))
- SCX_CALL_OP(SCX_KF_REST, dequeue, p, deq_flags);
+ SCX_CALL_OP_TASK(SCX_KF_REST, dequeue, p, deq_flags);

if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss,
SCX_OPSS_NONE))
@@ -1433,11 +1490,11 @@ static void dequeue_task_scx(struct rq *rq, struct task_struct *p, int deq_flags
*/
if (SCX_HAS_OP(stopping) && task_current(rq, p)) {
update_curr_scx(rq);
- SCX_CALL_OP(SCX_KF_REST, stopping, p, false);
+ SCX_CALL_OP_TASK(SCX_KF_REST, stopping, p, false);
}

if (SCX_HAS_OP(quiescent))
- SCX_CALL_OP(SCX_KF_REST, quiescent, p, deq_flags);
+ SCX_CALL_OP_TASK(SCX_KF_REST, quiescent, p, deq_flags);

if (deq_flags & SCX_DEQ_SLEEP)
p->scx.flags |= SCX_TASK_DEQD_FOR_SLEEP;
@@ -1456,7 +1513,7 @@ static void yield_task_scx(struct rq *rq)
struct task_struct *p = rq->curr;

if (SCX_HAS_OP(yield))
- SCX_CALL_OP_RET(SCX_KF_REST, yield, p, NULL);
+ SCX_CALL_OP_2TASKS_RET(SCX_KF_REST, yield, p, NULL);
else
p->scx.slice = 0;
}
@@ -1466,7 +1523,7 @@ static bool yield_to_task_scx(struct rq *rq, struct task_struct *to)
struct task_struct *from = rq->curr;

if (SCX_HAS_OP(yield))
- return SCX_CALL_OP_RET(SCX_KF_REST, yield, from, to);
+ return SCX_CALL_OP_2TASKS_RET(SCX_KF_REST, yield, from, to);
else
return false;
}
@@ -2024,7 +2081,7 @@ static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first)

/* see dequeue_task_scx() on why we skip when !QUEUED */
if (SCX_HAS_OP(running) && (p->scx.flags & SCX_TASK_QUEUED))
- SCX_CALL_OP(SCX_KF_REST, running, p);
+ SCX_CALL_OP_TASK(SCX_KF_REST, running, p);

clr_task_runnable(p, true);

@@ -2088,7 +2145,7 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p)

/* see dequeue_task_scx() on why we skip when !QUEUED */
if (SCX_HAS_OP(stopping) && (p->scx.flags & SCX_TASK_QUEUED))
- SCX_CALL_OP(SCX_KF_REST, stopping, p, true);
+ SCX_CALL_OP_TASK(SCX_KF_REST, stopping, p, true);

/*
* If we're being called from put_prev_task_balance(), balance_scx() may
@@ -2310,8 +2367,8 @@ static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flag
WARN_ON_ONCE(*ddsp_taskp);
*ddsp_taskp = p;

- cpu = SCX_CALL_OP_RET(SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU,
- select_cpu, p, prev_cpu, wake_flags);
+ cpu = SCX_CALL_OP_TASK_RET(SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU,
+ select_cpu, p, prev_cpu, wake_flags);
*ddsp_taskp = NULL;
if (ops_cpu_valid(cpu, "from ops.select_cpu()"))
return cpu;
@@ -2344,8 +2401,8 @@ static void set_cpus_allowed_scx(struct task_struct *p,
* designation pointless. Cast it away when calling the operation.
*/
if (SCX_HAS_OP(set_cpumask))
- SCX_CALL_OP(SCX_KF_REST, set_cpumask, p,
- (struct cpumask *)p->cpus_ptr);
+ SCX_CALL_OP_TASK(SCX_KF_REST, set_cpumask, p,
+ (struct cpumask *)p->cpus_ptr);
}

static void reset_idle_masks(void)
@@ -2580,7 +2637,7 @@ static void scx_ops_enable_task(struct task_struct *p)
*/
set_task_scx_weight(p);
if (SCX_HAS_OP(enable))
- SCX_CALL_OP(SCX_KF_REST, enable, p);
+ SCX_CALL_OP_TASK(SCX_KF_REST, enable, p);
scx_set_task_state(p, SCX_TASK_ENABLED);

if (SCX_HAS_OP(set_weight))
@@ -2734,7 +2791,7 @@ static void reweight_task_scx(struct rq *rq, struct task_struct *p, int newprio)

set_task_scx_weight(p);
if (SCX_HAS_OP(set_weight))
- SCX_CALL_OP(SCX_KF_REST, set_weight, p, p->scx.weight);
+ SCX_CALL_OP_TASK(SCX_KF_REST, set_weight, p, p->scx.weight);
}

static void prio_changed_scx(struct rq *rq, struct task_struct *p, int oldprio)
@@ -2750,8 +2807,8 @@ static void switching_to_scx(struct rq *rq, struct task_struct *p)
* different scheduler class. Keep the BPF scheduler up-to-date.
*/
if (SCX_HAS_OP(set_cpumask))
- SCX_CALL_OP(SCX_KF_REST, set_cpumask, p,
- (struct cpumask *)p->cpus_ptr);
+ SCX_CALL_OP_TASK(SCX_KF_REST, set_cpumask, p,
+ (struct cpumask *)p->cpus_ptr);
}

static void switched_from_scx(struct rq *rq, struct task_struct *p)
--
2.44.0


2024-05-01 15:20:25

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 28/39] sched_ext: Add cgroup support

Add sched_ext_ops operations to init/exit cgroups, and track task migrations
and config changes. Because different BPF schedulers may implement different
subsets of CPU control features, allow BPF schedulers to pick which cgroup
interface files to enable using SCX_OPS_CGROUP_KNOB_* flags. For now, only
the weight knobs are supported but adding more should be straightforward.
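
As a rough sketch of the BPF side (the "myops" scheduler and its global
active_weight below are hypothetical, not part of this patch), a scheduler
that only consumes cpu.weight would enable the knob and handle the weight
callbacks along these lines; scx_flatcg later in the series is the complete
example:

  /* toy: a single global weight - a real scheduler tracks this per cgroup */
  u64 active_weight = 100;        /* CGROUP_WEIGHT_DFL */

  s32 BPF_STRUCT_OPS(myops_cgroup_init, struct cgroup *cgrp,
                     struct scx_cgroup_init_args *args)
  {
          active_weight = args->weight;   /* initial weight, [1..10000] */
          return 0;
  }

  void BPF_STRUCT_OPS(myops_cgroup_set_weight, struct cgroup *cgrp, u32 weight)
  {
          active_weight = weight;         /* cpu.weight was updated */
  }

  SCX_OPS_DEFINE(myops_ops,
                 .cgroup_init       = (void *)myops_cgroup_init,
                 .cgroup_set_weight = (void *)myops_cgroup_set_weight,
                 .flags             = SCX_OPS_CGROUP_KNOB_WEIGHT,
                 .name              = "myops");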

While a BPF scheduler is being enabled or disabled, relevant cgroup
operations are locked out using scx_cgroup_rwsem. This avoids situations
like task prep taking place while the task is being moved across cgroups,
making things easier for BPF schedulers.

v5: - Flipped the locking order between scx_cgroup_rwsem and
cpus_read_lock() to avoid locking order conflict w/ cpuset. Better
documentation around locking.

- sched_move_task() takes an early exit if the source and destination
are identical. This triggered the warning in scx_cgroup_can_attach()
as it left p->scx.cgrp_moving_from uncleared. Updated the cgroup
migration path so that ops.cgroup_prep_move() is skipped for identity
migrations so that its invocations always match ops.cgroup_move()
one-to-one.

v4: - Example schedulers moved into their own patches.

- Fix build failure when !CONFIG_CGROUP_SCHED, reported by Andrea Righi.

v3: - Make scx_example_pair switch all tasks by default.

- Convert to BPF inline iterators.

- scx_bpf_task_cgroup() is added to determine the current cgroup from
CPU controller's POV. This allows BPF schedulers to accurately track
CPU cgroup membership.

- scx_example_flatcg added. This demonstrates flattened hierarchy
implementation of CPU cgroup control and shows significant performance
improvement when cgroups which are nested multiple levels are under
competition.

v2: - Build fixes for different CONFIG combinations.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
Reported-by: kernel test robot <[email protected]>
Cc: Andrea Righi <[email protected]>
---
include/linux/sched/ext.h | 3 +
init/Kconfig | 5 +
kernel/sched/core.c | 70 ++-
kernel/sched/ext.c | 524 ++++++++++++++++++++++-
kernel/sched/ext.h | 20 +
kernel/sched/sched.h | 12 +-
tools/sched_ext/include/scx/common.bpf.h | 1 +
7 files changed, 614 insertions(+), 21 deletions(-)

diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index bfff0c6caa55..0a9f8e5a46af 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -157,6 +157,9 @@ struct sched_ext_entity {
bool disallow; /* reject switching into SCX */

/* cold fields */
+#ifdef CONFIG_EXT_GROUP_SCHED
+ struct cgroup *cgrp_moving_from;
+#endif
/* must be the last field, see init_scx_entity() */
struct list_head tasks_node;
};
diff --git a/init/Kconfig b/init/Kconfig
index aa02aec6aa7d..b8fde3e53a77 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1033,6 +1033,11 @@ config RT_GROUP_SCHED
realtime bandwidth for them.
See Documentation/scheduler/sched-rt-group.rst for more information.

+config EXT_GROUP_SCHED
+ bool
+ depends on SCHED_CLASS_EXT && CGROUP_SCHED
+ default y
+
endif #CGROUP_SCHED

config SCHED_MM_CID
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 667527603bea..de49b94844a9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10043,6 +10043,9 @@ void __init sched_init(void)
root_task_group.shares = ROOT_TASK_GROUP_LOAD;
init_cfs_bandwidth(&root_task_group.cfs_bandwidth, NULL);
#endif /* CONFIG_FAIR_GROUP_SCHED */
+#ifdef CONFIG_EXT_GROUP_SCHED
+ root_task_group.scx_weight = CGROUP_WEIGHT_DFL;
+#endif /* CONFIG_EXT_GROUP_SCHED */
#ifdef CONFIG_RT_GROUP_SCHED
root_task_group.rt_se = (struct sched_rt_entity **)ptr;
ptr += nr_cpu_ids * sizeof(void **);
@@ -10474,6 +10477,7 @@ struct task_group *sched_create_group(struct task_group *parent)
if (!alloc_rt_sched_group(tg, parent))
goto err;

+ scx_group_set_weight(tg, CGROUP_WEIGHT_DFL);
alloc_uclamp_sched_group(tg, parent);

return tg;
@@ -10601,6 +10605,7 @@ void sched_move_task(struct task_struct *tsk)
put_prev_task(rq, tsk);

sched_change_group(tsk, group);
+ scx_move_task(tsk);

if (queued)
enqueue_task(rq, tsk, queue_flags);
@@ -10638,6 +10643,11 @@ static int cpu_cgroup_css_online(struct cgroup_subsys_state *css)
{
struct task_group *tg = css_tg(css);
struct task_group *parent = css_tg(css->parent);
+ int ret;
+
+ ret = scx_tg_online(tg);
+ if (ret)
+ return ret;

if (parent)
sched_online_group(tg, parent);
@@ -10652,6 +10662,13 @@ static int cpu_cgroup_css_online(struct cgroup_subsys_state *css)
return 0;
}

+static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
+{
+ struct task_group *tg = css_tg(css);
+
+ scx_tg_offline(tg);
+}
+
static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
{
struct task_group *tg = css_tg(css);
@@ -10669,9 +10686,10 @@ static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
sched_unregister_group(tg);
}

-#ifdef CONFIG_RT_GROUP_SCHED
+#if defined(CONFIG_RT_GROUP_SCHED) || defined(CONFIG_EXT_GROUP_SCHED)
static int cpu_cgroup_can_attach(struct cgroup_taskset *tset)
{
+#ifdef CONFIG_RT_GROUP_SCHED
struct task_struct *task;
struct cgroup_subsys_state *css;

@@ -10679,7 +10697,8 @@ static int cpu_cgroup_can_attach(struct cgroup_taskset *tset)
if (!sched_rt_can_attach(css_tg(css), task))
return -EINVAL;
}
- return 0;
+#endif
+ return scx_cgroup_can_attach(tset);
}
#endif

@@ -10690,8 +10709,17 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)

cgroup_taskset_for_each(task, css, tset)
sched_move_task(task);
+
+ scx_cgroup_finish_attach();
}

+#ifdef CONFIG_EXT_GROUP_SCHED
+static void cpu_cgroup_cancel_attach(struct cgroup_taskset *tset)
+{
+ scx_cgroup_cancel_attach(tset);
+}
+#endif
+
#ifdef CONFIG_UCLAMP_TASK_GROUP
static void cpu_util_update_eff(struct cgroup_subsys_state *css)
{
@@ -10870,9 +10898,15 @@ static int cpu_uclamp_max_show(struct seq_file *sf, void *v)
static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
struct cftype *cftype, u64 shareval)
{
+ int ret;
+
if (shareval > scale_load_down(ULONG_MAX))
shareval = MAX_SHARES;
- return sched_group_set_shares(css_tg(css), scale_load(shareval));
+ ret = sched_group_set_shares(css_tg(css), scale_load(shareval));
+ if (!ret)
+ scx_group_set_weight(css_tg(css),
+ sched_weight_to_cgroup(shareval));
+ return ret;
}

static u64 cpu_shares_read_u64(struct cgroup_subsys_state *css,
@@ -11380,11 +11414,15 @@ static int cpu_local_stat_show(struct seq_file *sf,
return 0;
}

-#ifdef CONFIG_FAIR_GROUP_SCHED
+#if defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_EXT_GROUP_SCHED)

static unsigned long tg_weight(struct task_group *tg)
{
+#ifdef CONFIG_FAIR_GROUP_SCHED
return scale_load_down(tg->shares);
+#else
+ return sched_weight_from_cgroup(tg->scx_weight);
+#endif
}

static u64 cpu_weight_read_u64(struct cgroup_subsys_state *css,
@@ -11397,13 +11435,17 @@ static int cpu_weight_write_u64(struct cgroup_subsys_state *css,
struct cftype *cft, u64 cgrp_weight)
{
unsigned long weight;
+ int ret;

if (cgrp_weight < CGROUP_WEIGHT_MIN || cgrp_weight > CGROUP_WEIGHT_MAX)
return -ERANGE;

weight = sched_weight_from_cgroup(cgrp_weight);

- return sched_group_set_shares(css_tg(css), scale_load(weight));
+ ret = sched_group_set_shares(css_tg(css), scale_load(weight));
+ if (!ret)
+ scx_group_set_weight(css_tg(css), cgrp_weight);
+ return ret;
}

static s64 cpu_weight_nice_read_s64(struct cgroup_subsys_state *css,
@@ -11428,7 +11470,7 @@ static int cpu_weight_nice_write_s64(struct cgroup_subsys_state *css,
struct cftype *cft, s64 nice)
{
unsigned long weight;
- int idx;
+ int idx, ret;

if (nice < MIN_NICE || nice > MAX_NICE)
return -ERANGE;
@@ -11437,7 +11479,11 @@ static int cpu_weight_nice_write_s64(struct cgroup_subsys_state *css,
idx = array_index_nospec(idx, 40);
weight = sched_prio_to_weight[idx];

- return sched_group_set_shares(css_tg(css), scale_load(weight));
+ ret = sched_group_set_shares(css_tg(css), scale_load(weight));
+ if (!ret)
+ scx_group_set_weight(css_tg(css),
+ sched_weight_to_cgroup(weight));
+ return ret;
}
#endif

@@ -11499,7 +11545,7 @@ static ssize_t cpu_max_write(struct kernfs_open_file *of,
#endif

struct cftype cpu_cftypes[CPU_CFTYPE_CNT + 1] = {
-#ifdef CONFIG_FAIR_GROUP_SCHED
+#if defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_EXT_GROUP_SCHED)
[CPU_CFTYPE_WEIGHT] = {
.name = "weight",
.flags = CFTYPE_NOT_ON_ROOT,
@@ -11512,6 +11558,8 @@ struct cftype cpu_cftypes[CPU_CFTYPE_CNT + 1] = {
.read_s64 = cpu_weight_nice_read_s64,
.write_s64 = cpu_weight_nice_write_s64,
},
+#endif
+#ifdef CONFIG_FAIR_GROUP_SCHED
[CPU_CFTYPE_IDLE] = {
.name = "idle",
.flags = CFTYPE_NOT_ON_ROOT,
@@ -11553,14 +11601,18 @@ struct cftype cpu_cftypes[CPU_CFTYPE_CNT + 1] = {
struct cgroup_subsys cpu_cgrp_subsys = {
.css_alloc = cpu_cgroup_css_alloc,
.css_online = cpu_cgroup_css_online,
+ .css_offline = cpu_cgroup_css_offline,
.css_released = cpu_cgroup_css_released,
.css_free = cpu_cgroup_css_free,
.css_extra_stat_show = cpu_extra_stat_show,
.css_local_stat_show = cpu_local_stat_show,
-#ifdef CONFIG_RT_GROUP_SCHED
+#if defined(CONFIG_RT_GROUP_SCHED) || defined(CONFIG_EXT_GROUP_SCHED)
.can_attach = cpu_cgroup_can_attach,
#endif
.attach = cpu_cgroup_attach,
+#ifdef CONFIG_EXT_GROUP_SCHED
+ .cancel_attach = cpu_cgroup_cancel_attach,
+#endif
.legacy_cftypes = cpu_legacy_cftypes,
.dfl_cftypes = cpu_cftypes,
.early_init = true,
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 98d977c71a4f..80c313a56958 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -89,10 +89,16 @@ enum scx_ops_flags {
*/
SCX_OPS_SWITCH_PARTIAL = 1LLU << 3,

+ /*
+ * CPU cgroup knob enable flags
+ */
+ SCX_OPS_CGROUP_KNOB_WEIGHT = 1LLU << 16, /* cpu.weight */
+
SCX_OPS_ALL_FLAGS = SCX_OPS_KEEP_BUILTIN_IDLE |
SCX_OPS_ENQ_LAST |
SCX_OPS_ENQ_EXITING |
- SCX_OPS_SWITCH_PARTIAL,
+ SCX_OPS_SWITCH_PARTIAL |
+ SCX_OPS_CGROUP_KNOB_WEIGHT,
};

/* argument container for ops.init_task() */
@@ -102,6 +108,10 @@ struct scx_init_task_args {
* to the scheduler transition path.
*/
bool fork;
+#ifdef CONFIG_EXT_GROUP_SCHED
+ /* the cgroup the task is joining */
+ struct cgroup *cgroup;
+#endif
};

/* argument container for ops.exit_task() */
@@ -110,6 +120,12 @@ struct scx_exit_task_args {
bool cancelled;
};

+/* argument container for ops->cgroup_init() */
+struct scx_cgroup_init_args {
+ /* the weight of the cgroup [1..10000] */
+ u32 weight;
+};
+
/**
* struct sched_ext_ops - Operation table for BPF scheduler implementation
*
@@ -366,6 +382,79 @@ struct sched_ext_ops {
*/
void (*disable)(struct task_struct *p);

+#ifdef CONFIG_EXT_GROUP_SCHED
+ /**
+ * cgroup_init - Initialize a cgroup
+ * @cgrp: cgroup being initialized
+ * @args: init arguments, see the struct definition
+ *
+ * Either the BPF scheduler is being loaded or @cgrp created, initialize
+ * @cgrp for sched_ext. This operation may block.
+ *
+ * Return 0 for success, -errno for failure. An error return while
+ * loading will abort loading of the BPF scheduler. During cgroup
+ * creation, it will abort the specific cgroup creation.
+ */
+ s32 (*cgroup_init)(struct cgroup *cgrp,
+ struct scx_cgroup_init_args *args);
+
+ /**
+ * cgroup_exit - Exit a cgroup
+ * @cgrp: cgroup being exited
+ *
+ * Either the BPF scheduler is being unloaded or @cgrp destroyed, exit
+ * @cgrp for sched_ext. This operation may block.
+ */
+ void (*cgroup_exit)(struct cgroup *cgrp);
+
+ /**
+ * cgroup_prep_move - Prepare a task to be moved to a different cgroup
+ * @p: task being moved
+ * @from: cgroup @p is being moved from
+ * @to: cgroup @p is being moved to
+ *
+ * Prepare @p for move from cgroup @from to @to. This operation may
+ * block and can be used for allocations.
+ *
+ * Return 0 for success, -errno for failure. An error return aborts the
+ * migration.
+ */
+ s32 (*cgroup_prep_move)(struct task_struct *p,
+ struct cgroup *from, struct cgroup *to);
+
+ /**
+ * cgroup_move - Commit cgroup move
+ * @p: task being moved
+ * @from: cgroup @p is being moved from
+ * @to: cgroup @p is being moved to
+ *
+ * Commit the move. @p is dequeued during this operation.
+ */
+ void (*cgroup_move)(struct task_struct *p,
+ struct cgroup *from, struct cgroup *to);
+
+ /**
+ * cgroup_cancel_move - Cancel cgroup move
+ * @p: task whose cgroup move is being canceled
+ * @from: cgroup @p was being moved from
+ * @to: cgroup @p was being moved to
+ *
+ * @p was cgroup_prep_move()'d but failed before reaching cgroup_move().
+ * Undo the preparation.
+ */
+ void (*cgroup_cancel_move)(struct task_struct *p,
+ struct cgroup *from, struct cgroup *to);
+
+ /**
+ * cgroup_set_weight - A cgroup's weight is being changed
+ * @cgrp: cgroup whose weight is being updated
+ * @weight: new weight [1..10000]
+ *
+ * Update @cgrp's weight to @weight.
+ */
+ void (*cgroup_set_weight)(struct cgroup *cgrp, u32 weight);
+#endif /* CONFIG_EXT_GROUP_SCHED */
+
/*
* All online ops must come before ops.init().
*/
@@ -492,6 +581,11 @@ enum scx_kick_flags {
SCX_KICK_PREEMPT = 1LLU << 1,
};

+enum scx_tg_flags {
+ SCX_TG_ONLINE = 1U << 0,
+ SCX_TG_INITED = 1U << 1,
+};
+
enum scx_ops_enable_state {
SCX_OPS_PREPPING,
SCX_OPS_ENABLING,
@@ -2539,6 +2633,28 @@ static void task_tick_scx(struct rq *rq, struct task_struct *curr, int queued)
resched_curr(rq);
}

+#ifdef CONFIG_EXT_GROUP_SCHED
+static struct cgroup *tg_cgrp(struct task_group *tg)
+{
+ /*
+ * If CGROUP_SCHED is disabled, @tg is NULL. If @tg is an autogroup,
+ * @tg->css.cgroup is NULL. In both cases, @tg can be treated as the
+ * root cgroup.
+ */
+ if (tg && tg->css.cgroup)
+ return tg->css.cgroup;
+ else
+ return &cgrp_dfl_root.cgrp;
+}
+
+#define SCX_INIT_TASK_ARGS_CGROUP(tg) .cgroup = tg_cgrp(tg),
+
+#else /* CONFIG_EXT_GROUP_SCHED */
+
+#define SCX_INIT_TASK_ARGS_CGROUP(tg)
+
+#endif /* CONFIG_EXT_GROUP_SCHED */
+
static enum scx_task_state scx_get_task_state(const struct task_struct *p)
{
return (p->scx.flags & SCX_TASK_STATE_MASK) >> SCX_TASK_STATE_SHIFT;
@@ -2583,6 +2699,7 @@ static int scx_ops_init_task(struct task_struct *p, struct task_group *tg, bool

if (SCX_HAS_OP(init_task)) {
struct scx_init_task_args args = {
+ SCX_INIT_TASK_ARGS_CGROUP(tg)
.fork = fork,
};

@@ -2641,7 +2758,7 @@ static void scx_ops_enable_task(struct task_struct *p)
scx_set_task_state(p, SCX_TASK_ENABLED);

if (SCX_HAS_OP(set_weight))
- SCX_CALL_OP(SCX_KF_REST, set_weight, p, p->scx.weight);
+ SCX_CALL_OP_TASK(SCX_KF_REST, set_weight, p, p->scx.weight);
}

static void scx_ops_disable_task(struct task_struct *p)
@@ -2851,6 +2968,180 @@ bool scx_can_stop_tick(struct rq *rq)
}
#endif

+#ifdef CONFIG_EXT_GROUP_SCHED
+
+DEFINE_STATIC_PERCPU_RWSEM(scx_cgroup_rwsem);
+
+int scx_tg_online(struct task_group *tg)
+{
+ int ret = 0;
+
+ WARN_ON_ONCE(tg->scx_flags & (SCX_TG_ONLINE | SCX_TG_INITED));
+
+ percpu_down_read(&scx_cgroup_rwsem);
+
+ if (SCX_HAS_OP(cgroup_init)) {
+ struct scx_cgroup_init_args args = { .weight = tg->scx_weight };
+
+ ret = SCX_CALL_OP_RET(SCX_KF_SLEEPABLE, cgroup_init,
+ tg->css.cgroup, &args);
+ if (!ret)
+ tg->scx_flags |= SCX_TG_ONLINE | SCX_TG_INITED;
+ else
+ ret = ops_sanitize_err("cgroup_init", ret);
+ } else {
+ tg->scx_flags |= SCX_TG_ONLINE;
+ }
+
+ percpu_up_read(&scx_cgroup_rwsem);
+ return ret;
+}
+
+void scx_tg_offline(struct task_group *tg)
+{
+ WARN_ON_ONCE(!(tg->scx_flags & SCX_TG_ONLINE));
+
+ percpu_down_read(&scx_cgroup_rwsem);
+
+ if (SCX_HAS_OP(cgroup_exit) && (tg->scx_flags & SCX_TG_INITED))
+ SCX_CALL_OP(SCX_KF_SLEEPABLE, cgroup_exit, tg->css.cgroup);
+ tg->scx_flags &= ~(SCX_TG_ONLINE | SCX_TG_INITED);
+
+ percpu_up_read(&scx_cgroup_rwsem);
+}
+
+int scx_cgroup_can_attach(struct cgroup_taskset *tset)
+{
+ struct cgroup_subsys_state *css;
+ struct task_struct *p;
+ int ret;
+
+ /* released in scx_finish/cancel_attach() */
+ percpu_down_read(&scx_cgroup_rwsem);
+
+ if (!scx_enabled())
+ return 0;
+
+ cgroup_taskset_for_each(p, css, tset) {
+ struct cgroup *from = tg_cgrp(task_group(p));
+ struct cgroup *to = tg_cgrp(css_tg(css));
+
+ WARN_ON_ONCE(p->scx.cgrp_moving_from);
+
+ /*
+ * sched_move_task() omits identity migrations. Let's match the
+ * behavior so that ops.cgroup_prep_move() and ops.cgroup_move()
+ * always match one-to-one.
+ */
+ if (from == to)
+ continue;
+
+ if (SCX_HAS_OP(cgroup_prep_move)) {
+ ret = SCX_CALL_OP_RET(SCX_KF_SLEEPABLE, cgroup_prep_move,
+ p, from, css->cgroup);
+ if (ret)
+ goto err;
+ }
+
+ p->scx.cgrp_moving_from = from;
+ }
+
+ return 0;
+
+err:
+ cgroup_taskset_for_each(p, css, tset) {
+ if (SCX_HAS_OP(cgroup_cancel_move) && p->scx.cgrp_moving_from)
+ SCX_CALL_OP(SCX_KF_SLEEPABLE, cgroup_cancel_move, p,
+ p->scx.cgrp_moving_from, css->cgroup);
+ p->scx.cgrp_moving_from = NULL;
+ }
+
+ percpu_up_read(&scx_cgroup_rwsem);
+ return ops_sanitize_err("cgroup_prep_move", ret);
+}
+
+void scx_move_task(struct task_struct *p)
+{
+ /*
+ * We're called from sched_move_task() which handles both cgroup and
+ * autogroup moves. Ignore the latter.
+ *
+ * Also ignore exiting tasks, because in the exit path tasks transition
+ * from the autogroup to the root group, so task_group_is_autogroup()
+ * alone isn't able to catch exiting autogroup tasks. This is safe for
+ * cgroup_move(), because cgroup migrations never happen for PF_EXITING
+ * tasks.
+ */
+ if (p->flags & PF_EXITING || task_group_is_autogroup(task_group(p)))
+ return;
+
+ if (!scx_enabled())
+ return;
+
+ /*
+ * @p must have ops.cgroup_prep_move() called on it and thus
+ * cgrp_moving_from set.
+ */
+ if (SCX_HAS_OP(cgroup_move) && !WARN_ON_ONCE(!p->scx.cgrp_moving_from))
+ SCX_CALL_OP_TASK(SCX_KF_UNLOCKED, cgroup_move, p,
+ p->scx.cgrp_moving_from, tg_cgrp(task_group(p)));
+ p->scx.cgrp_moving_from = NULL;
+}
+
+void scx_cgroup_finish_attach(void)
+{
+ percpu_up_read(&scx_cgroup_rwsem);
+}
+
+void scx_cgroup_cancel_attach(struct cgroup_taskset *tset)
+{
+ struct cgroup_subsys_state *css;
+ struct task_struct *p;
+
+ if (!scx_enabled())
+ goto out_unlock;
+
+ cgroup_taskset_for_each(p, css, tset) {
+ if (SCX_HAS_OP(cgroup_cancel_move) && p->scx.cgrp_moving_from)
+ SCX_CALL_OP(SCX_KF_SLEEPABLE, cgroup_cancel_move, p,
+ p->scx.cgrp_moving_from, css->cgroup);
+ p->scx.cgrp_moving_from = NULL;
+ }
+out_unlock:
+ percpu_up_read(&scx_cgroup_rwsem);
+}
+
+void scx_group_set_weight(struct task_group *tg, unsigned long weight)
+{
+ percpu_down_read(&scx_cgroup_rwsem);
+
+ if (tg->scx_weight != weight) {
+ if (SCX_HAS_OP(cgroup_set_weight))
+ SCX_CALL_OP(SCX_KF_SLEEPABLE, cgroup_set_weight,
+ tg_cgrp(tg), weight);
+ tg->scx_weight = weight;
+ }
+
+ percpu_up_read(&scx_cgroup_rwsem);
+}
+
+static void scx_cgroup_lock(void)
+{
+ percpu_down_write(&scx_cgroup_rwsem);
+}
+
+static void scx_cgroup_unlock(void)
+{
+ percpu_up_write(&scx_cgroup_rwsem);
+}
+
+#else /* CONFIG_EXT_GROUP_SCHED */
+
+static inline void scx_cgroup_lock(void) {}
+static inline void scx_cgroup_unlock(void) {}
+
+#endif /* CONFIG_EXT_GROUP_SCHED */
+
/*
* Omitted operations:
*
@@ -2980,6 +3271,131 @@ static void destroy_dsq(u64 dsq_id)
rcu_read_unlock();
}

+#ifdef CONFIG_EXT_GROUP_SCHED
+static void scx_cgroup_exit(void)
+{
+ struct cgroup_subsys_state *css;
+
+ percpu_rwsem_assert_held(&scx_cgroup_rwsem);
+
+ /*
+ * scx_tg_on/offline() are excluded through scx_cgroup_rwsem. If we walk
+ * cgroups and exit all the inited ones, all online cgroups are exited.
+ */
+ rcu_read_lock();
+ css_for_each_descendant_post(css, &root_task_group.css) {
+ struct task_group *tg = css_tg(css);
+
+ if (!(tg->scx_flags & SCX_TG_INITED))
+ continue;
+ tg->scx_flags &= ~SCX_TG_INITED;
+
+ if (!scx_ops.cgroup_exit)
+ continue;
+
+ if (WARN_ON_ONCE(!css_tryget(css)))
+ continue;
+ rcu_read_unlock();
+
+ SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_exit, css->cgroup);
+
+ rcu_read_lock();
+ css_put(css);
+ }
+ rcu_read_unlock();
+}
+
+static int scx_cgroup_init(void)
+{
+ struct cgroup_subsys_state *css;
+ int ret;
+
+ percpu_rwsem_assert_held(&scx_cgroup_rwsem);
+
+ /*
+ * scx_tg_on/offline() are excluded through scx_cgroup_rwsem. If we walk
+ * cgroups and init, all online cgroups are initialized.
+ */
+ rcu_read_lock();
+ css_for_each_descendant_pre(css, &root_task_group.css) {
+ struct task_group *tg = css_tg(css);
+ struct scx_cgroup_init_args args = { .weight = tg->scx_weight };
+
+ if ((tg->scx_flags &
+ (SCX_TG_ONLINE | SCX_TG_INITED)) != SCX_TG_ONLINE)
+ continue;
+
+ if (!scx_ops.cgroup_init) {
+ tg->scx_flags |= SCX_TG_INITED;
+ continue;
+ }
+
+ if (WARN_ON_ONCE(!css_tryget(css)))
+ continue;
+ rcu_read_unlock();
+
+ ret = SCX_CALL_OP_RET(SCX_KF_SLEEPABLE, cgroup_init,
+ css->cgroup, &args);
+ if (ret) {
+ css_put(css);
+ return ret;
+ }
+ tg->scx_flags |= SCX_TG_INITED;
+
+ rcu_read_lock();
+ css_put(css);
+ }
+ rcu_read_unlock();
+
+ return 0;
+}
+
+static void scx_cgroup_config_knobs(void)
+{
+ static DEFINE_MUTEX(cgintf_mutex);
+ DECLARE_BITMAP(mask, CPU_CFTYPE_CNT) = { };
+ u64 knob_flags;
+ int i;
+
+ /*
+ * Called from both class switch and ops enable/disable paths,
+ * synchronize internally.
+ */
+ mutex_lock(&cgintf_mutex);
+
+ /* if fair is in use, all knobs should be shown */
+ if (!scx_switched_all()) {
+ bitmap_fill(mask, CPU_CFTYPE_CNT);
+ goto apply;
+ }
+
+ /*
+ * On ext, only show the supported knobs. Otherwise, show all possible
+ * knobs so that configuration attempts succeed and the states are
+ * remembered while ops is not loaded.
+ */
+ if (scx_enabled())
+ knob_flags = scx_ops.flags;
+ else
+ knob_flags = SCX_OPS_ALL_FLAGS;
+
+ if (knob_flags & SCX_OPS_CGROUP_KNOB_WEIGHT) {
+ __set_bit(CPU_CFTYPE_WEIGHT, mask);
+ __set_bit(CPU_CFTYPE_WEIGHT_NICE, mask);
+ }
+apply:
+ for (i = 0; i < CPU_CFTYPE_CNT; i++)
+ cgroup_show_cftype(&cpu_cftypes[i], test_bit(i, mask));
+
+ mutex_unlock(&cgintf_mutex);
+}
+
+#else
+static void scx_cgroup_exit(void) {}
+static int scx_cgroup_init(void) { return 0; }
+static void scx_cgroup_config_knobs(void) {}
+#endif
+

/********************************************************************************
* Sysfs interface and ops enable/disable.
@@ -3260,11 +3676,12 @@ static void scx_ops_disable_workfn(struct kthread_work *work)
WRITE_ONCE(scx_switching_all, false);

/*
- * Avoid racing against fork. See scx_ops_enable() for explanation on
- * the locking order.
+ * Avoid racing against fork and cgroup changes. See scx_ops_enable()
+ * for explanation on the locking order.
*/
percpu_down_write(&scx_fork_rwsem);
cpus_read_lock();
+ scx_cgroup_lock();

spin_lock_irq(&scx_tasks_lock);
scx_task_iter_init(&sti);
@@ -3295,6 +3712,9 @@ static void scx_ops_disable_workfn(struct kthread_work *work)
static_branch_disable_cpuslocked(&scx_builtin_idle_enabled);
synchronize_rcu();

+ scx_cgroup_exit();
+
+ scx_cgroup_unlock();
cpus_read_unlock();
percpu_up_write(&scx_fork_rwsem);

@@ -3348,6 +3768,8 @@ static void scx_ops_disable_workfn(struct kthread_work *work)

WARN_ON_ONCE(scx_ops_set_enable_state(SCX_OPS_DISABLED) !=
SCX_OPS_DISABLING);
+
+ scx_cgroup_config_knobs();
done:
scx_ops_bypass(false);
}
@@ -3636,11 +4058,17 @@ static int scx_ops_enable(struct sched_ext_ops *ops)
scx_watchdog_timeout / 2);

/*
- * Lock out forks before opening the floodgate so that they don't wander
- * into the operations prematurely.
+ * Lock out forks, cgroup on/offlining and moves before opening the
+ * floodgate so that they don't wander into the operations prematurely.
*
- * We don't need to keep the CPUs stable but grab cpus_read_lock() to
- * ease future locking changes for cgroup suport.
+ * We don't need to keep the CPUs stable but static_branch_*() requires
+ * cpus_read_lock() and scx_cgroup_rwsem must nest inside
+ * cpu_hotplug_lock because of the following dependency chain:
+ *
+ * cpu_hotplug_lock --> cgroup_threadgroup_rwsem --> scx_cgroup_rwsem
+ *
+ * So, we need to do cpus_read_lock() before scx_cgroup_lock() and use
+ * static_branch_*_cpuslocked().
*
* Note that cpu_hotplug_lock must nest inside scx_fork_rwsem due to the
* following dependency chain:
@@ -3649,6 +4077,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops)
*/
percpu_down_write(&scx_fork_rwsem);
cpus_read_lock();
+ scx_cgroup_lock();

for (i = SCX_OPI_NORMAL_BEGIN; i < SCX_OPI_NORMAL_END; i++)
if (((void (**)(void))ops)[i])
@@ -3667,6 +4096,14 @@ static int scx_ops_enable(struct sched_ext_ops *ops)
static_branch_disable_cpuslocked(&scx_builtin_idle_enabled);
}

+ /*
+ * All cgroups should be initialized before letting in tasks. cgroup
+ * on/offlining and task migrations are already locked out.
+ */
+ ret = scx_cgroup_init();
+ if (ret)
+ goto err_disable_unlock_all;
+
static_branch_enable_cpuslocked(&__scx_ops_enabled);

/*
@@ -3750,6 +4187,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops)

spin_unlock_irq(&scx_tasks_lock);
preempt_enable();
+ scx_cgroup_unlock();
cpus_read_unlock();
percpu_up_write(&scx_fork_rwsem);

@@ -3766,6 +4204,8 @@ static int scx_ops_enable(struct sched_ext_ops *ops)
kobject_uevent(scx_root_kobj, KOBJ_ADD);
mutex_unlock(&scx_ops_enable_mutex);

+ scx_cgroup_config_knobs();
+
return 0;

err_del:
@@ -3782,6 +4222,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops)
return ret;

err_disable_unlock_all:
+ scx_cgroup_unlock();
percpu_up_write(&scx_fork_rwsem);
err_disable_unlock_cpus:
cpus_read_unlock();
@@ -3973,6 +4414,11 @@ static int bpf_scx_check_member(const struct btf_type *t,

switch (moff) {
case offsetof(struct sched_ext_ops, init_task):
+#ifdef CONFIG_EXT_GROUP_SCHED
+ case offsetof(struct sched_ext_ops, cgroup_init):
+ case offsetof(struct sched_ext_ops, cgroup_exit):
+ case offsetof(struct sched_ext_ops, cgroup_prep_move):
+#endif
case offsetof(struct sched_ext_ops, init):
case offsetof(struct sched_ext_ops, exit):
break;
@@ -4041,6 +4487,14 @@ static s32 init_task_stub(struct task_struct *p, struct scx_init_task_args *args
static void exit_task_stub(struct task_struct *p, struct scx_exit_task_args *args) {}
static void enable_stub(struct task_struct *p) {}
static void disable_stub(struct task_struct *p) {}
+#ifdef CONFIG_EXT_GROUP_SCHED
+static s32 cgroup_init_stub(struct cgroup *cgrp, struct scx_cgroup_init_args *args) { return -EINVAL; }
+static void cgroup_exit_stub(struct cgroup *cgrp) {}
+static s32 cgroup_prep_move_stub(struct task_struct *p, struct cgroup *from, struct cgroup *to) { return -EINVAL; }
+static void cgroup_move_stub(struct task_struct *p, struct cgroup *from, struct cgroup *to) {}
+static void cgroup_cancel_move_stub(struct task_struct *p, struct cgroup *from, struct cgroup *to) {}
+static void cgroup_set_weight_stub(struct cgroup *cgrp, u32 weight) {}
+#endif
static s32 init_stub(void) { return -EINVAL; }
static void exit_stub(struct scx_exit_info *info) {}

@@ -4061,6 +4515,14 @@ static struct sched_ext_ops __bpf_ops_sched_ext_ops = {
.exit_task = exit_task_stub,
.enable = enable_stub,
.disable = disable_stub,
+#ifdef CONFIG_EXT_GROUP_SCHED
+ .cgroup_init = cgroup_init_stub,
+ .cgroup_exit = cgroup_exit_stub,
+ .cgroup_prep_move = cgroup_prep_move_stub,
+ .cgroup_move = cgroup_move_stub,
+ .cgroup_cancel_move = cgroup_cancel_move_stub,
+ .cgroup_set_weight = cgroup_set_weight_stub,
+#endif
.init = init_stub,
.exit = exit_stub,
};
@@ -4229,7 +4691,8 @@ void __init init_sched_ext_class(void)
* definitions so that BPF scheduler implementations can use them
* through the generated vmlinux.h.
*/
- WRITE_ONCE(v, SCX_ENQ_WAKEUP | SCX_DEQ_SLEEP | SCX_KICK_PREEMPT);
+ WRITE_ONCE(v, SCX_ENQ_WAKEUP | SCX_DEQ_SLEEP | SCX_KICK_PREEMPT |
+ SCX_TG_ONLINE);

BUG_ON(rhashtable_init(&dsq_hash, &dsq_hash_params));
init_dsq(&scx_dsq_global, SCX_DSQ_GLOBAL);
@@ -4251,6 +4714,7 @@ void __init init_sched_ext_class(void)

register_sysrq_key('S', &sysrq_sched_ext_reset_op);
INIT_DELAYED_WORK(&scx_watchdog_work, scx_watchdog_workfn);
+ scx_cgroup_config_knobs();
}


@@ -4266,8 +4730,8 @@ __bpf_kfunc_start_defs();
* @dsq_id: DSQ to create
* @node: NUMA node to allocate from
*
- * Create a custom DSQ identified by @dsq_id. Can be called from ops.init() and
- * ops.init_task().
+ * Create a custom DSQ identified by @dsq_id. Can be called from ops.init(),
+ * ops.init_task(), ops.cgroup_init() and ops.cgroup_prep_move().
*/
__bpf_kfunc s32 scx_bpf_create_dsq(u64 dsq_id, s32 node)
{
@@ -4941,6 +5405,41 @@ __bpf_kfunc s32 scx_bpf_task_cpu(const struct task_struct *p)
return task_cpu(p);
}

+/**
+ * scx_bpf_task_cgroup - Return the sched cgroup of a task
+ * @p: task of interest
+ *
+ * @p->sched_task_group->css.cgroup represents the cgroup @p is associated with
+ * from the scheduler's POV. SCX operations should use this function to
+ * determine @p's current cgroup as, unlike following @p->cgroups,
+ * @p->sched_task_group is protected by @p's rq lock and thus atomic w.r.t. all
+ * rq-locked operations. Can be called on the parameter tasks of rq-locked
+ * operations. The restriction guarantees that @p's rq is locked by the caller.
+ */
+#ifdef CONFIG_CGROUP_SCHED
+__bpf_kfunc struct cgroup *scx_bpf_task_cgroup(struct task_struct *p)
+{
+ struct task_group *tg = p->sched_task_group;
+ struct cgroup *cgrp = &cgrp_dfl_root.cgrp;
+
+ if (!scx_kf_allowed_on_arg_tasks(__SCX_KF_RQ_LOCKED, p))
+ goto out;
+
+ /*
+ * A task_group may either be a cgroup or an autogroup. In the latter
+ * case, @tg->css.cgroup is %NULL. A task_group can't become the other
+ * kind once created.
+ */
+ if (tg && tg->css.cgroup)
+ cgrp = tg->css.cgroup;
+ else
+ cgrp = &cgrp_dfl_root.cgrp;
+out:
+ cgroup_get(cgrp);
+ return cgrp;
+}
+#endif
+
__bpf_kfunc_end_defs();

BTF_KFUNCS_START(scx_kfunc_ids_any)
@@ -4961,6 +5460,9 @@ BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu, KF_RCU)
BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu, KF_RCU)
BTF_ID_FLAGS(func, scx_bpf_task_running, KF_RCU)
BTF_ID_FLAGS(func, scx_bpf_task_cpu, KF_RCU)
+#ifdef CONFIG_CGROUP_SCHED
+BTF_ID_FLAGS(func, scx_bpf_task_cgroup, KF_RCU | KF_ACQUIRE)
+#endif
BTF_KFUNCS_END(scx_kfunc_ids_any)

static const struct btf_kfunc_id_set scx_kfunc_set_any = {
diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h
index 954ae4c2b53d..1017439dcc00 100644
--- a/kernel/sched/ext.h
+++ b/kernel/sched/ext.h
@@ -105,3 +105,23 @@ static inline void scx_update_idle(struct rq *rq, bool idle)
#else
static inline void scx_update_idle(struct rq *rq, bool idle) {}
#endif
+
+#ifdef CONFIG_CGROUP_SCHED
+#ifdef CONFIG_EXT_GROUP_SCHED
+int scx_tg_online(struct task_group *tg);
+void scx_tg_offline(struct task_group *tg);
+int scx_cgroup_can_attach(struct cgroup_taskset *tset);
+void scx_move_task(struct task_struct *p);
+void scx_cgroup_finish_attach(void);
+void scx_cgroup_cancel_attach(struct cgroup_taskset *tset);
+void scx_group_set_weight(struct task_group *tg, unsigned long cgrp_weight);
+#else /* CONFIG_EXT_GROUP_SCHED */
+static inline int scx_tg_online(struct task_group *tg) { return 0; }
+static inline void scx_tg_offline(struct task_group *tg) {}
+static inline int scx_cgroup_can_attach(struct cgroup_taskset *tset) { return 0; }
+static inline void scx_move_task(struct task_struct *p) {}
+static inline void scx_cgroup_finish_attach(void) {}
+static inline void scx_cgroup_cancel_attach(struct cgroup_taskset *tset) {}
+static inline void scx_group_set_weight(struct task_group *tg, unsigned long cgrp_weight) {}
+#endif /* CONFIG_EXT_GROUP_SCHED */
+#endif /* CONFIG_CGROUP_SCHED */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c6c9b46eeacc..0ca2378bb252 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -432,6 +432,11 @@ struct task_group {
struct rt_bandwidth rt_bandwidth;
#endif

+#ifdef CONFIG_EXT_GROUP_SCHED
+ u32 scx_flags; /* SCX_TG_* */
+ u32 scx_weight;
+#endif
+
struct rcu_head rcu;
struct list_head list;

@@ -548,6 +553,11 @@ extern void set_task_rq_fair(struct sched_entity *se,
static inline void set_task_rq_fair(struct sched_entity *se,
struct cfs_rq *prev, struct cfs_rq *next) { }
#endif /* CONFIG_SMP */
+#else /* CONFIG_FAIR_GROUP_SCHED */
+static inline int sched_group_set_shares(struct task_group *tg, unsigned long shares)
+{
+ return 0;
+}
#endif /* CONFIG_FAIR_GROUP_SCHED */

#else /* CONFIG_CGROUP_SCHED */
@@ -3549,7 +3559,7 @@ extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);

#ifdef CONFIG_CGROUP_SCHED
enum cpu_cftype_id {
-#ifdef CONFIG_FAIR_GROUP_SCHED
+#if defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_EXT_GROUP_SCHED)
CPU_CFTYPE_WEIGHT,
CPU_CFTYPE_WEIGHT_NICE,
CPU_CFTYPE_IDLE,
diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
index 8b4052034f93..f0dbaa1826a7 100644
--- a/tools/sched_ext/include/scx/common.bpf.h
+++ b/tools/sched_ext/include/scx/common.bpf.h
@@ -51,6 +51,7 @@ s32 scx_bpf_pick_idle_cpu(const cpumask_t *cpus_allowed, u64 flags) __ksym;
s32 scx_bpf_pick_any_cpu(const cpumask_t *cpus_allowed, u64 flags) __ksym;
bool scx_bpf_task_running(const struct task_struct *p) __ksym;
s32 scx_bpf_task_cpu(const struct task_struct *p) __ksym;
+struct cgroup *scx_bpf_task_cgroup(struct task_struct *p) __ksym;

static inline __attribute__((format(printf, 1, 2)))
void ___scx_bpf_exit_format_checker(const char *fmt, ...) {}
--
2.44.0


2024-05-01 15:20:31

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 30/39] sched_ext: Implement SCX_KICK_WAIT

From: David Vernet <[email protected]>

If set when calling scx_bpf_kick_cpu(), the invoking CPU will busy-wait for
the kicked CPU to enter the scheduler. See the following for example usage:

https://github.com/sched-ext/scx/blob/main/scheds/c/scx_pair.bpf.c
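
For instance, a dispatch path that must keep SMT siblings in lockstep could
use the flag roughly as in the sketch below. This is in the spirit of
scx_pair but not copied from it - pair_dispatch() and sibling_of() are
made-up names for illustration:

  void BPF_STRUCT_OPS(pair_dispatch, s32 cpu, struct task_struct *prev)
  {
          s32 sibling = sibling_of(cpu);  /* hypothetical: the SMT sibling */

          /*
           * Preempt the sibling and don't return until it has re-entered
           * the scheduler and picked its next task, so the two CPUs never
           * run tasks from different cgroups side by side.
           */
          scx_bpf_kick_cpu(sibling, SCX_KICK_PREEMPT | SCX_KICK_WAIT);

          /* ... dispatch the next pair of tasks ... */
  }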

v2: - Updated to fit the updated kick_cpus_irq_workfn() implementation.

- Include SCX_KICK_WAIT related information in debug dump.

Signed-off-by: David Vernet <[email protected]>
Reviewed-by: Tejun Heo <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
kernel/sched/core.c | 4 ++-
kernel/sched/ext.c | 82 ++++++++++++++++++++++++++++++++++++++++----
kernel/sched/ext.h | 4 +++
kernel/sched/sched.h | 2 ++
4 files changed, 85 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index de49b94844a9..d940c17dfe8a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6115,8 +6115,10 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)

for_each_active_class(class) {
p = class->pick_next_task(rq);
- if (p)
+ if (p) {
+ scx_next_task_picked(rq, p, class);
return p;
+ }
}

BUG(); /* The idle class should always have a runnable task. */
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 80c313a56958..91c3d1851b45 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -579,6 +579,12 @@ enum scx_kick_flags {
* task expires and the dispatch path is invoked.
*/
SCX_KICK_PREEMPT = 1LLU << 1,
+
+ /*
+ * Wait for the CPU to be rescheduled. The scx_bpf_kick_cpu() call will
+ * return after the target CPU finishes picking the next task.
+ */
+ SCX_KICK_WAIT = 1LLU << 2,
};

enum scx_tg_flags {
@@ -713,6 +719,9 @@ static struct {

#endif /* CONFIG_SMP */

+/* for %SCX_KICK_WAIT */
+static unsigned long __percpu *scx_kick_cpus_pnt_seqs;
+
/*
* Direct dispatch marker.
*
@@ -2315,6 +2324,23 @@ static struct task_struct *pick_next_task_scx(struct rq *rq)
return p;
}

+void scx_next_task_picked(struct rq *rq, struct task_struct *p,
+ const struct sched_class *active)
+{
+ lockdep_assert_rq_held(rq);
+
+ if (!scx_enabled())
+ return;
+#ifdef CONFIG_SMP
+ /*
+ * Pairs with the smp_load_acquire() issued by a CPU in
+ * kick_cpus_irq_workfn() who is waiting for this CPU to perform a
+ * resched.
+ */
+ smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
+#endif
+}
+
#ifdef CONFIG_SMP

static bool test_and_clear_cpu_idle(int cpu)
@@ -3868,9 +3894,9 @@ static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len)
rq->curr->sched_class == &idle_sched_class)
goto next;

- seq_buf_printf(&s, "\nCPU %-4d: nr_run=%u flags=0x%x ops_qseq=%lu\n",
+ seq_buf_printf(&s, "\nCPU %-4d: nr_run=%u flags=0x%x ops_qseq=%lu pnt_seq=%lu\n",
cpu, rq->scx.nr_running, rq->scx.flags,
- rq->scx.ops_qseq);
+ rq->scx.ops_qseq, rq->scx.pnt_seq);
seq_buf_printf(&s, " curr=%s[%d] class=%ps\n",
rq->curr->comm, rq->curr->pid,
rq->curr->sched_class);
@@ -3883,6 +3909,9 @@ static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len)
if (!cpumask_empty(rq->scx.cpus_to_preempt))
seq_buf_printf(&s, " cpus_to_preempt: %*pb\n",
cpumask_pr_args(rq->scx.cpus_to_preempt));
+ if (!cpumask_empty(rq->scx.cpus_to_wait))
+ seq_buf_printf(&s, " cpus_to_wait : %*pb\n",
+ cpumask_pr_args(rq->scx.cpus_to_wait));

if (rq->curr->sched_class == &ext_sched_class)
scx_dump_task(&s, rq->curr, '*', now);
@@ -4578,10 +4607,11 @@ static bool can_skip_idle_kick(struct rq *rq)
return !is_idle_task(rq->curr) && !(rq->scx.flags & SCX_RQ_BALANCING);
}

-static void kick_one_cpu(s32 cpu, struct rq *this_rq)
+static bool kick_one_cpu(s32 cpu, struct rq *this_rq, unsigned long *pseqs)
{
struct rq *rq = cpu_rq(cpu);
struct scx_rq *this_scx = &this_rq->scx;
+ bool should_wait = false;
unsigned long flags;

raw_spin_rq_lock_irqsave(rq, flags);
@@ -4597,12 +4627,20 @@ static void kick_one_cpu(s32 cpu, struct rq *this_rq)
cpumask_clear_cpu(cpu, this_scx->cpus_to_preempt);
}

+ if (cpumask_test_cpu(cpu, this_scx->cpus_to_wait)) {
+ pseqs[cpu] = rq->scx.pnt_seq;
+ should_wait = true;
+ }
+
resched_curr(rq);
} else {
cpumask_clear_cpu(cpu, this_scx->cpus_to_preempt);
+ cpumask_clear_cpu(cpu, this_scx->cpus_to_wait);
}

raw_spin_rq_unlock_irqrestore(rq, flags);
+
+ return should_wait;
}

static void kick_one_cpu_if_idle(s32 cpu, struct rq *this_rq)
@@ -4623,10 +4661,12 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work)
{
struct rq *this_rq = this_rq();
struct scx_rq *this_scx = &this_rq->scx;
+ unsigned long *pseqs = this_cpu_ptr(scx_kick_cpus_pnt_seqs);
+ bool should_wait = false;
s32 cpu;

for_each_cpu(cpu, this_scx->cpus_to_kick) {
- kick_one_cpu(cpu, this_rq);
+ should_wait |= kick_one_cpu(cpu, this_rq, pseqs);
cpumask_clear_cpu(cpu, this_scx->cpus_to_kick);
cpumask_clear_cpu(cpu, this_scx->cpus_to_kick_if_idle);
}
@@ -4635,6 +4675,28 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work)
kick_one_cpu_if_idle(cpu, this_rq);
cpumask_clear_cpu(cpu, this_scx->cpus_to_kick_if_idle);
}
+
+ if (!should_wait)
+ return;
+
+ for_each_cpu(cpu, this_scx->cpus_to_wait) {
+ unsigned long *wait_pnt_seq = &cpu_rq(cpu)->scx.pnt_seq;
+
+ if (cpu != cpu_of(this_rq)) {
+ /*
+ * Pairs with smp_store_release() issued by this CPU in
+ * scx_next_task_picked() on the resched path.
+ *
+ * We busy-wait here to guarantee that no other task can
+ * be scheduled on our core before the target CPU has
+ * entered the resched path.
+ */
+ while (smp_load_acquire(wait_pnt_seq) == pseqs[cpu])
+ cpu_relax();
+ }
+
+ cpumask_clear_cpu(cpu, this_scx->cpus_to_wait);
+ }
}

/**
@@ -4700,6 +4762,11 @@ void __init init_sched_ext_class(void)
BUG_ON(!alloc_cpumask_var(&idle_masks.cpu, GFP_KERNEL));
BUG_ON(!alloc_cpumask_var(&idle_masks.smt, GFP_KERNEL));
#endif
+ scx_kick_cpus_pnt_seqs =
+ __alloc_percpu(sizeof(scx_kick_cpus_pnt_seqs[0]) * nr_cpu_ids,
+ __alignof__(scx_kick_cpus_pnt_seqs[0]));
+ BUG_ON(!scx_kick_cpus_pnt_seqs);
+
for_each_possible_cpu(cpu) {
struct rq *rq = cpu_rq(cpu);

@@ -4709,6 +4776,7 @@ void __init init_sched_ext_class(void)
BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_kick, GFP_KERNEL));
BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_kick_if_idle, GFP_KERNEL));
BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_preempt, GFP_KERNEL));
+ BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_wait, GFP_KERNEL));
init_irq_work(&rq->scx.kick_cpus_irq_work, kick_cpus_irq_workfn);
}

@@ -5038,8 +5106,8 @@ __bpf_kfunc void scx_bpf_kick_cpu(s32 cpu, u64 flags)
if (flags & SCX_KICK_IDLE) {
struct rq *target_rq = cpu_rq(cpu);

- if (unlikely(flags & SCX_KICK_PREEMPT))
- scx_ops_error("PREEMPT cannot be used with SCX_KICK_IDLE");
+ if (unlikely(flags & (SCX_KICK_PREEMPT | SCX_KICK_WAIT)))
+ scx_ops_error("PREEMPT/WAIT cannot be used with SCX_KICK_IDLE");

if (raw_spin_rq_trylock(target_rq)) {
if (can_skip_idle_kick(target_rq)) {
@@ -5054,6 +5122,8 @@ __bpf_kfunc void scx_bpf_kick_cpu(s32 cpu, u64 flags)

if (flags & SCX_KICK_PREEMPT)
cpumask_set_cpu(cpu, this_rq->scx.cpus_to_preempt);
+ if (flags & SCX_KICK_WAIT)
+ cpumask_set_cpu(cpu, this_rq->scx.cpus_to_wait);
}

irq_work_queue(&this_rq->scx.kick_cpus_irq_work);
diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h
index 1017439dcc00..5db35f627ea3 100644
--- a/kernel/sched/ext.h
+++ b/kernel/sched/ext.h
@@ -29,6 +29,8 @@ static inline bool task_on_scx(const struct task_struct *p)
return scx_enabled() && p->sched_class == &ext_sched_class;
}

+void scx_next_task_picked(struct rq *rq, struct task_struct *p,
+ const struct sched_class *active);
void scx_tick(struct rq *rq);
void init_scx_entity(struct sched_ext_entity *scx);
void scx_pre_fork(struct task_struct *p);
@@ -78,6 +80,8 @@ static inline const struct sched_class *next_active_class(const struct sched_cla
#define scx_enabled() false
#define scx_switched_all() false

+static inline void scx_next_task_picked(struct rq *rq, struct task_struct *p,
+ const struct sched_class *active) {}
static inline void scx_tick(struct rq *rq) {}
static inline void scx_pre_fork(struct task_struct *p) {}
static inline int scx_fork(struct task_struct *p) { return 0; }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0ca2378bb252..c8cf6fbaed07 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -735,6 +735,8 @@ struct scx_rq {
cpumask_var_t cpus_to_kick;
cpumask_var_t cpus_to_kick_if_idle;
cpumask_var_t cpus_to_preempt;
+ cpumask_var_t cpus_to_wait;
+ unsigned long pnt_seq;
struct irq_work kick_cpus_irq_work;
};
#endif /* CONFIG_SCHED_CLASS_EXT */
--
2.44.0


2024-05-01 15:20:39

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 29/39] sched_ext: Add a cgroup scheduler which uses flattened hierarchy

This patch adds scx_flatcg example scheduler which implements hierarchical
weight-based cgroup CPU control by flattening the cgroup hierarchy into a
single layer by compounding the active weight share at each level.

This flattening of hierarchy can bring a substantial performance gain when
the cgroup hierarchy is nested multiple levels. In a simple benchmark using
wrk[8] on apache serving a CGI script calculating sha1sum of a small file,
it outperforms CFS by ~3% with the CPU controller disabled and by ~10% with
two apache instances competing with a 2:1 weight ratio nested four levels deep.

However, the gain comes at the cost of not being able to properly handle
thundering herd of cgroups. For example, if many cgroups which are nested
behind a low priority parent cgroup wake up around the same time, they may
be able to consume more CPU cycles than they are entitled to. In many use
cases, this isn't a real concern especially given the performance gain.
Also, there are ways to mitigate the problem further by e.g. introducing an
extra scheduling layer on cgroup delegation boundaries.
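
As a concrete illustration of the compounding, the following standalone C
sketch (illustrative only, not part of the scheduler; it uses floating point
for readability, whereas scx_flatcg itself compounds fixed-point hweights)
computes a leaf cgroup's eventual share the same way:

#include <stdio.h>

/* weight of a cgroup and the sum of its siblings' weights, per level */
struct level {
	unsigned int weight;
	unsigned int sibling_weight_sum;
};

/* compound the share at each level from just below the root to the leaf */
static double eventual_share(const struct level *levels, int depth)
{
	double share = 1.0;

	for (int i = 0; i < depth; i++)
		share *= (double)levels[i].weight /
			 (levels[i].weight + levels[i].sibling_weight_sum);
	return share;
}

int main(void)
{
	/*
	 * B from the hierarchy in the header comment below:
	 * A (100) vs D (200) at the top, then B (100) vs C (100).
	 */
	struct level b_path[] = { { 100, 200 }, { 100, 100 } };

	printf("%.3f\n", eventual_share(b_path, 2));	/* 1/3 * 1/2 == 1/6 */
	return 0;
}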

v3: - Updated to reflect the core API changes including ops.init/exit_task()
and direct dispatch from ops.select_cpu(). Fixes and improvements
including additional statistics.

- Use reference counted kptr for cgv_node instead of xchg'ing against
stash location.

- Dropped '-p' option.

v2: - Use SCX_BUG[_ON]() to simplify error handling.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
tools/sched_ext/Makefile | 2 +-
tools/sched_ext/scx_flatcg.bpf.c | 861 +++++++++++++++++++++++++++++++
tools/sched_ext/scx_flatcg.c | 225 ++++++++
tools/sched_ext/scx_flatcg.h | 51 ++
4 files changed, 1138 insertions(+), 1 deletion(-)
create mode 100644 tools/sched_ext/scx_flatcg.bpf.c
create mode 100644 tools/sched_ext/scx_flatcg.c
create mode 100644 tools/sched_ext/scx_flatcg.h

diff --git a/tools/sched_ext/Makefile b/tools/sched_ext/Makefile
index bf7e108f5ae1..ca3815e572d8 100644
--- a/tools/sched_ext/Makefile
+++ b/tools/sched_ext/Makefile
@@ -176,7 +176,7 @@ $(INCLUDE_DIR)/%.bpf.skel.h: $(SCXOBJ_DIR)/%.bpf.o $(INCLUDE_DIR)/vmlinux.h $(BP

SCX_COMMON_DEPS := include/scx/common.h include/scx/user_exit_info.h | $(BINDIR)

-c-sched-targets = scx_simple scx_qmap scx_central
+c-sched-targets = scx_simple scx_qmap scx_central scx_flatcg

$(addprefix $(BINDIR)/,$(c-sched-targets)): \
$(BINDIR)/%: \
diff --git a/tools/sched_ext/scx_flatcg.bpf.c b/tools/sched_ext/scx_flatcg.bpf.c
new file mode 100644
index 000000000000..20a53e087d6c
--- /dev/null
+++ b/tools/sched_ext/scx_flatcg.bpf.c
@@ -0,0 +1,861 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * A demo sched_ext flattened cgroup hierarchy scheduler. It implements
+ * hierarchical weight-based cgroup CPU control by flattening the cgroup
+ * hierarchy into a single layer by compounding the active weight share at each
+ * level. Consider the following hierarchy with weights in parentheses:
+ *
+ * R + A (100) + B (100)
+ * | \ C (100)
+ * \ D (200)
+ *
+ * Ignoring the root and threaded cgroups, only B, C and D can contain tasks.
+ * Let's say all three have runnable tasks. The total share that each of these
+ * three cgroups is entitled to can be calculated by compounding its share at
+ * each level.
+ *
+ * For example, B is competing against C and in that competition its share is
+ * 100/(100+100) == 1/2. At its parent level, A is competing against D and A's
+ * share in that competition is 100/(100+200) == 1/3. B's eventual share in the
+ * system can be calculated by multiplying the two shares, 1/2 * 1/3 == 1/6. C's
+ * eventual share is the same at 1/6. D is only competing at the top level and
+ * its share is 200/(100+200) == 2/3.
+ *
+ * So, instead of hierarchically scheduling level-by-level, we can consider it
+ * as B, C and D competing with each other with respective shares of 1/6, 1/6 and 2/3
+ * and keep updating the eventual shares as the cgroups' runnable states change.
+ *
+ * This flattening of hierarchy can bring a substantial performance gain when
+ * the cgroup hierarchy is nested multiple levels. In a simple benchmark using
+ * wrk[8] on apache serving a CGI script calculating sha1sum of a small file, it
+ * outperforms CFS by ~3% with the CPU controller disabled and by ~10% with two
+ * apache instances competing with a 2:1 weight ratio nested four levels deep.
+ *
+ * However, the gain comes at the cost of not being able to properly handle
+ * thundering herd of cgroups. For example, if many cgroups which are nested
+ * behind a low priority parent cgroup wake up around the same time, they may be
+ * able to consume more CPU cycles than they are entitled to. In many use cases,
+ * this isn't a real concern especially given the performance gain. Also, there
+ * are ways to mitigate the problem further by e.g. introducing an extra
+ * scheduling layer on cgroup delegation boundaries.
+ */
+#include <scx/common.bpf.h>
+#include "scx_flatcg.h"
+
+/*
+ * Maximum amount of retries to find a valid cgroup.
+ */
+#define CGROUP_MAX_RETRIES 1024
+
+char _license[] SEC("license") = "GPL";
+
+const volatile u32 nr_cpus = 32; /* !0 for veristat, set during init */
+const volatile u64 cgrp_slice_ns = SCX_SLICE_DFL;
+
+u64 cvtime_now;
+UEI_DEFINE(uei);
+
+struct {
+ __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+ __type(key, u32);
+ __type(value, u64);
+ __uint(max_entries, FCG_NR_STATS);
+} stats SEC(".maps");
+
+static void stat_inc(enum fcg_stat_idx idx)
+{
+ u32 idx_v = idx;
+
+ u64 *cnt_p = bpf_map_lookup_elem(&stats, &idx_v);
+ if (cnt_p)
+ (*cnt_p)++;
+}
+
+struct fcg_cpu_ctx {
+ u64 cur_cgid;
+ u64 cur_at;
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+ __type(key, u32);
+ __type(value, struct fcg_cpu_ctx);
+ __uint(max_entries, 1);
+} cpu_ctx SEC(".maps");
+
+struct {
+ __uint(type, BPF_MAP_TYPE_CGRP_STORAGE);
+ __uint(map_flags, BPF_F_NO_PREALLOC);
+ __type(key, int);
+ __type(value, struct fcg_cgrp_ctx);
+} cgrp_ctx SEC(".maps");
+
+struct cgv_node {
+ struct bpf_rb_node rb_node;
+ __u64 cvtime;
+ __u64 cgid;
+ struct bpf_refcount refcount;
+};
+
+private(CGV_TREE) struct bpf_spin_lock cgv_tree_lock;
+private(CGV_TREE) struct bpf_rb_root cgv_tree __contains(cgv_node, rb_node);
+
+struct cgv_node_stash {
+ struct cgv_node __kptr *node;
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_HASH);
+ __uint(max_entries, 16384);
+ __type(key, __u64);
+ __type(value, struct cgv_node_stash);
+} cgv_node_stash SEC(".maps");
+
+struct fcg_task_ctx {
+ u64 bypassed_at;
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
+ __uint(map_flags, BPF_F_NO_PREALLOC);
+ __type(key, int);
+ __type(value, struct fcg_task_ctx);
+} task_ctx SEC(".maps");
+
+/* gets inc'd on weight tree changes to expire the cached hweights */
+u64 hweight_gen = 1;
+
+static u64 div_round_up(u64 dividend, u64 divisor)
+{
+ return (dividend + divisor - 1) / divisor;
+}
+
+static bool vtime_before(u64 a, u64 b)
+{
+ return (s64)(a - b) < 0;
+}
+
+static bool cgv_node_less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
+{
+ struct cgv_node *cgc_a, *cgc_b;
+
+ cgc_a = container_of(a, struct cgv_node, rb_node);
+ cgc_b = container_of(b, struct cgv_node, rb_node);
+
+ return cgc_a->cvtime < cgc_b->cvtime;
+}
+
+static struct fcg_cpu_ctx *find_cpu_ctx(void)
+{
+ struct fcg_cpu_ctx *cpuc;
+ u32 idx = 0;
+
+ cpuc = bpf_map_lookup_elem(&cpu_ctx, &idx);
+ if (!cpuc) {
+ scx_bpf_error("cpu_ctx lookup failed");
+ return NULL;
+ }
+ return cpuc;
+}
+
+static struct fcg_cgrp_ctx *find_cgrp_ctx(struct cgroup *cgrp)
+{
+ struct fcg_cgrp_ctx *cgc;
+
+ cgc = bpf_cgrp_storage_get(&cgrp_ctx, cgrp, 0, 0);
+ if (!cgc) {
+ scx_bpf_error("cgrp_ctx lookup failed for cgid %llu", cgrp->kn->id);
+ return NULL;
+ }
+ return cgc;
+}
+
+static struct fcg_cgrp_ctx *find_ancestor_cgrp_ctx(struct cgroup *cgrp, int level)
+{
+ struct fcg_cgrp_ctx *cgc;
+
+ cgrp = bpf_cgroup_ancestor(cgrp, level);
+ if (!cgrp) {
+ scx_bpf_error("ancestor cgroup lookup failed");
+ return NULL;
+ }
+
+ cgc = find_cgrp_ctx(cgrp);
+ if (!cgc)
+ scx_bpf_error("ancestor cgrp_ctx lookup failed");
+ bpf_cgroup_release(cgrp);
+ return cgc;
+}
+
+static void cgrp_refresh_hweight(struct cgroup *cgrp, struct fcg_cgrp_ctx *cgc)
+{
+ int level;
+
+ if (!cgc->nr_active) {
+ stat_inc(FCG_STAT_HWT_SKIP);
+ return;
+ }
+
+ if (cgc->hweight_gen == hweight_gen) {
+ stat_inc(FCG_STAT_HWT_CACHE);
+ return;
+ }
+
+ stat_inc(FCG_STAT_HWT_UPDATES);
+ bpf_for(level, 0, cgrp->level + 1) {
+ struct fcg_cgrp_ctx *cgc;
+ bool is_active;
+
+ cgc = find_ancestor_cgrp_ctx(cgrp, level);
+ if (!cgc)
+ break;
+
+ if (!level) {
+ cgc->hweight = FCG_HWEIGHT_ONE;
+ cgc->hweight_gen = hweight_gen;
+ } else {
+ struct fcg_cgrp_ctx *pcgc;
+
+ pcgc = find_ancestor_cgrp_ctx(cgrp, level - 1);
+ if (!pcgc)
+ break;
+
+ /*
+ * We can be opportunistic here and not grab the
+ * cgv_tree_lock and deal with the occasional races.
+ * However, hweight updates are already cached and
+ * relatively low-frequency. Let's just do the
+ * straightforward thing.
+ */
+ bpf_spin_lock(&cgv_tree_lock);
+ is_active = cgc->nr_active;
+ if (is_active) {
+ cgc->hweight_gen = pcgc->hweight_gen;
+ cgc->hweight =
+ div_round_up(pcgc->hweight * cgc->weight,
+ pcgc->child_weight_sum);
+ }
+ bpf_spin_unlock(&cgv_tree_lock);
+
+ if (!is_active) {
+ stat_inc(FCG_STAT_HWT_RACE);
+ break;
+ }
+ }
+ }
+}
+
+static void cgrp_cap_budget(struct cgv_node *cgv_node, struct fcg_cgrp_ctx *cgc)
+{
+ u64 delta, cvtime, max_budget;
+
+ /*
+ * A node which is on the rbtree can't be pointed to from elsewhere yet
+ * and thus can't be updated and repositioned. Instead, we collect the
+ * vtime deltas separately and apply them asynchronously here.
+ */
+ delta = cgc->cvtime_delta;
+ __sync_fetch_and_sub(&cgc->cvtime_delta, delta);
+ cvtime = cgv_node->cvtime + delta;
+
+ /*
+ * Allow a cgroup to carry the maximum budget proportional to its
+ * hweight such that a full-hweight cgroup can immediately take up half
+ * of the CPUs at the most while staying at the front of the rbtree.
+ */
+ max_budget = (cgrp_slice_ns * nr_cpus * cgc->hweight) /
+ (2 * FCG_HWEIGHT_ONE);
+ if (vtime_before(cvtime, cvtime_now - max_budget))
+ cvtime = cvtime_now - max_budget;
+
+ cgv_node->cvtime = cvtime;
+}
+
+static void cgrp_enqueued(struct cgroup *cgrp, struct fcg_cgrp_ctx *cgc)
+{
+ struct cgv_node_stash *stash;
+ struct cgv_node *cgv_node;
+ u64 cgid = cgrp->kn->id;
+
+ /* paired with cmpxchg in try_pick_next_cgroup() */
+ if (__sync_val_compare_and_swap(&cgc->queued, 0, 1)) {
+ stat_inc(FCG_STAT_ENQ_SKIP);
+ return;
+ }
+
+ stash = bpf_map_lookup_elem(&cgv_node_stash, &cgid);
+ if (!stash || !stash->node) {
+ scx_bpf_error("cgv_node lookup failed for cgid %llu", cgid);
+ return;
+ }
+
+ cgv_node = bpf_refcount_acquire(stash->node);
+ if (!cgv_node) {
+ /*
+ * The node never leaves cgv_node_stash, so this should only
+ * happen if fcg_cgroup_exit() deletes the stashed node.
+ */
+ stat_inc(FCG_STAT_ENQ_RACE);
+ return;
+ }
+
+ bpf_spin_lock(&cgv_tree_lock);
+ cgrp_cap_budget(cgv_node, cgc);
+ bpf_rbtree_add(&cgv_tree, &cgv_node->rb_node, cgv_node_less);
+ bpf_spin_unlock(&cgv_tree_lock);
+}
+
+static void set_bypassed_at(struct task_struct *p, struct fcg_task_ctx *taskc)
+{
+ /*
+ * Tell fcg_stopping() that this bypassed the regular scheduling path
+ * and should be force charged to the cgroup. 0 is used to indicate that
+ * the task isn't bypassing, so if the current runtime is 0, go back by
+ * one nanosecond.
+ */
+ taskc->bypassed_at = p->se.sum_exec_runtime ?: (u64)-1;
+}
+
+s32 BPF_STRUCT_OPS(fcg_select_cpu, struct task_struct *p, s32 prev_cpu, u64 wake_flags)
+{
+ struct fcg_task_ctx *taskc;
+ bool is_idle = false;
+ s32 cpu;
+
+ cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);
+
+ taskc = bpf_task_storage_get(&task_ctx, p, 0, 0);
+ if (!taskc) {
+ scx_bpf_error("task_ctx lookup failed");
+ return cpu;
+ }
+
+ /*
+ * If select_cpu_dfl() is recommending local enqueue, the target CPU is
+ * idle. Follow it and charge the cgroup later in fcg_stopping() after
+ * the fact.
+ */
+ if (is_idle) {
+ set_bypassed_at(p, taskc);
+ stat_inc(FCG_STAT_LOCAL);
+ scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
+ }
+
+ return cpu;
+}
+
+void BPF_STRUCT_OPS(fcg_enqueue, struct task_struct *p, u64 enq_flags)
+{
+ struct fcg_task_ctx *taskc;
+ struct cgroup *cgrp;
+ struct fcg_cgrp_ctx *cgc;
+
+ taskc = bpf_task_storage_get(&task_ctx, p, 0, 0);
+ if (!taskc) {
+ scx_bpf_error("task_ctx lookup failed");
+ return;
+ }
+
+ /*
+ * Use the direct dispatching and force charging to deal with tasks with
+ * custom affinities so that we don't have to worry about per-cgroup
+ * dq's containing tasks that can't be executed from some CPUs.
+ */
+ if (p->nr_cpus_allowed != nr_cpus) {
+ set_bypassed_at(p, taskc);
+
+ /*
+ * The global dq is deprioritized as we don't want to let tasks
+ * boost themselves by constraining their cpumask. The
+ * deprioritization is rather severe, so let's not apply that to
+ * per-cpu kernel threads. This is ham-fisted. We probably want to
+ * implement per-cgroup fallback dq's instead so that we have
+ * more control over when tasks with custom cpumask get issued.
+ */
+ if (p->nr_cpus_allowed == 1 && (p->flags & PF_KTHREAD)) {
+ stat_inc(FCG_STAT_LOCAL);
+ scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, enq_flags);
+ } else {
+ stat_inc(FCG_STAT_GLOBAL);
+ scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
+ }
+ return;
+ }
+
+ cgrp = scx_bpf_task_cgroup(p);
+ cgc = find_cgrp_ctx(cgrp);
+ if (!cgc)
+ goto out_release;
+
+ scx_bpf_dispatch(p, cgrp->kn->id, SCX_SLICE_DFL, enq_flags);
+
+ cgrp_enqueued(cgrp, cgc);
+out_release:
+ bpf_cgroup_release(cgrp);
+}
+
+/*
+ * Walk the cgroup tree to update the active weight sums as tasks wake up and
+ * sleep. The weight sums are used as the base when calculating the proportion a
+ * given cgroup or task is entitled to at each level.
+ */
+static void update_active_weight_sums(struct cgroup *cgrp, bool runnable)
+{
+ struct fcg_cgrp_ctx *cgc;
+ bool updated = false;
+ int idx;
+
+ cgc = find_cgrp_ctx(cgrp);
+ if (!cgc)
+ return;
+
+ /*
+ * In most cases, a hot cgroup would have multiple threads going to
+ * sleep and waking up while the whole cgroup stays active. In leaf
+ * cgroups, ->nr_runnable which is updated with __sync operations gates
+ * ->nr_active updates, so that we don't have to grab the cgv_tree_lock
+ * repeatedly for a busy cgroup which is staying active.
+ */
+ if (runnable) {
+ if (__sync_fetch_and_add(&cgc->nr_runnable, 1))
+ return;
+ stat_inc(FCG_STAT_ACT);
+ } else {
+ if (__sync_sub_and_fetch(&cgc->nr_runnable, 1))
+ return;
+ stat_inc(FCG_STAT_DEACT);
+ }
+
+ /*
+ * If @cgrp is becoming runnable, its hweight should be refreshed after
+ * it's added to the weight tree so that enqueue has the up-to-date
+ * value. If @cgrp is becoming quiescent, the hweight should be
+ * refreshed before it's removed from the weight tree so that the usage
+ * charging which happens afterwards has access to the latest value.
+ */
+ if (!runnable)
+ cgrp_refresh_hweight(cgrp, cgc);
+
+ /* propagate upwards */
+ bpf_for(idx, 0, cgrp->level) {
+ int level = cgrp->level - idx;
+ struct fcg_cgrp_ctx *cgc, *pcgc = NULL;
+ bool propagate = false;
+
+ cgc = find_ancestor_cgrp_ctx(cgrp, level);
+ if (!cgc)
+ break;
+ if (level) {
+ pcgc = find_ancestor_cgrp_ctx(cgrp, level - 1);
+ if (!pcgc)
+ break;
+ }
+
+ /*
+ * We need the propagation protected by a lock to synchronize
+ * against weight changes. There's no reason to drop the lock at
+ * each level but bpf_spin_lock() doesn't want any function
+ * calls while locked.
+ */
+ bpf_spin_lock(&cgv_tree_lock);
+
+ if (runnable) {
+ if (!cgc->nr_active++) {
+ updated = true;
+ if (pcgc) {
+ propagate = true;
+ pcgc->child_weight_sum += cgc->weight;
+ }
+ }
+ } else {
+ if (!--cgc->nr_active) {
+ updated = true;
+ if (pcgc) {
+ propagate = true;
+ pcgc->child_weight_sum -= cgc->weight;
+ }
+ }
+ }
+
+ bpf_spin_unlock(&cgv_tree_lock);
+
+ if (!propagate)
+ break;
+ }
+
+ if (updated)
+ __sync_fetch_and_add(&hweight_gen, 1);
+
+ if (runnable)
+ cgrp_refresh_hweight(cgrp, cgc);
+}
+
+void BPF_STRUCT_OPS(fcg_runnable, struct task_struct *p, u64 enq_flags)
+{
+ struct cgroup *cgrp;
+
+ cgrp = scx_bpf_task_cgroup(p);
+ update_active_weight_sums(cgrp, true);
+ bpf_cgroup_release(cgrp);
+}
+
+void BPF_STRUCT_OPS(fcg_stopping, struct task_struct *p, bool runnable)
+{
+ struct fcg_task_ctx *taskc;
+ struct cgroup *cgrp;
+ struct fcg_cgrp_ctx *cgc;
+
+ taskc = bpf_task_storage_get(&task_ctx, p, 0, 0);
+ if (!taskc) {
+ scx_bpf_error("task_ctx lookup failed");
+ return;
+ }
+
+ if (!taskc->bypassed_at)
+ return;
+
+ cgrp = scx_bpf_task_cgroup(p);
+ cgc = find_cgrp_ctx(cgrp);
+ if (cgc) {
+ __sync_fetch_and_add(&cgc->cvtime_delta,
+ p->se.sum_exec_runtime - taskc->bypassed_at);
+ taskc->bypassed_at = 0;
+ }
+ bpf_cgroup_release(cgrp);
+}
+
+void BPF_STRUCT_OPS(fcg_quiescent, struct task_struct *p, u64 deq_flags)
+{
+ struct cgroup *cgrp;
+
+ cgrp = scx_bpf_task_cgroup(p);
+ update_active_weight_sums(cgrp, false);
+ bpf_cgroup_release(cgrp);
+}
+
+void BPF_STRUCT_OPS(fcg_cgroup_set_weight, struct cgroup *cgrp, u32 weight)
+{
+ struct fcg_cgrp_ctx *cgc, *pcgc = NULL;
+
+ cgc = find_cgrp_ctx(cgrp);
+ if (!cgc)
+ return;
+
+ if (cgrp->level) {
+ pcgc = find_ancestor_cgrp_ctx(cgrp, cgrp->level - 1);
+ if (!pcgc)
+ return;
+ }
+
+ bpf_spin_lock(&cgv_tree_lock);
+ if (pcgc && cgc->nr_active)
+ pcgc->child_weight_sum += (s64)weight - cgc->weight;
+ cgc->weight = weight;
+ bpf_spin_unlock(&cgv_tree_lock);
+}
+
+static bool try_pick_next_cgroup(u64 *cgidp)
+{
+ struct bpf_rb_node *rb_node;
+ struct cgv_node *cgv_node;
+ struct fcg_cgrp_ctx *cgc;
+ struct cgroup *cgrp;
+ u64 cgid;
+
+ /* pop the front cgroup and wind cvtime_now accordingly */
+ bpf_spin_lock(&cgv_tree_lock);
+
+ rb_node = bpf_rbtree_first(&cgv_tree);
+ if (!rb_node) {
+ bpf_spin_unlock(&cgv_tree_lock);
+ stat_inc(FCG_STAT_PNC_NO_CGRP);
+ *cgidp = 0;
+ return true;
+ }
+
+ rb_node = bpf_rbtree_remove(&cgv_tree, rb_node);
+ bpf_spin_unlock(&cgv_tree_lock);
+
+ if (!rb_node) {
+ /*
+ * This should never happen. bpf_rbtree_first() was called
+ * above while the tree lock was held, so the node should
+ * always be present.
+ */
+ scx_bpf_error("node could not be removed");
+ return true;
+ }
+
+ cgv_node = container_of(rb_node, struct cgv_node, rb_node);
+ cgid = cgv_node->cgid;
+
+ if (vtime_before(cvtime_now, cgv_node->cvtime))
+ cvtime_now = cgv_node->cvtime;
+
+ /*
+ * If lookup fails, the cgroup's gone. Free and move on. See
+ * fcg_cgroup_exit().
+ */
+ cgrp = bpf_cgroup_from_id(cgid);
+ if (!cgrp) {
+ stat_inc(FCG_STAT_PNC_GONE);
+ goto out_free;
+ }
+
+ cgc = bpf_cgrp_storage_get(&cgrp_ctx, cgrp, 0, 0);
+ if (!cgc) {
+ bpf_cgroup_release(cgrp);
+ stat_inc(FCG_STAT_PNC_GONE);
+ goto out_free;
+ }
+
+ if (!scx_bpf_consume(cgid)) {
+ bpf_cgroup_release(cgrp);
+ stat_inc(FCG_STAT_PNC_EMPTY);
+ goto out_stash;
+ }
+
+ /*
+ * Successfully consumed from the cgroup. This will be our current
+ * cgroup for the new slice. Refresh its hweight.
+ */
+ cgrp_refresh_hweight(cgrp, cgc);
+
+ bpf_cgroup_release(cgrp);
+
+ /*
+ * As the cgroup may have more tasks, add it back to the rbtree. Note
+ * that here we charge the full slice upfront and then refund the unused
+ * portion later according to the actual consumption. This prevents a
+ * low-priority thundering herd from saturating the machine.
+ */
+ bpf_spin_lock(&cgv_tree_lock);
+ cgv_node->cvtime += cgrp_slice_ns * FCG_HWEIGHT_ONE / (cgc->hweight ?: 1);
+ cgrp_cap_budget(cgv_node, cgc);
+ bpf_rbtree_add(&cgv_tree, &cgv_node->rb_node, cgv_node_less);
+ bpf_spin_unlock(&cgv_tree_lock);
+
+ *cgidp = cgid;
+ stat_inc(FCG_STAT_PNC_NEXT);
+ return true;
+
+out_stash:
+ /*
+ * Paired with cmpxchg in cgrp_enqueued(). If they see the following
+ * transition, they'll enqueue the cgroup. If they are earlier, we'll
+ * see their task in the dq below and requeue the cgroup.
+ */
+ __sync_val_compare_and_swap(&cgc->queued, 1, 0);
+
+ if (scx_bpf_dsq_nr_queued(cgid)) {
+ bpf_spin_lock(&cgv_tree_lock);
+ bpf_rbtree_add(&cgv_tree, &cgv_node->rb_node, cgv_node_less);
+ bpf_spin_unlock(&cgv_tree_lock);
+ stat_inc(FCG_STAT_PNC_RACE);
+ return false;
+ }
+
+out_free:
+ bpf_obj_drop(cgv_node);
+ return false;
+}
+
+void BPF_STRUCT_OPS(fcg_dispatch, s32 cpu, struct task_struct *prev)
+{
+ struct fcg_cpu_ctx *cpuc;
+ struct fcg_cgrp_ctx *cgc;
+ struct cgroup *cgrp;
+ u64 now = bpf_ktime_get_ns();
+ bool picked_next = false;
+
+ cpuc = find_cpu_ctx();
+ if (!cpuc)
+ return;
+
+ if (!cpuc->cur_cgid)
+ goto pick_next_cgroup;
+
+ if (vtime_before(now, cpuc->cur_at + cgrp_slice_ns)) {
+ if (scx_bpf_consume(cpuc->cur_cgid)) {
+ stat_inc(FCG_STAT_CNS_KEEP);
+ return;
+ }
+ stat_inc(FCG_STAT_CNS_EMPTY);
+ } else {
+ stat_inc(FCG_STAT_CNS_EXPIRE);
+ }
+
+ /*
+ * The current cgroup is expiring. It was already charged a full slice.
+ * Calculate the actual usage and accumulate the delta.
+ */
+ cgrp = bpf_cgroup_from_id(cpuc->cur_cgid);
+ if (!cgrp) {
+ stat_inc(FCG_STAT_CNS_GONE);
+ goto pick_next_cgroup;
+ }
+
+ cgc = bpf_cgrp_storage_get(&cgrp_ctx, cgrp, 0, 0);
+ if (cgc) {
+ /*
+ * We want to update the vtime delta and then look for the next
+ * cgroup to execute but the latter needs to be done in a loop
+ * and we can't keep the lock held. Oh well...
+ */
+ bpf_spin_lock(&cgv_tree_lock);
+ __sync_fetch_and_add(&cgc->cvtime_delta,
+ (cpuc->cur_at + cgrp_slice_ns - now) *
+ FCG_HWEIGHT_ONE / (cgc->hweight ?: 1));
+ bpf_spin_unlock(&cgv_tree_lock);
+ } else {
+ stat_inc(FCG_STAT_CNS_GONE);
+ }
+
+ bpf_cgroup_release(cgrp);
+
+pick_next_cgroup:
+ cpuc->cur_at = now;
+
+ if (scx_bpf_consume(SCX_DSQ_GLOBAL)) {
+ cpuc->cur_cgid = 0;
+ return;
+ }
+
+ bpf_repeat(CGROUP_MAX_RETRIES) {
+ if (try_pick_next_cgroup(&cpuc->cur_cgid)) {
+ picked_next = true;
+ break;
+ }
+ }
+
+ /*
+ * This only happens if try_pick_next_cgroup() races against enqueue
+ * path for more than CGROUP_MAX_RETRIES times, which is extremely
+ * unlikely and likely indicates an underlying bug. There shouldn't be
+ * any stall risk as the race is against enqueue.
+ */
+ if (!picked_next)
+ stat_inc(FCG_STAT_PNC_FAIL);
+}
+
+s32 BPF_STRUCT_OPS(fcg_init_task, struct task_struct *p,
+ struct scx_init_task_args *args)
+{
+ struct fcg_task_ctx *taskc;
+
+ /*
+ * @p is new. Let's ensure that its task_ctx is available. We can sleep
+ * in this function and the following will automatically use GFP_KERNEL.
+ */
+ taskc = bpf_task_storage_get(&task_ctx, p, 0,
+ BPF_LOCAL_STORAGE_GET_F_CREATE);
+ if (!taskc)
+ return -ENOMEM;
+
+ taskc->bypassed_at = 0;
+ return 0;
+}
+
+int BPF_STRUCT_OPS_SLEEPABLE(fcg_cgroup_init, struct cgroup *cgrp,
+ struct scx_cgroup_init_args *args)
+{
+ struct fcg_cgrp_ctx *cgc;
+ struct cgv_node *cgv_node;
+ struct cgv_node_stash empty_stash = {}, *stash;
+ u64 cgid = cgrp->kn->id;
+ int ret;
+
+ /*
+ * Technically incorrect as cgroup ID is full 64bit while dq ID is
+ * 63bit. Should not be a problem in practice and easy to spot in the
+ * unlikely case that it breaks.
+ */
+ ret = scx_bpf_create_dsq(cgid, -1);
+ if (ret)
+ return ret;
+
+ cgc = bpf_cgrp_storage_get(&cgrp_ctx, cgrp, 0,
+ BPF_LOCAL_STORAGE_GET_F_CREATE);
+ if (!cgc) {
+ ret = -ENOMEM;
+ goto err_destroy_dsq;
+ }
+
+ cgc->weight = args->weight;
+ cgc->hweight = FCG_HWEIGHT_ONE;
+
+ ret = bpf_map_update_elem(&cgv_node_stash, &cgid, &empty_stash,
+ BPF_NOEXIST);
+ if (ret) {
+ if (ret != -ENOMEM)
+ scx_bpf_error("unexpected stash creation error (%d)",
+ ret);
+ goto err_destroy_dsq;
+ }
+
+ stash = bpf_map_lookup_elem(&cgv_node_stash, &cgid);
+ if (!stash) {
+ scx_bpf_error("unexpected cgv_node stash lookup failure");
+ ret = -ENOENT;
+ goto err_destroy_dsq;
+ }
+
+ cgv_node = bpf_obj_new(struct cgv_node);
+ if (!cgv_node) {
+ ret = -ENOMEM;
+ goto err_del_cgv_node;
+ }
+
+ cgv_node->cgid = cgid;
+ cgv_node->cvtime = cvtime_now;
+
+ cgv_node = bpf_kptr_xchg(&stash->node, cgv_node);
+ if (cgv_node) {
+ scx_bpf_error("unexpected !NULL cgv_node stash");
+ ret = -EBUSY;
+ goto err_drop;
+ }
+
+ return 0;
+
+err_drop:
+ bpf_obj_drop(cgv_node);
+err_del_cgv_node:
+ bpf_map_delete_elem(&cgv_node_stash, &cgid);
+err_destroy_dsq:
+ scx_bpf_destroy_dsq(cgid);
+ return ret;
+}
+
+void BPF_STRUCT_OPS(fcg_cgroup_exit, struct cgroup *cgrp)
+{
+ u64 cgid = cgrp->kn->id;
+
+ /*
+ * For now, there's no way to find and remove the cgv_node if it's on the
+ * cgv_tree. Let's drain them in the dispatch path as they get popped
+ * off the front of the tree.
+ */
+ bpf_map_delete_elem(&cgv_node_stash, &cgid);
+ scx_bpf_destroy_dsq(cgid);
+}
+
+void BPF_STRUCT_OPS(fcg_exit, struct scx_exit_info *ei)
+{
+ UEI_RECORD(uei, ei);
+}
+
+SCX_OPS_DEFINE(flatcg_ops,
+ .select_cpu = (void *)fcg_select_cpu,
+ .enqueue = (void *)fcg_enqueue,
+ .dispatch = (void *)fcg_dispatch,
+ .runnable = (void *)fcg_runnable,
+ .stopping = (void *)fcg_stopping,
+ .quiescent = (void *)fcg_quiescent,
+ .init_task = (void *)fcg_init_task,
+ .cgroup_set_weight = (void *)fcg_cgroup_set_weight,
+ .cgroup_init = (void *)fcg_cgroup_init,
+ .cgroup_exit = (void *)fcg_cgroup_exit,
+ .exit = (void *)fcg_exit,
+ .flags = SCX_OPS_CGROUP_KNOB_WEIGHT | SCX_OPS_ENQ_EXITING,
+ .name = "flatcg");
diff --git a/tools/sched_ext/scx_flatcg.c b/tools/sched_ext/scx_flatcg.c
new file mode 100644
index 000000000000..bb0a832a0cfd
--- /dev/null
+++ b/tools/sched_ext/scx_flatcg.c
@@ -0,0 +1,225 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2023 Tejun Heo <[email protected]>
+ * Copyright (c) 2023 David Vernet <[email protected]>
+ */
+#include <stdio.h>
+#include <signal.h>
+#include <unistd.h>
+#include <libgen.h>
+#include <limits.h>
+#include <inttypes.h>
+#include <fcntl.h>
+#include <time.h>
+#include <bpf/bpf.h>
+#include <scx/common.h>
+#include "scx_flatcg.h"
+#include "scx_flatcg.bpf.skel.h"
+
+#ifndef FILEID_KERNFS
+#define FILEID_KERNFS 0xfe
+#endif
+
+const char help_fmt[] =
+"A flattened cgroup hierarchy sched_ext scheduler.\n"
+"\n"
+"See the top-level comment in .bpf.c for more details.\n"
+"\n"
+"Usage: %s [-s SLICE_US] [-i INTERVAL] [-v]\n"
+"\n"
+" -s SLICE_US Override slice duration\n"
+" -i INTERVAL Report interval\n"
+" -v Print libbpf debug messages\n"
+" -h Display this help and exit\n";
+
+static bool verbose;
+static volatile int exit_req;
+
+static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
+{
+ if (level == LIBBPF_DEBUG && !verbose)
+ return 0;
+ return vfprintf(stderr, format, args);
+}
+
+static void sigint_handler(int dummy)
+{
+ exit_req = 1;
+}
+
+static float read_cpu_util(__u64 *last_sum, __u64 *last_idle)
+{
+ FILE *fp;
+ char buf[4096];
+ char *line, *cur = NULL, *tok;
+ __u64 sum = 0, idle = 0;
+ __u64 delta_sum, delta_idle;
+ int idx;
+
+ fp = fopen("/proc/stat", "r");
+ if (!fp) {
+ perror("fopen(\"/proc/stat\")");
+ return 0.0;
+ }
+
+ if (!fgets(buf, sizeof(buf), fp)) {
+ perror("fgets(\"/proc/stat\")");
+ fclose(fp);
+ return 0.0;
+ }
+ fclose(fp);
+
+ line = buf;
+ for (idx = 0; (tok = strtok_r(line, " \n", &cur)); idx++) {
+ char *endp = NULL;
+ __u64 v;
+
+ if (idx == 0) {
+ line = NULL;
+ continue;
+ }
+ v = strtoull(tok, &endp, 0);
+ if (!endp || *endp != '\0') {
+ fprintf(stderr, "failed to parse %dth field of /proc/stat (\"%s\")\n",
+ idx, tok);
+ continue;
+ }
+ sum += v;
+ if (idx == 4)
+ idle = v;
+ }
+
+ delta_sum = sum - *last_sum;
+ delta_idle = idle - *last_idle;
+ *last_sum = sum;
+ *last_idle = idle;
+
+ return delta_sum ? (float)(delta_sum - delta_idle) / delta_sum : 0.0;
+}
+
+static void fcg_read_stats(struct scx_flatcg *skel, __u64 *stats)
+{
+ __u64 cnts[FCG_NR_STATS][skel->rodata->nr_cpus];
+ __u32 idx;
+
+ memset(stats, 0, sizeof(stats[0]) * FCG_NR_STATS);
+
+ for (idx = 0; idx < FCG_NR_STATS; idx++) {
+ int ret, cpu;
+
+ ret = bpf_map_lookup_elem(bpf_map__fd(skel->maps.stats),
+ &idx, cnts[idx]);
+ if (ret < 0)
+ continue;
+ for (cpu = 0; cpu < skel->rodata->nr_cpus; cpu++)
+ stats[idx] += cnts[idx][cpu];
+ }
+}
+
+int main(int argc, char **argv)
+{
+ struct scx_flatcg *skel;
+ struct bpf_link *link;
+ struct timespec intv_ts = { .tv_sec = 2, .tv_nsec = 0 };
+ bool dump_cgrps = false;
+ __u64 last_cpu_sum = 0, last_cpu_idle = 0;
+ __u64 last_stats[FCG_NR_STATS] = {};
+ unsigned long seq = 0;
+ __s32 opt;
+
+ libbpf_set_print(libbpf_print_fn);
+ signal(SIGINT, sigint_handler);
+ signal(SIGTERM, sigint_handler);
+
+ skel = SCX_OPS_OPEN(flatcg_ops, scx_flatcg);
+
+ skel->rodata->nr_cpus = libbpf_num_possible_cpus();
+
+ while ((opt = getopt(argc, argv, "s:i:dvh")) != -1) {
+ double v;
+
+ switch (opt) {
+ case 's':
+ v = strtod(optarg, NULL);
+ skel->rodata->cgrp_slice_ns = v * 1000;
+ break;
+ case 'i':
+ v = strtod(optarg, NULL);
+ intv_ts.tv_sec = v;
+ intv_ts.tv_nsec = (v - (float)intv_ts.tv_sec) * 1000000000;
+ break;
+ case 'd':
+ dump_cgrps = true;
+ break;
+ case 'v':
+ verbose = true;
+ break;
+ case 'h':
+ default:
+ fprintf(stderr, help_fmt, basename(argv[0]));
+ return opt != 'h';
+ }
+ }
+
+ printf("slice=%.1lfms intv=%.1lfs dump_cgrps=%d",
+ (double)skel->rodata->cgrp_slice_ns / 1000000.0,
+ (double)intv_ts.tv_sec + (double)intv_ts.tv_nsec / 1000000000.0,
+ dump_cgrps);
+
+ SCX_OPS_LOAD(skel, flatcg_ops, scx_flatcg, uei);
+ link = SCX_OPS_ATTACH(skel, flatcg_ops);
+
+ while (!exit_req && !UEI_EXITED(skel, uei)) {
+ __u64 acc_stats[FCG_NR_STATS];
+ __u64 stats[FCG_NR_STATS];
+ float cpu_util;
+ int i;
+
+ cpu_util = read_cpu_util(&last_cpu_sum, &last_cpu_idle);
+
+ fcg_read_stats(skel, acc_stats);
+ for (i = 0; i < FCG_NR_STATS; i++)
+ stats[i] = acc_stats[i] - last_stats[i];
+
+ memcpy(last_stats, acc_stats, sizeof(acc_stats));
+
+ printf("\n[SEQ %6lu cpu=%5.1lf hweight_gen=%" PRIu64 "]\n",
+ seq++, cpu_util * 100.0, skel->data->hweight_gen);
+ printf(" act:%6llu deact:%6llu global:%6llu local:%6llu\n",
+ stats[FCG_STAT_ACT],
+ stats[FCG_STAT_DEACT],
+ stats[FCG_STAT_GLOBAL],
+ stats[FCG_STAT_LOCAL]);
+ printf("HWT cache:%6llu update:%6llu skip:%6llu race:%6llu\n",
+ stats[FCG_STAT_HWT_CACHE],
+ stats[FCG_STAT_HWT_UPDATES],
+ stats[FCG_STAT_HWT_SKIP],
+ stats[FCG_STAT_HWT_RACE]);
+ printf("ENQ skip:%6llu race:%6llu\n",
+ stats[FCG_STAT_ENQ_SKIP],
+ stats[FCG_STAT_ENQ_RACE]);
+ printf("CNS keep:%6llu expire:%6llu empty:%6llu gone:%6llu\n",
+ stats[FCG_STAT_CNS_KEEP],
+ stats[FCG_STAT_CNS_EXPIRE],
+ stats[FCG_STAT_CNS_EMPTY],
+ stats[FCG_STAT_CNS_GONE]);
+ printf("PNC next:%6llu empty:%6llu nocgrp:%6llu gone:%6llu race:%6llu fail:%6llu\n",
+ stats[FCG_STAT_PNC_NEXT],
+ stats[FCG_STAT_PNC_EMPTY],
+ stats[FCG_STAT_PNC_NO_CGRP],
+ stats[FCG_STAT_PNC_GONE],
+ stats[FCG_STAT_PNC_RACE],
+ stats[FCG_STAT_PNC_FAIL]);
+ printf("BAD remove:%6llu\n",
+ acc_stats[FCG_STAT_BAD_REMOVAL]);
+ fflush(stdout);
+
+ nanosleep(&intv_ts, NULL);
+ }
+
+ bpf_link__destroy(link);
+ UEI_REPORT(skel, uei);
+ scx_flatcg__destroy(skel);
+ return 0;
+}
diff --git a/tools/sched_ext/scx_flatcg.h b/tools/sched_ext/scx_flatcg.h
new file mode 100644
index 000000000000..6f2ea50acb1c
--- /dev/null
+++ b/tools/sched_ext/scx_flatcg.h
@@ -0,0 +1,51 @@
+#ifndef __SCX_EXAMPLE_FLATCG_H
+#define __SCX_EXAMPLE_FLATCG_H
+
+enum {
+ FCG_HWEIGHT_ONE = 1LLU << 16,
+};
+
+enum fcg_stat_idx {
+ FCG_STAT_ACT,
+ FCG_STAT_DEACT,
+ FCG_STAT_LOCAL,
+ FCG_STAT_GLOBAL,
+
+ FCG_STAT_HWT_UPDATES,
+ FCG_STAT_HWT_CACHE,
+ FCG_STAT_HWT_SKIP,
+ FCG_STAT_HWT_RACE,
+
+ FCG_STAT_ENQ_SKIP,
+ FCG_STAT_ENQ_RACE,
+
+ FCG_STAT_CNS_KEEP,
+ FCG_STAT_CNS_EXPIRE,
+ FCG_STAT_CNS_EMPTY,
+ FCG_STAT_CNS_GONE,
+
+ FCG_STAT_PNC_NO_CGRP,
+ FCG_STAT_PNC_NEXT,
+ FCG_STAT_PNC_EMPTY,
+ FCG_STAT_PNC_GONE,
+ FCG_STAT_PNC_RACE,
+ FCG_STAT_PNC_FAIL,
+
+ FCG_STAT_BAD_REMOVAL,
+
+ FCG_NR_STATS,
+};
+
+struct fcg_cgrp_ctx {
+ u32 nr_active;
+ u32 nr_runnable;
+ u32 queued;
+ u32 weight;
+ u32 hweight;
+ u64 child_weight_sum;
+ u64 hweight_gen;
+ s64 cvtime_delta;
+ u64 tvtime_now;
+};
+
+#endif /* __SCX_EXAMPLE_FLATCG_H */
--
2.44.0


2024-05-01 15:21:03

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 31/39] sched_ext: Implement sched_ext_ops.cpu_acquire/release()

From: David Vernet <[email protected]>

Scheduler classes are strictly ordered and when a higher priority class has
tasks to run, the lower priority ones lose access to the CPU. Being able to
monitor and act on these events is necessary for use cases including
strict core-scheduling and latency management.

This patch adds two operations, ops.cpu_acquire() and ops.cpu_release(). The
former is invoked when a CPU becomes available to the BPF scheduler and the
latter when a CPU is taken away. This patch also implements
scx_bpf_reenqueue_local() which can be called from .cpu_release() to trigger
requeueing of all tasks in the local dsq of the CPU so that the tasks can be
reassigned to other available CPUs.

scx_pair is updated to use .cpu_acquire/release() along with
%SCX_KICK_WAIT to make the pair scheduling guarantee strict even when a CPU
is preempted by a higher priority scheduler class.

scx_qmap is updated to use .cpu_acquire/release() to empty the local
dsq of a preempted CPU. A similar approach can be adopted by BPF schedulers
that want to have a tight control over latency.
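
In its simplest form, a .cpu_release() implementation only needs to bounce
the local dsq. A minimal BPF-side sketch, mirroring the scx_qmap change
below (the callback name and the bpf_printk() are illustrative):

void BPF_STRUCT_OPS(my_cpu_release, s32 cpu, struct scx_cpu_release_args *args)
{
	/*
	 * A higher priority sched_class took @cpu. Pull all tasks off its
	 * local dsq and run them through ops.enqueue() again with
	 * SCX_ENQ_REENQ set so they can be placed on still-available CPUs.
	 */
	u32 cnt = scx_bpf_reenqueue_local();

	bpf_printk("cpu%d released (reason=%d), re-enqueued %u tasks",
		   cpu, args->reason, cnt);
}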

v4: Use the new SCX_KICK_IDLE to wake up a CPU after re-enqueueing.

v3: Drop the const qualifier from scx_cpu_release_args.task. BPF enforces
access control through the verifier, so the qualifier isn't actually
operative and only gets in the way when interacting with various
helpers.

v2: Add p->scx.kf_mask annotation to allow calling scx_bpf_reenqueue_local()
from ops.cpu_release() nested inside ops.init() and other sleepable
operations.

Signed-off-by: David Vernet <[email protected]>
Reviewed-by: Tejun Heo <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
include/linux/sched/ext.h | 4 +-
kernel/sched/ext.c | 200 ++++++++++++++++++++++-
kernel/sched/ext.h | 2 +
kernel/sched/sched.h | 1 +
tools/sched_ext/include/scx/common.bpf.h | 1 +
tools/sched_ext/scx_qmap.bpf.c | 37 ++++-
tools/sched_ext/scx_qmap.c | 4 +-
7 files changed, 242 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index 0a9f8e5a46af..1dc0182fb1c8 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -98,13 +98,15 @@ enum scx_kf_mask {
SCX_KF_UNLOCKED = 0, /* not sleepable, not rq locked */
/* all non-sleepables may be nested inside SLEEPABLE */
SCX_KF_SLEEPABLE = 1 << 0, /* sleepable init operations */
+ /* ENQUEUE and DISPATCH may be nested inside CPU_RELEASE */
+ SCX_KF_CPU_RELEASE = 1 << 1, /* ops.cpu_release() */
/* ops.dequeue (in REST) may be nested inside DISPATCH */
SCX_KF_DISPATCH = 1 << 2, /* ops.dispatch() */
SCX_KF_ENQUEUE = 1 << 3, /* ops.enqueue() and ops.select_cpu() */
SCX_KF_SELECT_CPU = 1 << 4, /* ops.select_cpu() */
SCX_KF_REST = 1 << 5, /* other rq-locked operations */

- __SCX_KF_RQ_LOCKED = SCX_KF_DISPATCH |
+ __SCX_KF_RQ_LOCKED = SCX_KF_CPU_RELEASE | SCX_KF_DISPATCH |
SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU | SCX_KF_REST,
__SCX_KF_TERMINAL = SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU | SCX_KF_REST,
};
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 91c3d1851b45..9bc03533cf5e 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -126,6 +126,32 @@ struct scx_cgroup_init_args {
u32 weight;
};

+enum scx_cpu_preempt_reason {
+ /* next task is being scheduled by &sched_class_rt */
+ SCX_CPU_PREEMPT_RT,
+ /* next task is being scheduled by &sched_class_dl */
+ SCX_CPU_PREEMPT_DL,
+ /* next task is being scheduled by &sched_class_stop */
+ SCX_CPU_PREEMPT_STOP,
+ /* unknown reason for SCX being preempted */
+ SCX_CPU_PREEMPT_UNKNOWN,
+};
+
+/*
+ * Argument container for ops->cpu_acquire(). Currently empty, but may be
+ * expanded in the future.
+ */
+struct scx_cpu_acquire_args {};
+
+/* argument container for ops->cpu_release() */
+struct scx_cpu_release_args {
+ /* the reason the CPU was preempted */
+ enum scx_cpu_preempt_reason reason;
+
+ /* the task that's going to be scheduled on the CPU */
+ struct task_struct *task;
+};
+
/**
* struct sched_ext_ops - Operation table for BPF scheduler implementation
*
@@ -339,6 +365,28 @@ struct sched_ext_ops {
*/
void (*update_idle)(s32 cpu, bool idle);

+ /**
+ * cpu_acquire - A CPU is becoming available to the BPF scheduler
+ * @cpu: The CPU being acquired by the BPF scheduler.
+ * @args: Acquire arguments, see the struct definition.
+ *
+ * A CPU that was previously released from the BPF scheduler is now once
+ * again under its control.
+ */
+ void (*cpu_acquire)(s32 cpu, struct scx_cpu_acquire_args *args);
+
+ /**
+ * cpu_release - A CPU is taken away from the BPF scheduler
+ * @cpu: The CPU being released by the BPF scheduler.
+ * @args: Release arguments, see the struct definition.
+ *
+ * The specified CPU is no longer under the control of the BPF
+ * scheduler. This could be because it was preempted by a higher
+ * priority sched_class, though there may be other reasons as well. The
+ * callback can consult @args->reason to determine the cause.
+ */
+ void (*cpu_release)(s32 cpu, struct scx_cpu_release_args *args);
+
/**
* init_task - Initialize a task to run in a BPF scheduler
* @p: task to initialize for BPF scheduling
@@ -534,6 +582,17 @@ enum scx_enq_flags {
*/
SCX_ENQ_PREEMPT = 1LLU << 32,

+ /*
+ * The task being enqueued was previously enqueued on the current CPU's
+ * %SCX_DSQ_LOCAL, but was removed from it in a call to the
+ * scx_bpf_reenqueue_local() kfunc. If scx_bpf_reenqueue_local() was
+ * invoked in a ->cpu_release() callback, and the task is again
+ * dispatched back to %SCX_DSQ_LOCAL by the current ->enqueue(), the
+ * task will not be scheduled on the CPU until at least the next
+ * invocation of the ->cpu_acquire() callback.
+ */
+ SCX_ENQ_REENQ = 1LLU << 40,
+
/*
* The task being enqueued is the only task available for the cpu. By
* default, ext core keeps executing such tasks but when
@@ -677,6 +736,7 @@ static bool scx_warned_zero_slice;

static DEFINE_STATIC_KEY_FALSE(scx_ops_enq_last);
static DEFINE_STATIC_KEY_FALSE(scx_ops_enq_exiting);
+DEFINE_STATIC_KEY_FALSE(scx_ops_cpu_preempt);
static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_enabled);

struct static_key_false scx_has_op[SCX_OPI_END] =
@@ -913,6 +973,12 @@ static __always_inline bool scx_kf_allowed(u32 mask)
* inside ops.dispatch(). We don't need to check the SCX_KF_SLEEPABLE
* boundary thanks to the above in_interrupt() check.
*/
+ if (unlikely(highest_bit(mask) == SCX_KF_CPU_RELEASE &&
+ (current->scx.kf_mask & higher_bits(SCX_KF_CPU_RELEASE)))) {
+ scx_ops_error("cpu_release kfunc called from a nested operation");
+ return false;
+ }
+
if (unlikely(highest_bit(mask) == SCX_KF_DISPATCH &&
(current->scx.kf_mask & higher_bits(SCX_KF_DISPATCH)))) {
scx_ops_error("dispatch kfunc called from a nested operation");
@@ -2097,6 +2163,19 @@ static int balance_scx(struct rq *rq, struct task_struct *prev,
lockdep_assert_rq_held(rq);
scx_rq->flags |= SCX_RQ_BALANCING;

+ if (static_branch_unlikely(&scx_ops_cpu_preempt) &&
+ unlikely(rq->scx.cpu_released)) {
+ /*
+ * If the previous sched_class for the current CPU was not SCX,
+ * notify the BPF scheduler that it again has control of the
+ * core. This callback complements ->cpu_release(), which is
+ * emitted in scx_next_task_picked().
+ */
+ if (SCX_HAS_OP(cpu_acquire))
+ SCX_CALL_OP(0, cpu_acquire, cpu_of(rq), NULL);
+ rq->scx.cpu_released = false;
+ }
+
if (prev_on_scx) {
WARN_ON_ONCE(prev->scx.flags & SCX_TASK_BAL_KEEP);
update_curr_scx(rq);
@@ -2104,7 +2183,9 @@ static int balance_scx(struct rq *rq, struct task_struct *prev,
/*
* If @prev is runnable & has slice left, it has priority and
* fetching more just increases latency for the fetched tasks.
- * Tell put_prev_task_scx() to put @prev on local_dsq.
+ * Tell put_prev_task_scx() to put @prev on local_dsq. If the
+ * BPF scheduler wants to handle this explicitly, it should
+ * implement ->cpu_release().
*
* See scx_ops_disable_workfn() for the explanation on the
* bypassing test.
@@ -2324,6 +2405,20 @@ static struct task_struct *pick_next_task_scx(struct rq *rq)
return p;
}

+static enum scx_cpu_preempt_reason
+preempt_reason_from_class(const struct sched_class *class)
+{
+#ifdef CONFIG_SMP
+ if (class == &stop_sched_class)
+ return SCX_CPU_PREEMPT_STOP;
+#endif
+ if (class == &dl_sched_class)
+ return SCX_CPU_PREEMPT_DL;
+ if (class == &rt_sched_class)
+ return SCX_CPU_PREEMPT_RT;
+ return SCX_CPU_PREEMPT_UNKNOWN;
+}
+
void scx_next_task_picked(struct rq *rq, struct task_struct *p,
const struct sched_class *active)
{
@@ -2339,6 +2434,40 @@ void scx_next_task_picked(struct rq *rq, struct task_struct *p,
*/
smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
#endif
+ if (!static_branch_unlikely(&scx_ops_cpu_preempt))
+ return;
+
+ /*
+ * The callback is conceptually meant to convey that the CPU is no
+ * longer under the control of SCX. Therefore, don't invoke the
+ * callback if the CPU is staying on SCX, or going idle (in which
+ * case the SCX scheduler has actively decided not to schedule any
+ * tasks on the CPU).
+ */
+ if (likely(active >= &ext_sched_class))
+ return;
+
+ /*
+ * At this point we know that SCX was preempted by a higher priority
+ * sched_class, so invoke the ->cpu_release() callback if we have not
+ * done so already. We only send the callback once between SCX being
+ * preempted and it regaining control of the CPU.
+ *
+ * ->cpu_release() complements ->cpu_acquire(), which is emitted the
+ * next time that balance_scx() is invoked.
+ */
+ if (!rq->scx.cpu_released) {
+ if (SCX_HAS_OP(cpu_release)) {
+ struct scx_cpu_release_args args = {
+ .reason = preempt_reason_from_class(active),
+ .task = p,
+ };
+
+ SCX_CALL_OP(SCX_KF_CPU_RELEASE,
+ cpu_release, cpu_of(rq), &args);
+ }
+ rq->scx.cpu_released = true;
+ }
}

#ifdef CONFIG_SMP
@@ -3735,6 +3864,7 @@ static void scx_ops_disable_workfn(struct kthread_work *work)
static_branch_disable_cpuslocked(&scx_has_op[i]);
static_branch_disable_cpuslocked(&scx_ops_enq_last);
static_branch_disable_cpuslocked(&scx_ops_enq_exiting);
+ static_branch_disable_cpuslocked(&scx_ops_cpu_preempt);
static_branch_disable_cpuslocked(&scx_builtin_idle_enabled);
synchronize_rcu();

@@ -3894,9 +4024,10 @@ static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len)
rq->curr->sched_class == &idle_sched_class)
goto next;

- seq_buf_printf(&s, "\nCPU %-4d: nr_run=%u flags=0x%x ops_qseq=%lu pnt_seq=%lu\n",
+ seq_buf_printf(&s, "\nCPU %-4d: nr_run=%u flags=0x%x cpu_rel=%d ops_qseq=%lu pnt_seq=%lu\n",
cpu, rq->scx.nr_running, rq->scx.flags,
- rq->scx.ops_qseq, rq->scx.pnt_seq);
+ rq->scx.cpu_released, rq->scx.ops_qseq,
+ rq->scx.pnt_seq);
seq_buf_printf(&s, " curr=%s[%d] class=%ps\n",
rq->curr->comm, rq->curr->pid,
rq->curr->sched_class);
@@ -4117,6 +4248,8 @@ static int scx_ops_enable(struct sched_ext_ops *ops)

if (ops->flags & SCX_OPS_ENQ_EXITING)
static_branch_enable_cpuslocked(&scx_ops_enq_exiting);
+ if (scx_ops.cpu_acquire || scx_ops.cpu_release)
+ static_branch_enable_cpuslocked(&scx_ops_cpu_preempt);

if (!ops->update_idle || (ops->flags & SCX_OPS_KEEP_BUILTIN_IDLE)) {
reset_idle_masks();
@@ -4512,6 +4645,8 @@ static bool yield_stub(struct task_struct *from, struct task_struct *to) { retur
static void set_weight_stub(struct task_struct *p, u32 weight) {}
static void set_cpumask_stub(struct task_struct *p, const struct cpumask *mask) {}
static void update_idle_stub(s32 cpu, bool idle) {}
+static void cpu_acquire_stub(s32 cpu, struct scx_cpu_acquire_args *args) {}
+static void cpu_release_stub(s32 cpu, struct scx_cpu_release_args *args) {}
static s32 init_task_stub(struct task_struct *p, struct scx_init_task_args *args) { return -EINVAL; }
static void exit_task_stub(struct task_struct *p, struct scx_exit_task_args *args) {}
static void enable_stub(struct task_struct *p) {}
@@ -4540,6 +4675,8 @@ static struct sched_ext_ops __bpf_ops_sched_ext_ops = {
.set_weight = set_weight_stub,
.set_cpumask = set_cpumask_stub,
.update_idle = update_idle_stub,
+ .cpu_acquire = cpu_acquire_stub,
+ .cpu_release = cpu_release_stub,
.init_task = init_task_stub,
.exit_task = exit_task_stub,
.enable = enable_stub,
@@ -5068,6 +5205,61 @@ static const struct btf_kfunc_id_set scx_kfunc_set_dispatch = {

__bpf_kfunc_start_defs();

+/**
+ * scx_bpf_reenqueue_local - Re-enqueue tasks on a local DSQ
+ *
+ * Iterate over all of the tasks currently enqueued on the local DSQ of the
+ * caller's CPU, and re-enqueue them in the BPF scheduler. Returns the number of
+ * processed tasks. Can only be called from ops.cpu_release().
+ */
+__bpf_kfunc u32 scx_bpf_reenqueue_local(void)
+{
+ u32 nr_enqueued, i;
+ struct rq *rq;
+ struct scx_rq *scx_rq;
+
+ if (!scx_kf_allowed(SCX_KF_CPU_RELEASE))
+ return 0;
+
+ rq = cpu_rq(smp_processor_id());
+ lockdep_assert_rq_held(rq);
+ scx_rq = &rq->scx;
+
+ /*
+ * Get the number of tasks on the local DSQ before iterating over it to
+ * pull off tasks. The enqueue callback below can signal that it wants
+ * the task to stay on the local DSQ, and we want to prevent the BPF
+ * scheduler from causing us to loop indefinitely.
+ */
+ nr_enqueued = scx_rq->local_dsq.nr;
+ for (i = 0; i < nr_enqueued; i++) {
+ struct task_struct *p;
+
+ p = first_local_task(rq);
+ WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) !=
+ SCX_OPSS_NONE);
+ WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_QUEUED));
+ WARN_ON_ONCE(p->scx.holding_cpu != -1);
+ dispatch_dequeue(scx_rq, p);
+ do_enqueue_task(rq, p, SCX_ENQ_REENQ, -1);
+ }
+
+ return nr_enqueued;
+}
+
+__bpf_kfunc_end_defs();
+
+BTF_KFUNCS_START(scx_kfunc_ids_cpu_release)
+BTF_ID_FLAGS(func, scx_bpf_reenqueue_local)
+BTF_KFUNCS_END(scx_kfunc_ids_cpu_release)
+
+static const struct btf_kfunc_id_set scx_kfunc_set_cpu_release = {
+ .owner = THIS_MODULE,
+ .set = &scx_kfunc_ids_cpu_release,
+};
+
+__bpf_kfunc_start_defs();
+
/**
* scx_bpf_kick_cpu - Trigger reschedule on a CPU
* @cpu: cpu to kick
@@ -5563,6 +5755,8 @@ static int __init scx_init(void)
&scx_kfunc_set_enqueue_dispatch)) ||
(ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
&scx_kfunc_set_dispatch)) ||
+ (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
+ &scx_kfunc_set_cpu_release)) ||
(ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
&scx_kfunc_set_any)) ||
(ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING,
diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h
index 5db35f627ea3..10f4717839c0 100644
--- a/kernel/sched/ext.h
+++ b/kernel/sched/ext.h
@@ -24,6 +24,8 @@ DECLARE_STATIC_KEY_FALSE(__scx_switched_all);
#define scx_enabled() static_branch_unlikely(&__scx_ops_enabled)
#define scx_switched_all() static_branch_unlikely(&__scx_switched_all)

+DECLARE_STATIC_KEY_FALSE(scx_ops_cpu_preempt);
+
static inline bool task_on_scx(const struct task_struct *p)
{
return scx_enabled() && p->sched_class == &ext_sched_class;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c8cf6fbaed07..e8ef7309f347 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -732,6 +732,7 @@ struct scx_rq {
u64 extra_enq_flags; /* see move_task_to_local_dsq() */
u32 nr_running;
u32 flags;
+ bool cpu_released;
cpumask_var_t cpus_to_kick;
cpumask_var_t cpus_to_kick_if_idle;
cpumask_var_t cpus_to_preempt;
diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
index f0dbaa1826a7..a3979e13aade 100644
--- a/tools/sched_ext/include/scx/common.bpf.h
+++ b/tools/sched_ext/include/scx/common.bpf.h
@@ -34,6 +34,7 @@ void scx_bpf_dispatch(struct task_struct *p, u64 dsq_id, u64 slice, u64 enq_flag
u32 scx_bpf_dispatch_nr_slots(void) __ksym;
void scx_bpf_dispatch_cancel(void) __ksym;
bool scx_bpf_consume(u64 dsq_id) __ksym;
+u32 scx_bpf_reenqueue_local(void) __ksym;
void scx_bpf_kick_cpu(s32 cpu, u64 flags) __ksym;
s32 scx_bpf_dsq_nr_queued(u64 dsq_id) __ksym;
void scx_bpf_destroy_dsq(u64 dsq_id) __ksym;
diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
index 812004bf027a..7c3b0dcae1e0 100644
--- a/tools/sched_ext/scx_qmap.bpf.c
+++ b/tools/sched_ext/scx_qmap.bpf.c
@@ -11,6 +11,8 @@
*
* - BPF-side queueing using PIDs.
* - Sleepable per-task storage allocation using ops.prep_enable().
+ * - Using ops.cpu_release() to handle a higher priority scheduling class taking
+ * the CPU away.
*
* This scheduler is primarily for demonstration and testing of sched_ext
* features and unlikely to be useful for actual workloads.
@@ -90,7 +92,7 @@ struct {
} cpu_ctx_stor SEC(".maps");

/* Statistics */
-u64 nr_enqueued, nr_dispatched, nr_dequeued;
+u64 nr_enqueued, nr_dispatched, nr_reenqueued, nr_dequeued;

s32 BPF_STRUCT_OPS(qmap_select_cpu, struct task_struct *p,
s32 prev_cpu, u64 wake_flags)
@@ -164,6 +166,22 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
return;
}

+ /*
+ * If the task was re-enqueued due to the CPU being preempted by a
+ * higher priority scheduling class, just re-enqueue the task directly
+ * on the shared DSQ. As we want another CPU to pick it up, find and
+ * kick an idle CPU.
+ */
+ if (enq_flags & SCX_ENQ_REENQ) {
+ s32 cpu;
+
+ scx_bpf_dispatch(p, SHARED_DSQ, 0, enq_flags);
+ cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
+ if (cpu >= 0)
+ scx_bpf_kick_cpu(cpu, __COMPAT_SCX_KICK_IDLE);
+ return;
+ }
+
ring = bpf_map_lookup_elem(&queue_arr, &idx);
if (!ring) {
scx_bpf_error("failed to find ring %d", idx);
@@ -257,6 +275,22 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
}
}

+void BPF_STRUCT_OPS(qmap_cpu_release, s32 cpu, struct scx_cpu_release_args *args)
+{
+ u32 cnt;
+
+ /*
+ * Called when @cpu is taken by a higher priority scheduling class. This
+ * makes @cpu no longer available for executing sched_ext tasks. As we
+ * don't want the tasks in @cpu's local dsq to sit there until @cpu
+ * becomes available again, re-enqueue them into the shared dsq. See
+ * %SCX_ENQ_REENQ handling in qmap_enqueue().
+ */
+ cnt = scx_bpf_reenqueue_local();
+ if (cnt)
+ __sync_fetch_and_add(&nr_reenqueued, cnt);
+}
+
s32 BPF_STRUCT_OPS(qmap_init_task, struct task_struct *p,
struct scx_init_task_args *args)
{
@@ -292,6 +326,7 @@ SCX_OPS_DEFINE(qmap_ops,
.enqueue = (void *)qmap_enqueue,
.dequeue = (void *)qmap_dequeue,
.dispatch = (void *)qmap_dispatch,
+ .cpu_release = (void *)qmap_cpu_release,
.init_task = (void *)qmap_init_task,
.init = (void *)qmap_init,
.exit = (void *)qmap_exit,
diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c
index 36254631589e..048b31eed17d 100644
--- a/tools/sched_ext/scx_qmap.c
+++ b/tools/sched_ext/scx_qmap.c
@@ -109,9 +109,9 @@ int main(int argc, char **argv)
long nr_enqueued = skel->bss->nr_enqueued;
long nr_dispatched = skel->bss->nr_dispatched;

- printf("stats : enq=%lu dsp=%lu delta=%ld deq=%"PRIu64"\n",
+ printf("stats : enq=%lu dsp=%lu delta=%ld reenq=%"PRIu64" deq=%"PRIu64"\n",
nr_enqueued, nr_dispatched, nr_enqueued - nr_dispatched,
- skel->bss->nr_dequeued);
+ skel->bss->nr_reenqueued, skel->bss->nr_dequeued);
fflush(stdout);
sleep(1);
}
--
2.44.0


2024-05-01 15:21:06

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 32/39] sched_ext: Implement sched_ext_ops.cpu_online/offline()

Add ops.cpu_online/offline() which are invoked when CPUs come online and
offline respectively. As the enqueue path already automatically bypasses
tasks to the local dsq on a deactivated CPU, BPF schedulers are guaranteed
to see tasks only on CPUs which are between online() and offline().

If the BPF scheduler doesn't implement ops.cpu_online/offline(), the
scheduler is automatically exited with SCX_ECODE_ACT_RESTART |
SCX_ECODE_RSN_HOTPLUG. Userspace can implement CPU hotplug support
trivially by simply reinitializing and reloading the scheduler.

scx_qmap is updated to print out online CPUs on hotplug events. Other
schedulers are updated to restart based on ecode.
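
On the userspace side this reduces to a reload loop. A rough sketch in the
shape of the scx_simple update (hedged: UEI_ECODE_RESTART() testing
SCX_ECODE_ACT_RESTART in the reported ecode, and UEI_REPORT() returning that
ecode, are assumptions inferred from the user_exit_info.h changes in this
patch):

int main(int argc, char **argv)
{
	struct scx_simple *skel;
	struct bpf_link *link;
	__u64 ecode;

restart:
	skel = SCX_OPS_OPEN(simple_ops, scx_simple);
	SCX_OPS_LOAD(skel, simple_ops, scx_simple, uei);
	link = SCX_OPS_ATTACH(skel, simple_ops);

	while (!exit_req && !UEI_EXITED(skel, uei))
		sleep(1);

	bpf_link__destroy(link);
	ecode = UEI_REPORT(skel, uei);
	scx_simple__destroy(skel);

	/*
	 * A hotplug exit carries SCX_ECODE_ACT_RESTART |
	 * SCX_ECODE_RSN_HOTPLUG (bits 48 and 32); reinitializing and
	 * reattaching the scheduler is all it takes to recover.
	 */
	if (UEI_ECODE_RESTART(ecode))
		goto restart;
	return 0;
}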

v2: - To accommodate the lock ordering change between scx_cgroup_rwsem and
cpus_read_lock(), CPU hotplug operations are put into their own SCX_OPI
block and enabled earlier during scx_ops_enable() so that
cpus_read_lock() can be dropped before acquiring scx_cgroup_rwsem.

- Auto exit with ECODE added.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
kernel/sched/ext.c | 135 ++++++++++++++++++-
tools/sched_ext/include/scx/compat.h | 29 ++++
tools/sched_ext/include/scx/user_exit_info.h | 28 ++++
tools/sched_ext/scx_central.c | 9 +-
tools/sched_ext/scx_flatcg.c | 8 +-
tools/sched_ext/scx_qmap.bpf.c | 60 +++++++++
tools/sched_ext/scx_qmap.c | 4 +
tools/sched_ext/scx_simple.c | 8 +-
8 files changed, 271 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 9bc03533cf5e..ed2452d42862 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -30,6 +30,29 @@ enum scx_exit_kind {
SCX_EXIT_ERROR_STALL, /* watchdog detected stalled runnable tasks */
};

+/*
+ * An exit code can be specified when exiting with scx_bpf_exit() or
+ * scx_ops_exit(), corresponding to exit_kind UNREG_BPF and UNREG_KERN
+ * respectively. The codes are 64bit of the format:
+ *
+ * Bits: [63  ..  48] [47  ..  32] [31 .. 0]
+ *       [  SYS ACT ] [  SYS RSN ] [  USR  ]
+ *
+ * SYS ACT: System-defined exit actions
+ * SYS RSN: System-defined exit reasons
+ * USR : User-defined exit codes and reasons
+ *
+ * Using the above, users may communicate intention and context by ORing system
+ * actions and/or system reasons with a user-defined exit code.
+ */
+enum scx_exit_code {
+ /* Reasons */
+ SCX_ECODE_RSN_HOTPLUG = 1LLU << 32,
+
+ /* Actions */
+ SCX_ECODE_ACT_RESTART = 1LLU << 48,
+};
+
/*
* scx_exit_info is passed to ops.exit() to describe why the BPF scheduler is
* being disabled.
@@ -504,7 +527,29 @@ struct sched_ext_ops {
#endif /* CONFIG_CGROUPS */

/*
- * All online ops must come before ops.init().
+ * All online ops must come before ops.cpu_online().
+ */
+
+ /**
+ * cpu_online - A CPU became online
+ * @cpu: CPU which just came up
+ *
+ * @cpu just came online. @cpu doesn't call ops.enqueue() or run tasks
+ * associated with other CPUs beforehand.
+ */
+ void (*cpu_online)(s32 cpu);
+
+ /**
+ * cpu_offline - A CPU is going offline
+ * @cpu: CPU which is going offline
+ *
+ * @cpu is going offline. @cpu doesn't call ops.enqueue() or run tasks
+ * associated with other CPUs afterwards.
+ */
+ void (*cpu_offline)(s32 cpu);
+
+ /*
+ * All CPU hotplug ops must come before ops.init().
*/

/**
@@ -543,6 +588,15 @@ struct sched_ext_ops {
*/
u32 exit_dump_len;

+ /**
+ * hotplug_seq - A sequence number that may be set by the scheduler to
+ * detect when a hotplug event has occurred during the loading process.
+ * If 0, no detection occurs. Otherwise, the scheduler will fail to
+ * load if the sequence number does not match @scx_hotplug_seq on the
+ * enable path.
+ */
+ u64 hotplug_seq;
+
/**
* name - BPF scheduler's name
*
@@ -556,7 +610,9 @@ struct sched_ext_ops {
enum scx_opi {
SCX_OPI_BEGIN = 0,
SCX_OPI_NORMAL_BEGIN = 0,
- SCX_OPI_NORMAL_END = SCX_OP_IDX(init),
+ SCX_OPI_NORMAL_END = SCX_OP_IDX(cpu_online),
+ SCX_OPI_CPU_HOTPLUG_BEGIN = SCX_OP_IDX(cpu_online),
+ SCX_OPI_CPU_HOTPLUG_END = SCX_OP_IDX(init),
SCX_OPI_END = SCX_OP_IDX(init),
};

@@ -746,6 +802,7 @@ static atomic_t scx_exit_kind = ATOMIC_INIT(SCX_EXIT_DONE);
static struct scx_exit_info *scx_exit_info;

static atomic_long_t scx_nr_rejected = ATOMIC_LONG_INIT(0);
+static atomic_long_t scx_hotplug_seq = ATOMIC_LONG_INIT(0);

/*
* The maximum amount of time in jiffies that a task may be runnable without
@@ -2172,7 +2229,8 @@ static int balance_scx(struct rq *rq, struct task_struct *prev,
* emitted in scx_next_task_picked().
*/
if (SCX_HAS_OP(cpu_acquire))
- SCX_CALL_OP(0, cpu_acquire, cpu_of(rq), NULL);
+ SCX_CALL_OP(SCX_KF_UNLOCKED, cpu_acquire, cpu_of(rq),
+ NULL);
rq->scx.cpu_released = false;
}

@@ -2700,6 +2758,34 @@ void __scx_update_idle(struct rq *rq, bool idle)
#endif
}

+static void handle_hotplug(struct rq *rq, bool online)
+{
+ int cpu = cpu_of(rq);
+
+ atomic_long_inc(&scx_hotplug_seq);
+
+ if (online && SCX_HAS_OP(cpu_online))
+ SCX_CALL_OP(SCX_KF_REST, cpu_online, cpu);
+ else if (!online && SCX_HAS_OP(cpu_offline))
+ SCX_CALL_OP(SCX_KF_REST, cpu_offline, cpu);
+ else
+ scx_ops_exit(SCX_ECODE_ACT_RESTART | SCX_ECODE_RSN_HOTPLUG,
+ "cpu %d going %s, exiting scheduler", cpu,
+ online ? "online" : "offline");
+}
+
+static void rq_online_scx(struct rq *rq, enum rq_onoff_reason reason)
+{
+ if (reason == RQ_ONOFF_HOTPLUG)
+ handle_hotplug(rq, true);
+}
+
+static void rq_offline_scx(struct rq *rq, enum rq_onoff_reason reason)
+{
+ if (reason == RQ_ONOFF_HOTPLUG)
+ handle_hotplug(rq, false);
+}
+
#else /* CONFIG_SMP */

static bool test_and_clear_cpu_idle(int cpu) { return false; }
@@ -3328,6 +3414,9 @@ DEFINE_SCHED_CLASS(ext) = {
.balance = balance_scx,
.select_task_rq = select_task_rq_scx,
.set_cpus_allowed = set_cpus_allowed_scx,
+
+ .rq_online = rq_online_scx,
+ .rq_offline = rq_offline_scx,
#endif

.task_tick = task_tick_scx,
@@ -3584,10 +3673,18 @@ static ssize_t scx_attr_nr_rejected_show(struct kobject *kobj,
}
SCX_ATTR(nr_rejected);

+static ssize_t scx_attr_hotplug_seq_show(struct kobject *kobj,
+ struct kobj_attribute *ka, char *buf)
+{
+ return sysfs_emit(buf, "%ld\n", atomic_long_read(&scx_hotplug_seq));
+}
+SCX_ATTR(hotplug_seq);
+
static struct attribute *scx_global_attrs[] = {
&scx_attr_state.attr,
&scx_attr_switch_all.attr,
&scx_attr_nr_rejected.attr,
+ &scx_attr_hotplug_seq.attr,
NULL,
};

@@ -4110,6 +4207,25 @@ static struct kthread_worker *scx_create_rt_helper(const char *name)
return helper;
}

+static void check_hotplug_seq(const struct sched_ext_ops *ops)
+{
+ unsigned long long global_hotplug_seq;
+
+ /*
+ * If a hotplug event has occurred between when a scheduler was
+ * initialized, and when we were able to attach, exit and notify user
+ * space about it.
+ */
+ if (ops->hotplug_seq) {
+ global_hotplug_seq = atomic_long_read(&scx_hotplug_seq);
+ if (ops->hotplug_seq != global_hotplug_seq) {
+ scx_ops_exit(SCX_ECODE_ACT_RESTART | SCX_ECODE_RSN_HOTPLUG,
+ "expected hotplug seq %llu did not match actual %llu",
+ ops->hotplug_seq, global_hotplug_seq);
+ }
+ }
+}
+
static int validate_ops(const struct sched_ext_ops *ops)
{
/*
@@ -4192,6 +4308,10 @@ static int scx_ops_enable(struct sched_ext_ops *ops)
}
}

+ for (i = SCX_OPI_CPU_HOTPLUG_BEGIN; i < SCX_OPI_CPU_HOTPLUG_END; i++)
+ if (((void (**)(void))ops)[i])
+ static_branch_enable_cpuslocked(&scx_has_op[i]);
+
cpus_read_unlock();

ret = validate_ops(ops);
@@ -4239,6 +4359,8 @@ static int scx_ops_enable(struct sched_ext_ops *ops)
cpus_read_lock();
scx_cgroup_lock();

+ check_hotplug_seq(ops);
+
for (i = SCX_OPI_NORMAL_BEGIN; i < SCX_OPI_NORMAL_END; i++)
if (((void (**)(void))ops)[i])
static_branch_enable_cpuslocked(&scx_has_op[i]);
@@ -4563,6 +4685,9 @@ static int bpf_scx_init_member(const struct btf_type *t,
ops->exit_dump_len =
*(u32 *)(udata + moff) ?: SCX_EXIT_DUMP_DFL_LEN;
return 1;
+ case offsetof(struct sched_ext_ops, hotplug_seq):
+ ops->hotplug_seq = *(u64 *)(udata + moff);
+ return 1;
}

return 0;
@@ -4659,6 +4784,8 @@ static void cgroup_move_stub(struct task_struct *p, struct cgroup *from, struct
static void cgroup_cancel_move_stub(struct task_struct *p, struct cgroup *from, struct cgroup *to) {}
static void cgroup_set_weight_stub(struct cgroup *cgrp, u32 weight) {}
#endif
+static void cpu_online_stub(s32 cpu) {}
+static void cpu_offline_stub(s32 cpu) {}
static s32 init_stub(void) { return -EINVAL; }
static void exit_stub(struct scx_exit_info *info) {}

@@ -4689,6 +4816,8 @@ static struct sched_ext_ops __bpf_ops_sched_ext_ops = {
.cgroup_cancel_move = cgroup_cancel_move_stub,
.cgroup_set_weight = cgroup_set_weight_stub,
#endif
+ .cpu_online = cpu_online_stub,
+ .cpu_offline = cpu_offline_stub,
.init = init_stub,
.exit = exit_stub,
};
diff --git a/tools/sched_ext/include/scx/compat.h b/tools/sched_ext/include/scx/compat.h
index 2be79bd88a25..7155b69150ff 100644
--- a/tools/sched_ext/include/scx/compat.h
+++ b/tools/sched_ext/include/scx/compat.h
@@ -8,6 +8,9 @@
#define __SCX_COMPAT_H

#include <bpf/btf.h>
+#include <fcntl.h>
+#include <stdlib.h>
+#include <unistd.h>

struct btf *__COMPAT_vmlinux_btf __attribute__((weak));

@@ -120,6 +123,28 @@ static inline bool __COMPAT_struct_has_field(const char *type, const char *field
#define __COMPAT_HAS_CPUMASKS \
__COMPAT_has_ksym("scx_bpf_nr_cpu_ids")

+static inline long scx_hotplug_seq(void)
+{
+ int fd;
+ char buf[32];
+ ssize_t len;
+ long val;
+
+ fd = open("/sys/kernel/sched_ext/hotplug_seq", O_RDONLY);
+ if (fd < 0)
+ return -ENOENT;
+
+ len = read(fd, buf, sizeof(buf) - 1);
+ SCX_BUG_ON(len <= 0, "read failed (%ld)", len);
+ buf[len] = 0;
+ close(fd);
+
+ val = strtoul(buf, NULL, 10);
+ SCX_BUG_ON(val < 0, "invalid num hotplug events: %lu", val);
+
+ return val;
+}
+
/*
* struct sched_ext_ops can change over time. If compat.bpf.h::SCX_OPS_DEFINE()
* is used to define ops and compat.h::SCX_OPS_LOAD/ATTACH() are used to load
@@ -128,12 +153,16 @@ static inline bool __COMPAT_struct_has_field(const char *type, const char *field
*
* - ops.tick(): Ignored on older kernels with a warning.
* - ops.exit_dump_len: Cleared to zero on older kernels with a warning.
+ * - ops.hotplug_seq: Ignored on older kernels.
*/
#define SCX_OPS_OPEN(__ops_name, __scx_name) ({ \
struct __scx_name *__skel; \
\
__skel = __scx_name##__open(); \
SCX_BUG_ON(!__skel, "Could not open " #__scx_name); \
+ \
+ if (__COMPAT_struct_has_field("sched_ext_ops", "hotplug_seq")) \
+ __skel->struct_ops.__ops_name->hotplug_seq = scx_hotplug_seq(); \
__skel; \
})

diff --git a/tools/sched_ext/include/scx/user_exit_info.h b/tools/sched_ext/include/scx/user_exit_info.h
index cf4293cb250e..2d86d01a9575 100644
--- a/tools/sched_ext/include/scx/user_exit_info.h
+++ b/tools/sched_ext/include/scx/user_exit_info.h
@@ -77,7 +77,35 @@ struct user_exit_info {
if (__uei->msg[0] != '\0') \
fprintf(stderr, " (%s)", __uei->msg); \
fputs("\n", stderr); \
+ __uei->exit_code; \
})

+/*
+ * We can't import vmlinux.h while compiling user C code. Let's duplicate
+ * scx_exit_code definition.
+ */
+enum scx_exit_code {
+ /* Reasons */
+ SCX_ECODE_RSN_HOTPLUG = 1LLU << 32,
+
+ /* Actions */
+ SCX_ECODE_ACT_RESTART = 1LLU << 48,
+};
+
+enum uei_ecode_mask {
+ UEI_ECODE_USER_MASK = ((1LLU << 32) - 1),
+ UEI_ECODE_SYS_RSN_MASK = ((1LLU << 16) - 1) << 32,
+ UEI_ECODE_SYS_ACT_MASK = ((1LLU << 16) - 1) << 48,
+};
+
+/*
+ * These macros interpret the ecode returned from UEI_REPORT().
+ */
+#define UEI_ECODE_USER(__ecode) ((__ecode) & UEI_ECODE_USER_MASK)
+#define UEI_ECODE_SYS_RSN(__ecode) ((__ecode) & UEI_ECODE_SYS_RSN_MASK)
+#define UEI_ECODE_SYS_ACT(__ecode) ((__ecode) & UEI_ECODE_SYS_ACT_MASK)
+
+#define UEI_ECODE_RESTART(__ecode) (UEI_ECODE_SYS_ACT((__ecode)) == SCX_ECODE_ACT_RESTART)
+
#endif /* __bpf__ */
#endif /* __USER_EXIT_INFO_H */
diff --git a/tools/sched_ext/scx_central.c b/tools/sched_ext/scx_central.c
index 2908add16880..df692dc0ccb1 100644
--- a/tools/sched_ext/scx_central.c
+++ b/tools/sched_ext/scx_central.c
@@ -46,14 +46,14 @@ int main(int argc, char **argv)
{
struct scx_central *skel;
struct bpf_link *link;
- __u64 seq = 0;
+ __u64 seq = 0, ecode;
__s32 opt;
cpu_set_t *cpuset;

libbpf_set_print(libbpf_print_fn);
signal(SIGINT, sigint_handler);
signal(SIGTERM, sigint_handler);
-
+restart:
skel = SCX_OPS_OPEN(central_ops, scx_central);

skel->rodata->central_cpu = 0;
@@ -126,7 +126,10 @@ int main(int argc, char **argv)
}

bpf_link__destroy(link);
- UEI_REPORT(skel, uei);
+ ecode = UEI_REPORT(skel, uei);
scx_central__destroy(skel);
+
+ if (UEI_ECODE_RESTART(ecode))
+ goto restart;
return 0;
}
diff --git a/tools/sched_ext/scx_flatcg.c b/tools/sched_ext/scx_flatcg.c
index bb0a832a0cfd..20c5a132b610 100644
--- a/tools/sched_ext/scx_flatcg.c
+++ b/tools/sched_ext/scx_flatcg.c
@@ -127,11 +127,12 @@ int main(int argc, char **argv)
__u64 last_stats[FCG_NR_STATS] = {};
unsigned long seq = 0;
__s32 opt;
+ __u64 ecode;

libbpf_set_print(libbpf_print_fn);
signal(SIGINT, sigint_handler);
signal(SIGTERM, sigint_handler);
-
+restart:
skel = SCX_OPS_OPEN(flatcg_ops, scx_flatcg);

skel->rodata->nr_cpus = libbpf_num_possible_cpus();
@@ -219,7 +220,10 @@ int main(int argc, char **argv)
}

bpf_link__destroy(link);
- UEI_REPORT(skel, uei);
+ ecode = UEI_REPORT(skel, uei);
scx_flatcg__destroy(skel);
+
+ if (UEI_ECODE_RESTART(ecode))
+ goto restart;
return 0;
}
diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
index 7c3b0dcae1e0..c2edc080d7e5 100644
--- a/tools/sched_ext/scx_qmap.bpf.c
+++ b/tools/sched_ext/scx_qmap.bpf.c
@@ -308,11 +308,69 @@ s32 BPF_STRUCT_OPS(qmap_init_task, struct task_struct *p,
return -ENOMEM;
}

+/*
+ * Print out the online and possible CPU map using bpf_printk() as a
+ * demonstration of using the cpumask kfuncs and ops.cpu_on/offline().
+ */
+static void print_cpus(void)
+{
+ const struct cpumask *possible, *online;
+ s32 cpu;
+ char buf[128] = "", *p;
+ int idx;
+
+ if (!__COMPAT_HAS_CPUMASKS)
+ return;
+
+ possible = scx_bpf_get_possible_cpumask();
+ online = scx_bpf_get_online_cpumask();
+
+ idx = 0;
+ bpf_for(cpu, 0, scx_bpf_nr_cpu_ids()) {
+ if (!(p = MEMBER_VPTR(buf, [idx++])))
+ break;
+ if (bpf_cpumask_test_cpu(cpu, online))
+ *p++ = 'O';
+ else if (bpf_cpumask_test_cpu(cpu, possible))
+ *p++ = 'X';
+ else
+ *p++ = ' ';
+
+ if ((cpu & 7) == 7) {
+ if (!(p = MEMBER_VPTR(buf, [idx++])))
+ break;
+ *p++ = '|';
+ }
+ }
+ buf[sizeof(buf) - 1] = '\0';
+
+ scx_bpf_put_cpumask(online);
+ scx_bpf_put_cpumask(possible);
+
+ bpf_printk("CPUS: |%s", buf);
+}
+
+void BPF_STRUCT_OPS(qmap_cpu_online, s32 cpu)
+{
+ bpf_printk("CPU %d coming online", cpu);
+ /* @cpu is already online at this point */
+ print_cpus();
+}
+
+void BPF_STRUCT_OPS(qmap_cpu_offline, s32 cpu)
+{
+ bpf_printk("CPU %d going offline", cpu);
+ /* @cpu is still online at this point */
+ print_cpus();
+}
+
s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init)
{
if (!switch_partial)
__COMPAT_scx_bpf_switch_all();

+ print_cpus();
+
return scx_bpf_create_dsq(SHARED_DSQ, -1);
}

@@ -328,6 +386,8 @@ SCX_OPS_DEFINE(qmap_ops,
.dispatch = (void *)qmap_dispatch,
.cpu_release = (void *)qmap_cpu_release,
.init_task = (void *)qmap_init_task,
+ .cpu_online = (void *)qmap_cpu_online,
+ .cpu_offline = (void *)qmap_cpu_offline,
.init = (void *)qmap_init,
.exit = (void *)qmap_exit,
.timeout_ms = 5000U,
diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c
index 048b31eed17d..e82f58b5c131 100644
--- a/tools/sched_ext/scx_qmap.c
+++ b/tools/sched_ext/scx_qmap.c
@@ -119,5 +119,9 @@ int main(int argc, char **argv)
bpf_link__destroy(link);
UEI_REPORT(skel, uei);
scx_qmap__destroy(skel);
+ /*
+ * scx_qmap implements ops.cpu_on/offline() and doesn't need to restart
+ * on CPU hotplug events.
+ */
return 0;
}
diff --git a/tools/sched_ext/scx_simple.c b/tools/sched_ext/scx_simple.c
index 9ffa8d084228..acee683a3ec9 100644
--- a/tools/sched_ext/scx_simple.c
+++ b/tools/sched_ext/scx_simple.c
@@ -62,11 +62,12 @@ int main(int argc, char **argv)
struct scx_simple *skel;
struct bpf_link *link;
__u32 opt;
+ __u64 ecode;

libbpf_set_print(libbpf_print_fn);
signal(SIGINT, sigint_handler);
signal(SIGTERM, sigint_handler);
-
+restart:
skel = SCX_OPS_OPEN(simple_ops, scx_simple);

while ((opt = getopt(argc, argv, "vh")) != -1) {
@@ -93,7 +94,10 @@ int main(int argc, char **argv)
}

bpf_link__destroy(link);
- UEI_REPORT(skel, uei);
+ ecode = UEI_REPORT(skel, uei);
scx_simple__destroy(skel);
+
+ if (UEI_ECODE_RESTART(ecode))
+ goto restart;
return 0;
}
--
2.44.0


2024-05-01 15:21:22

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 33/39] sched_ext: Bypass BPF scheduler while PM events are in progress

PM operations freeze userspace. Some BPF schedulers have active userspace
components and may not behave as expected across PM events. While the system
is frozen, nothing too interesting is happening in terms of scheduling and
we can get by just fine with the fallback FIFO behavior. Let's make things
easier by always bypassing the BPF scheduler while PM events are in
progress.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
---
kernel/sched/ext.c | 34 ++++++++++++++++++++++++++++++++++
1 file changed, 34 insertions(+)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index ed2452d42862..619ae7e814be 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -5010,6 +5010,34 @@ void print_scx_info(const char *log_lvl, struct task_struct *p)
runnable_at_buf);
}

+static int scx_pm_handler(struct notifier_block *nb, unsigned long event, void *ptr)
+{
+ /*
+ * SCX schedulers often have userspace components which are sometimes
+ * involved in critial scheduling paths. PM operations involve freezing
+ * userspace which can lead to scheduling misbehaviors including stalls.
+ * Let's bypass while PM operations are in progress.
+ */
+ switch (event) {
+ case PM_HIBERNATION_PREPARE:
+ case PM_SUSPEND_PREPARE:
+ case PM_RESTORE_PREPARE:
+ scx_ops_bypass(true);
+ break;
+ case PM_POST_HIBERNATION:
+ case PM_POST_SUSPEND:
+ case PM_POST_RESTORE:
+ scx_ops_bypass(false);
+ break;
+ }
+
+ return NOTIFY_OK;
+}
+
+static struct notifier_block scx_pm_notifier = {
+ .notifier_call = scx_pm_handler,
+};
+
void __init init_sched_ext_class(void)
{
s32 cpu, v;
@@ -5902,6 +5930,12 @@ static int __init scx_init(void)
return ret;
}

+ ret = register_pm_notifier(&scx_pm_notifier);
+ if (ret) {
+ pr_err("sched_ext: Failed to register PM notifier (%d)\n", ret);
+ return ret;
+ }
+
scx_kset = kset_create_and_add("sched_ext", &scx_uevent_ops, kernel_kobj);
if (!scx_kset) {
pr_err("sched_ext: Failed to create /sys/kernel/sched_ext\n");
--
2.44.0


2024-05-01 15:22:02

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 35/39] sched_ext: Add vtime-ordered priority queue to dispatch_q's

Currently, a dsq is always a FIFO. A task which is dispatched earlier gets
consumed or executed earlier. While this is sufficient when dsq's are used
as simple staging areas for tasks which are ready to execute, it'd make
dsq's a lot more useful if they could implement custom ordering.

This patch adds a vtime-ordered priority queue to dsq's. When the BPF
scheduler dispatches a task with the new scx_bpf_dispatch_vtime() helper, it
can specify the vtime that the task should be inserted at, and the task is
inserted into the dsq's priority queue, which is ordered according to
time_before64() comparison of the vtime values.
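
For illustration, a condensed enqueue path in the style of the scx_simple
update included below (SHARED_DSQ, vtime_now and vtime_before() are
scheduler-local names, not kernel API):

    void BPF_STRUCT_OPS(example_enqueue, struct task_struct *p, u64 enq_flags)
    {
            u64 vtime = p->scx.dsq_vtime;

            /* cap the budget an idling task can accumulate to one slice */
            if (vtime_before(vtime, vtime_now - SCX_SLICE_DFL))
                    vtime = vtime_now - SCX_SLICE_DFL;

            /* queue on the vtime-ordered priority queue of a user-created DSQ */
            scx_bpf_dispatch_vtime(p, SHARED_DSQ, SCX_SLICE_DFL, vtime,
                                   enq_flags);
    }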

A DSQ can either be a FIFO or priority queue and automatically switches
between the two depending on whether scx_bpf_dispatch() or
scx_bpf_dispatch_vtime() is used. Using the wrong variant while the DSQ
already has the other type queued is not allowed and triggers an ops error.
Built-in DSQs must always be FIFOs.
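
Since built-in DSQs stay FIFOs, vtime dispatch targets a custom DSQ which the
scheduler creates and consumes itself. A minimal sketch, again mirroring the
scx_simple update below (SHARED_DSQ is an arbitrary scheduler-chosen ID):

    #define SHARED_DSQ 0

    s32 BPF_STRUCT_OPS_SLEEPABLE(example_init)
    {
            /* a custom DSQ must be created before anything is dispatched to it */
            return scx_bpf_create_dsq(SHARED_DSQ, -1);
    }

    void BPF_STRUCT_OPS(example_dispatch, s32 cpu, struct task_struct *prev)
    {
            /* in PRIQ mode, tasks are consumed in p->scx.dsq_vtime order */
            scx_bpf_consume(SHARED_DSQ);
    }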

This makes it easy for BPF schedulers to implement proper vtime-based
scheduling within each dsq, efficiently and at a negligible cost in terms of
code complexity and overhead.

scx_simple and scx_flatcg are updated to default to weighted vtime
scheduling (the latter within each cgroup). FIFO scheduling can be selected
with the -f option.

v4: - As allowing mixing priority queue and FIFO on the same DSQ sometimes
led to unexpected starvations, DSQs now error out if both modes are
used at the same time and the built-in DSQs are no longer allowed to
be priority queues.

- Explicit type struct scx_dsq_node added to contain fields needed to be
linked on DSQs. This will be used to implement stateful iterator.

- Tasks are now always linked on dsq->list whether the DSQ is in FIFO or
PRIQ mode. This confines PRIQ related complexities to the enqueue and
dequeue paths. Other paths only need to look at dsq->list. This will
also ease implementing BPF iterator.

- Print p->scx.dsq_flags in debug dump.

v3: - SCX_TASK_DSQ_ON_PRIQ flag is moved from p->scx.flags into its own
p->scx.dsq_flags. The flag is protected with the dsq lock unlike other
flags in p->scx.flags. This led to flag corruption in some cases.

- Add comments explaining the interaction between using consumption of
p->scx.slice to determine vtime progress and yielding.

v2: - p->scx.dsq_vtime was not initialized on load or across cgroup
migrations leading to some tasks being stalled for extended period of
time depending on how saturated the machine is. Fixed.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
---
include/linux/sched/ext.h | 29 ++++-
init/init_task.c | 2 +-
kernel/sched/ext.c | 150 ++++++++++++++++++++---
tools/sched_ext/include/scx/common.bpf.h | 1 +
tools/sched_ext/scx_flatcg.bpf.c | 80 +++++++++++-
tools/sched_ext/scx_flatcg.c | 8 +-
tools/sched_ext/scx_simple.bpf.c | 90 +++++++++++++-
tools/sched_ext/scx_simple.c | 8 +-
8 files changed, 342 insertions(+), 26 deletions(-)

diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index d89b4d907b26..8c6299915800 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -49,12 +49,15 @@ enum scx_dsq_id_flags {
};

/*
- * Dispatch queue (dsq) is a simple FIFO which is used to buffer between the
- * scheduler core and the BPF scheduler. See the documentation for more details.
+ * A dispatch queue (DSQ) can be either a FIFO or p->scx.dsq_vtime ordered
+ * queue. A built-in DSQ is always a FIFO. The built-in local DSQs are used to
+ * buffer between the scheduler core and the BPF scheduler. See the
+ * documentation for more details.
*/
struct scx_dispatch_q {
raw_spinlock_t lock;
struct list_head list; /* tasks in dispatch order */
+ struct rb_root priq; /* used to order by p->scx.dsq_vtime */
u32 nr;
u64 id;
struct rhash_head hash_node;
@@ -86,6 +89,11 @@ enum scx_task_state {
SCX_TASK_NR_STATES,
};

+/* scx_entity.dsq_flags */
+enum scx_ent_dsq_flags {
+ SCX_TASK_DSQ_ON_PRIQ = 1 << 0, /* task is queued on the priority queue of a dsq */
+};
+
/*
* Mask bits for scx_entity.kf_mask. Not all kfuncs can be called from
* everywhere and the following bits track which kfunc sets are currently
@@ -111,13 +119,19 @@ enum scx_kf_mask {
__SCX_KF_TERMINAL = SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU | SCX_KF_REST,
};

+struct scx_dsq_node {
+ struct list_head list; /* dispatch order */
+ struct rb_node priq; /* p->scx.dsq_vtime order */
+ u32 flags; /* SCX_TASK_DSQ_* flags */
+};
+
/*
* The following is embedded in task_struct and contains all fields necessary
* for a task to be scheduled by SCX.
*/
struct sched_ext_entity {
struct scx_dispatch_q *dsq;
- struct list_head dsq_node;
+ struct scx_dsq_node dsq_node; /* protected by dsq lock */
u32 flags; /* protected by rq lock */
u32 weight;
s32 sticky_cpu;
@@ -149,6 +163,15 @@ struct sched_ext_entity {
*/
u64 slice;

+ /*
+ * Used to order tasks when dispatching to the vtime-ordered priority
+ * queue of a dsq. This is usually set through scx_bpf_dispatch_vtime()
+ * but can also be modified directly by the BPF scheduler. Modifying it
+ * while a task is queued on a dsq may mangle the ordering and is not
+ * recommended.
+ */
+ u64 dsq_vtime;
+
/*
* If set, reject future sched_setscheduler(2) calls updating the policy
* to %SCHED_EXT with -%EACCES.
diff --git a/init/init_task.c b/init/init_task.c
index b85635b7eed0..6bac934219d8 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -101,7 +101,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
#endif
#ifdef CONFIG_SCHED_CLASS_EXT
.scx = {
- .dsq_node = LIST_HEAD_INIT(init_task.scx.dsq_node),
+ .dsq_node.list = LIST_HEAD_INIT(init_task.scx.dsq_node.list),
.sticky_cpu = -1,
.holding_cpu = -1,
.runnable_node = LIST_HEAD_INIT(init_task.scx.runnable_node),
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 83a56b5bb72b..13ba4d3d39bd 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -685,6 +685,7 @@ enum scx_enq_flags {
__SCX_ENQ_INTERNAL_MASK = 0xffLLU << 56,

SCX_ENQ_CLEAR_OPSS = 1LLU << 56,
+ SCX_ENQ_DSQ_PRIQ = 1LLU << 57,
};

enum scx_deq_flags {
@@ -1369,6 +1370,17 @@ static void update_curr_scx(struct rq *rq)
}
}

+static bool scx_dsq_priq_less(struct rb_node *node_a,
+ const struct rb_node *node_b)
+{
+ const struct task_struct *a =
+ container_of(node_a, struct task_struct, scx.dsq_node.priq);
+ const struct task_struct *b =
+ container_of(node_b, struct task_struct, scx.dsq_node.priq);
+
+ return time_before64(a->scx.dsq_vtime, b->scx.dsq_vtime);
+}
+
static void dsq_mod_nr(struct scx_dispatch_q *dsq, s32 delta)
{
/* scx_bpf_dsq_nr_queued() reads ->nr without locking, use WRITE_ONCE() */
@@ -1380,7 +1392,9 @@ static void dispatch_enqueue(struct scx_dispatch_q *dsq, struct task_struct *p,
{
bool is_local = dsq->id == SCX_DSQ_LOCAL;

- WARN_ON_ONCE(p->scx.dsq || !list_empty(&p->scx.dsq_node));
+ WARN_ON_ONCE(p->scx.dsq || !list_empty(&p->scx.dsq_node.list));
+ WARN_ON_ONCE((p->scx.dsq_node.flags & SCX_TASK_DSQ_ON_PRIQ) ||
+ !RB_EMPTY_NODE(&p->scx.dsq_node.priq));

if (!is_local) {
raw_spin_lock(&dsq->lock);
@@ -1393,10 +1407,59 @@ static void dispatch_enqueue(struct scx_dispatch_q *dsq, struct task_struct *p,
}
}

- if (enq_flags & (SCX_ENQ_HEAD | SCX_ENQ_PREEMPT))
- list_add(&p->scx.dsq_node, &dsq->list);
- else
- list_add_tail(&p->scx.dsq_node, &dsq->list);
+ if (unlikely((dsq->id & SCX_DSQ_FLAG_BUILTIN) &&
+ (enq_flags & SCX_ENQ_DSQ_PRIQ))) {
+ /*
+ * SCX_DSQ_LOCAL and SCX_DSQ_GLOBAL DSQs always consume from
+ * their FIFO queues. To avoid confusion and accidentally
+ * starving vtime-dispatched tasks by FIFO-dispatched tasks, we
+ * disallow any internal DSQ from doing vtime ordering of
+ * tasks.
+ */
+ scx_ops_error("cannot use vtime ordering for built-in DSQs");
+ enq_flags &= ~SCX_ENQ_DSQ_PRIQ;
+ }
+
+ if (enq_flags & SCX_ENQ_DSQ_PRIQ) {
+ struct rb_node *rbp;
+
+ /*
+ * A PRIQ DSQ shouldn't be using FIFO enqueueing. As tasks are
+ * linked to both the rbtree and list on PRIQs, this can only be
+ * tested easily when adding the first task.
+ */
+ if (unlikely(RB_EMPTY_ROOT(&dsq->priq) &&
+ !list_empty(&dsq->list)))
+ scx_ops_error("DSQ ID 0x%016llx already had FIFO-enqueued tasks",
+ dsq->id);
+
+ p->scx.dsq_node.flags |= SCX_TASK_DSQ_ON_PRIQ;
+ rb_add(&p->scx.dsq_node.priq, &dsq->priq, scx_dsq_priq_less);
+
+ /*
+ * Find the previous task and insert after it on the list so
+ * that @dsq->list is vtime ordered.
+ */
+ rbp = rb_prev(&p->scx.dsq_node.priq);
+ if (rbp) {
+ struct task_struct *prev =
+ container_of(rbp, struct task_struct,
+ scx.dsq_node.priq);
+ list_add(&p->scx.dsq_node.list, &prev->scx.dsq_node.list);
+ } else {
+ list_add(&p->scx.dsq_node.list, &dsq->list);
+ }
+ } else {
+ /* a FIFO DSQ shouldn't be using PRIQ enqueuing */
+ if (unlikely(!RB_EMPTY_ROOT(&dsq->priq)))
+ scx_ops_error("DSQ ID 0x%016llx already had PRIQ-enqueued tasks",
+ dsq->id);
+
+ if (enq_flags & (SCX_ENQ_HEAD | SCX_ENQ_PREEMPT))
+ list_add(&p->scx.dsq_node.list, &dsq->list);
+ else
+ list_add_tail(&p->scx.dsq_node.list, &dsq->list);
+ }

dsq_mod_nr(dsq, 1);
p->scx.dsq = dsq;
@@ -1435,13 +1498,30 @@ static void dispatch_enqueue(struct scx_dispatch_q *dsq, struct task_struct *p,
}
}

+static void task_unlink_from_dsq(struct task_struct *p,
+ struct scx_dispatch_q *dsq)
+{
+ if (p->scx.dsq_node.flags & SCX_TASK_DSQ_ON_PRIQ) {
+ rb_erase(&p->scx.dsq_node.priq, &dsq->priq);
+ RB_CLEAR_NODE(&p->scx.dsq_node.priq);
+ p->scx.dsq_node.flags &= ~SCX_TASK_DSQ_ON_PRIQ;
+ }
+
+ list_del_init(&p->scx.dsq_node.list);
+}
+
+static bool task_linked_on_dsq(struct task_struct *p)
+{
+ return !list_empty(&p->scx.dsq_node.list);
+}
+
static void dispatch_dequeue(struct scx_rq *scx_rq, struct task_struct *p)
{
struct scx_dispatch_q *dsq = p->scx.dsq;
bool is_local = dsq == &scx_rq->local_dsq;

if (!dsq) {
- WARN_ON_ONCE(!list_empty(&p->scx.dsq_node));
+ WARN_ON_ONCE(task_linked_on_dsq(p));
/*
* When dispatching directly from the BPF scheduler to a local
* DSQ, the task isn't associated with any DSQ but
@@ -1462,8 +1542,8 @@ static void dispatch_dequeue(struct scx_rq *scx_rq, struct task_struct *p)
*/
if (p->scx.holding_cpu < 0) {
/* @p must still be on @dsq, dequeue */
- WARN_ON_ONCE(list_empty(&p->scx.dsq_node));
- list_del_init(&p->scx.dsq_node);
+ WARN_ON_ONCE(!task_linked_on_dsq(p));
+ task_unlink_from_dsq(p, dsq);
dsq_mod_nr(dsq, -1);
} else {
/*
@@ -1472,7 +1552,7 @@ static void dispatch_dequeue(struct scx_rq *scx_rq, struct task_struct *p)
* holding_cpu which tells dispatch_to_local_dsq() that it lost
* the race.
*/
- WARN_ON_ONCE(!list_empty(&p->scx.dsq_node));
+ WARN_ON_ONCE(task_linked_on_dsq(p));
p->scx.holding_cpu = -1;
}
p->scx.dsq = NULL;
@@ -1975,7 +2055,8 @@ static void consume_local_task(struct rq *rq, struct scx_dispatch_q *dsq,

/* @dsq is locked and @p is on this rq */
WARN_ON_ONCE(p->scx.holding_cpu >= 0);
- list_move_tail(&p->scx.dsq_node, &scx_rq->local_dsq.list);
+ task_unlink_from_dsq(p, dsq);
+ list_add_tail(&p->scx.dsq_node.list, &scx_rq->local_dsq.list);
dsq_mod_nr(dsq, -1);
dsq_mod_nr(&scx_rq->local_dsq, 1);
p->scx.dsq = &scx_rq->local_dsq;
@@ -2018,7 +2099,7 @@ static bool consume_remote_task(struct rq *rq, struct rq_flags *rf,
* move_task_to_local_dsq().
*/
WARN_ON_ONCE(p->scx.holding_cpu >= 0);
- list_del_init(&p->scx.dsq_node);
+ task_unlink_from_dsq(p, dsq);
dsq_mod_nr(dsq, -1);
p->scx.holding_cpu = raw_smp_processor_id();
raw_spin_unlock(&dsq->lock);
@@ -2050,7 +2131,7 @@ static bool consume_dispatch_q(struct rq *rq, struct rq_flags *rf,

raw_spin_lock(&dsq->lock);

- list_for_each_entry(p, &dsq->list, scx.dsq_node) {
+ list_for_each_entry(p, &dsq->list, scx.dsq_node.list) {
struct rq *task_rq = task_rq(p);

if (rq == task_rq) {
@@ -2570,7 +2651,7 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p)
static struct task_struct *first_local_task(struct rq *rq)
{
return list_first_entry_or_null(&rq->scx.local_dsq.list,
- struct task_struct, scx.dsq_node);
+ struct task_struct, scx.dsq_node.list);
}

static struct task_struct *pick_next_task_scx(struct rq *rq)
@@ -3267,7 +3348,8 @@ void init_scx_entity(struct sched_ext_entity *scx)
*/
memset(scx, 0, offsetof(struct sched_ext_entity, tasks_node));

- INIT_LIST_HEAD(&scx->dsq_node);
+ INIT_LIST_HEAD(&scx->dsq_node.list);
+ RB_CLEAR_NODE(&scx->dsq_node.priq);
scx->sticky_cpu = -1;
scx->holding_cpu = -1;
INIT_LIST_HEAD(&scx->runnable_node);
@@ -4294,9 +4376,10 @@ static void scx_dump_task(struct seq_buf *s, struct task_struct *p, char marker,
seq_buf_printf(s, "\n %c%c %s[%d] %+ldms\n",
marker, task_state_to_char(p), p->comm, p->pid,
jiffies_delta_msecs(p->scx.runnable_at, now));
- seq_buf_printf(s, " scx_state/flags=%u/0x%x ops_state/qseq=%lu/%lu\n",
+ seq_buf_printf(s, " scx_state/flags=%u/0x%x dsq_flags=0x%x ops_state/qseq=%lu/%lu\n",
scx_get_task_state(p),
p->scx.flags & ~SCX_TASK_STATE_MASK,
+ p->scx.dsq_node.flags,
ops_state & SCX_OPSS_STATE_MASK,
ops_state >> SCX_OPSS_QSEQ_SHIFT);
seq_buf_printf(s, " sticky/holding_cpu=%d/%d dsq_id=%s\n",
@@ -4844,6 +4927,9 @@ static int bpf_scx_btf_struct_access(struct bpf_verifier_log *log,
if (off >= offsetof(struct task_struct, scx.slice) &&
off + size <= offsetofend(struct task_struct, scx.slice))
return SCALAR_VALUE;
+ if (off >= offsetof(struct task_struct, scx.dsq_vtime) &&
+ off + size <= offsetofend(struct task_struct, scx.dsq_vtime))
+ return SCALAR_VALUE;
if (off >= offsetof(struct task_struct, scx.disallow) &&
off + size <= offsetofend(struct task_struct, scx.disallow))
return SCALAR_VALUE;
@@ -5483,10 +5569,44 @@ __bpf_kfunc void scx_bpf_dispatch(struct task_struct *p, u64 dsq_id, u64 slice,
scx_dispatch_commit(p, dsq_id, enq_flags);
}

+/**
+ * scx_bpf_dispatch_vtime - Dispatch a task into the vtime priority queue of a DSQ
+ * @p: task_struct to dispatch
+ * @dsq_id: DSQ to dispatch to
+ * @slice: duration @p can run for in nsecs
+ * @vtime: @p's ordering inside the vtime-sorted queue of the target DSQ
+ * @enq_flags: SCX_ENQ_*
+ *
+ * Dispatch @p into the vtime priority queue of the DSQ identified by @dsq_id.
+ * Tasks queued into the priority queue are ordered by @vtime and always
+ * consumed after the tasks in the FIFO queue. All other aspects are identical
+ * to scx_bpf_dispatch().
+ *
+ * @vtime ordering is according to time_before64() which considers wrapping. A
+ * numerically larger vtime may indicate an earlier position in the ordering and
+ * vice-versa.
+ */
+__bpf_kfunc void scx_bpf_dispatch_vtime(struct task_struct *p, u64 dsq_id,
+ u64 slice, u64 vtime, u64 enq_flags)
+{
+ if (!scx_dispatch_preamble(p, enq_flags))
+ return;
+
+ if (slice)
+ p->scx.slice = slice;
+ else
+ p->scx.slice = p->scx.slice ?: 1;
+
+ p->scx.dsq_vtime = vtime;
+
+ scx_dispatch_commit(p, dsq_id, enq_flags | SCX_ENQ_DSQ_PRIQ);
+}
+
__bpf_kfunc_end_defs();

BTF_KFUNCS_START(scx_kfunc_ids_enqueue_dispatch)
BTF_ID_FLAGS(func, scx_bpf_dispatch, KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_dispatch_vtime, KF_RCU)
BTF_KFUNCS_END(scx_kfunc_ids_enqueue_dispatch)

static const struct btf_kfunc_id_set scx_kfunc_set_enqueue_dispatch = {
diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
index a3979e13aade..0b26046339ee 100644
--- a/tools/sched_ext/include/scx/common.bpf.h
+++ b/tools/sched_ext/include/scx/common.bpf.h
@@ -31,6 +31,7 @@ static inline void ___vmlinux_h_sanity_check___(void)
s32 scx_bpf_create_dsq(u64 dsq_id, s32 node) __ksym;
s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, bool *is_idle) __ksym;
void scx_bpf_dispatch(struct task_struct *p, u64 dsq_id, u64 slice, u64 enq_flags) __ksym;
+void scx_bpf_dispatch_vtime(struct task_struct *p, u64 dsq_id, u64 slice, u64 vtime, u64 enq_flags) __ksym;
u32 scx_bpf_dispatch_nr_slots(void) __ksym;
void scx_bpf_dispatch_cancel(void) __ksym;
bool scx_bpf_consume(u64 dsq_id) __ksym;
diff --git a/tools/sched_ext/scx_flatcg.bpf.c b/tools/sched_ext/scx_flatcg.bpf.c
index 20a53e087d6c..a9c2fa41a1bb 100644
--- a/tools/sched_ext/scx_flatcg.bpf.c
+++ b/tools/sched_ext/scx_flatcg.bpf.c
@@ -38,6 +38,10 @@
* this isn't a real concern especially given the performance gain. Also, there
* are ways to mitigate the problem further by e.g. introducing an extra
* scheduling layer on cgroup delegation boundaries.
+ *
+ * The scheduler first picks the cgroup to run and then schedules the tasks
+ * within by using nested weighted vtime scheduling by default. The
+ * cgroup-internal scheduling can be switched to FIFO with the -f option.
*/
#include <scx/common.bpf.h>
#include "scx_flatcg.h"
@@ -51,6 +55,7 @@ char _license[] SEC("license") = "GPL";

const volatile u32 nr_cpus = 32; /* !0 for veristat, set during init */
const volatile u64 cgrp_slice_ns = SCX_SLICE_DFL;
+const volatile bool fifo_sched;

u64 cvtime_now;
UEI_DEFINE(uei);
@@ -387,7 +392,21 @@ void BPF_STRUCT_OPS(fcg_enqueue, struct task_struct *p, u64 enq_flags)
if (!cgc)
goto out_release;

- scx_bpf_dispatch(p, cgrp->kn->id, SCX_SLICE_DFL, enq_flags);
+ if (fifo_sched) {
+ scx_bpf_dispatch(p, cgrp->kn->id, SCX_SLICE_DFL, enq_flags);
+ } else {
+ u64 tvtime = p->scx.dsq_vtime;
+
+ /*
+ * Limit the amount of budget that an idling task can accumulate
+ * to one slice.
+ */
+ if (vtime_before(tvtime, cgc->tvtime_now - SCX_SLICE_DFL))
+ tvtime = cgc->tvtime_now - SCX_SLICE_DFL;
+
+ scx_bpf_dispatch_vtime(p, cgrp->kn->id, SCX_SLICE_DFL,
+ tvtime, enq_flags);
+ }

cgrp_enqueued(cgrp, cgc);
out_release:
@@ -499,12 +518,48 @@ void BPF_STRUCT_OPS(fcg_runnable, struct task_struct *p, u64 enq_flags)
bpf_cgroup_release(cgrp);
}

+void BPF_STRUCT_OPS(fcg_running, struct task_struct *p)
+{
+ struct cgroup *cgrp;
+ struct fcg_cgrp_ctx *cgc;
+
+ if (fifo_sched)
+ return;
+
+ cgrp = scx_bpf_task_cgroup(p);
+ cgc = find_cgrp_ctx(cgrp);
+ if (cgc) {
+ /*
+ * @cgc->tvtime_now always progresses forward as tasks start
+ * executing. The test and update can be performed concurrently
+ * from multiple CPUs and thus racy. Any error should be
+ * contained and temporary. Let's just live with it.
+ */
+ if (vtime_before(cgc->tvtime_now, p->scx.dsq_vtime))
+ cgc->tvtime_now = p->scx.dsq_vtime;
+ }
+ bpf_cgroup_release(cgrp);
+}
+
void BPF_STRUCT_OPS(fcg_stopping, struct task_struct *p, bool runnable)
{
struct fcg_task_ctx *taskc;
struct cgroup *cgrp;
struct fcg_cgrp_ctx *cgc;

+ /*
+ * Scale the execution time by the inverse of the weight and charge.
+ *
+ * Note that the default yield implementation yields by setting
+ * @p->scx.slice to zero and the following would treat the yielding task
+ * as if it has consumed all its slice. If this penalizes yielding tasks
+ * too much, determine the execution time by taking explicit timestamps
+ * instead of depending on @p->scx.slice.
+ */
+ if (!fifo_sched)
+ p->scx.dsq_vtime +=
+ (SCX_SLICE_DFL - p->scx.slice) * 100 / p->scx.weight;
+
taskc = bpf_task_storage_get(&task_ctx, p, 0, 0);
if (!taskc) {
scx_bpf_error("task_ctx lookup failed");
@@ -742,6 +797,7 @@ s32 BPF_STRUCT_OPS(fcg_init_task, struct task_struct *p,
struct scx_init_task_args *args)
{
struct fcg_task_ctx *taskc;
+ struct fcg_cgrp_ctx *cgc;

/*
* @p is new. Let's ensure that its task_ctx is available. We can sleep
@@ -753,6 +809,12 @@ s32 BPF_STRUCT_OPS(fcg_init_task, struct task_struct *p,
return -ENOMEM;

taskc->bypassed_at = 0;
+
+ if (!(cgc = find_cgrp_ctx(args->cgroup)))
+ return -ENOENT;
+
+ p->scx.dsq_vtime = cgc->tvtime_now;
+
return 0;
}

@@ -840,6 +902,20 @@ void BPF_STRUCT_OPS(fcg_cgroup_exit, struct cgroup *cgrp)
scx_bpf_destroy_dsq(cgid);
}

+void BPF_STRUCT_OPS(fcg_cgroup_move, struct task_struct *p,
+ struct cgroup *from, struct cgroup *to)
+{
+ struct fcg_cgrp_ctx *from_cgc, *to_cgc;
+ s64 vtime_delta;
+
+ /* find_cgrp_ctx() triggers scx_ops_error() on lookup failures */
+ if (!(from_cgc = find_cgrp_ctx(from)) || !(to_cgc = find_cgrp_ctx(to)))
+ return;
+
+ vtime_delta = p->scx.dsq_vtime - from_cgc->tvtime_now;
+ p->scx.dsq_vtime = to_cgc->tvtime_now + vtime_delta;
+}
+
void BPF_STRUCT_OPS(fcg_exit, struct scx_exit_info *ei)
{
UEI_RECORD(uei, ei);
@@ -850,12 +926,14 @@ SCX_OPS_DEFINE(flatcg_ops,
.enqueue = (void *)fcg_enqueue,
.dispatch = (void *)fcg_dispatch,
.runnable = (void *)fcg_runnable,
+ .running = (void *)fcg_running,
.stopping = (void *)fcg_stopping,
.quiescent = (void *)fcg_quiescent,
.init_task = (void *)fcg_init_task,
.cgroup_set_weight = (void *)fcg_cgroup_set_weight,
.cgroup_init = (void *)fcg_cgroup_init,
.cgroup_exit = (void *)fcg_cgroup_exit,
+ .cgroup_move = (void *)fcg_cgroup_move,
.exit = (void *)fcg_exit,
.flags = SCX_OPS_CGROUP_KNOB_WEIGHT | SCX_OPS_ENQ_EXITING,
.name = "flatcg");
diff --git a/tools/sched_ext/scx_flatcg.c b/tools/sched_ext/scx_flatcg.c
index 20c5a132b610..1143d5eb389a 100644
--- a/tools/sched_ext/scx_flatcg.c
+++ b/tools/sched_ext/scx_flatcg.c
@@ -26,10 +26,11 @@ const char help_fmt[] =
"\n"
"See the top-level comment in .bpf.c for more details.\n"
"\n"
-"Usage: %s [-s SLICE_US] [-i INTERVAL] [-v]\n"
+"Usage: %s [-s SLICE_US] [-i INTERVAL] [-f] [-v]\n"
"\n"
" -s SLICE_US Override slice duration\n"
" -i INTERVAL Report interval\n"
+" -f Use FIFO scheduling instead of weighted vtime scheduling\n"
" -v Print libbpf debug messages\n"
" -h Display this help and exit\n";

@@ -137,7 +138,7 @@ int main(int argc, char **argv)

skel->rodata->nr_cpus = libbpf_num_possible_cpus();

- while ((opt = getopt(argc, argv, "s:i:dvh")) != -1) {
+ while ((opt = getopt(argc, argv, "s:i:dfvh")) != -1) {
double v;

switch (opt) {
@@ -153,6 +154,9 @@ int main(int argc, char **argv)
case 'd':
dump_cgrps = true;
break;
+ case 'f':
+ skel->rodata->fifo_sched = true;
+ break;
case 'v':
verbose = true;
break;
diff --git a/tools/sched_ext/scx_simple.bpf.c b/tools/sched_ext/scx_simple.bpf.c
index 6bb13a3c801b..535fe52f5b73 100644
--- a/tools/sched_ext/scx_simple.bpf.c
+++ b/tools/sched_ext/scx_simple.bpf.c
@@ -2,11 +2,20 @@
/*
* A simple scheduler.
*
- * A simple global FIFO scheduler. It also demonstrates the following niceties.
+ * By default, it operates as a simple global weighted vtime scheduler and can
+ * be switched to FIFO scheduling. It also demonstrates the following niceties.
*
* - Statistics tracking how many tasks are queued to local and global dsq's.
* - Termination notification for userspace.
*
+ * While very simple, this scheduler should work reasonably well on CPUs with a
+ * uniform L3 cache topology. While preemption is not implemented, the fact that
+ * the scheduling queue is shared across all CPUs means that whatever is at the
+ * front of the queue is likely to be executed fairly quickly given a
+ * sufficient number of CPUs. The FIFO scheduling mode may be beneficial to some workloads
+ * but comes with the usual problems with FIFO scheduling where saturating
+ * threads can easily drown out interactive ones.
+ *
* Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
* Copyright (c) 2022 Tejun Heo <[email protected]>
* Copyright (c) 2022 David Vernet <[email protected]>
@@ -15,8 +24,13 @@

char _license[] SEC("license") = "GPL";

+const volatile bool fifo_sched;
+
+static u64 vtime_now;
UEI_DEFINE(uei);

+#define SHARED_DSQ 0
+
struct {
__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
__uint(key_size, sizeof(u32));
@@ -31,6 +45,11 @@ static void stat_inc(u32 idx)
(*cnt_p)++;
}

+static inline bool vtime_before(u64 a, u64 b)
+{
+ return (s64)(a - b) < 0;
+}
+
s32 BPF_STRUCT_OPS(simple_select_cpu, struct task_struct *p, s32 prev_cpu, u64 wake_flags)
{
bool is_idle = false;
@@ -48,7 +67,69 @@ s32 BPF_STRUCT_OPS(simple_select_cpu, struct task_struct *p, s32 prev_cpu, u64 w
void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags)
{
stat_inc(1); /* count global queueing */
- scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
+
+ if (fifo_sched) {
+ scx_bpf_dispatch(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);
+ } else {
+ u64 vtime = p->scx.dsq_vtime;
+
+ /*
+ * Limit the amount of budget that an idling task can accumulate
+ * to one slice.
+ */
+ if (vtime_before(vtime, vtime_now - SCX_SLICE_DFL))
+ vtime = vtime_now - SCX_SLICE_DFL;
+
+ scx_bpf_dispatch_vtime(p, SHARED_DSQ, SCX_SLICE_DFL, vtime,
+ enq_flags);
+ }
+}
+
+void BPF_STRUCT_OPS(simple_dispatch, s32 cpu, struct task_struct *prev)
+{
+ scx_bpf_consume(SHARED_DSQ);
+}
+
+void BPF_STRUCT_OPS(simple_running, struct task_struct *p)
+{
+ if (fifo_sched)
+ return;
+
+ /*
+ * Global vtime always progresses forward as tasks start executing. The
+ * test and update can be performed concurrently from multiple CPUs and
+ * thus racy. Any error should be contained and temporary. Let's just
+ * live with it.
+ */
+ if (vtime_before(vtime_now, p->scx.dsq_vtime))
+ vtime_now = p->scx.dsq_vtime;
+}
+
+void BPF_STRUCT_OPS(simple_stopping, struct task_struct *p, bool runnable)
+{
+ if (fifo_sched)
+ return;
+
+ /*
+ * Scale the execution time by the inverse of the weight and charge.
+ *
+ * Note that the default yield implementation yields by setting
+ * @p->scx.slice to zero and the following would treat the yielding task
+ * as if it has consumed all its slice. If this penalizes yielding tasks
+ * too much, determine the execution time by taking explicit timestamps
+ * instead of depending on @p->scx.slice.
+ */
+ p->scx.dsq_vtime += (SCX_SLICE_DFL - p->scx.slice) * 100 / p->scx.weight;
+}
+
+void BPF_STRUCT_OPS(simple_enable, struct task_struct *p)
+{
+ p->scx.dsq_vtime = vtime_now;
+}
+
+s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
+{
+ return scx_bpf_create_dsq(SHARED_DSQ, -1);
}

void BPF_STRUCT_OPS(simple_exit, struct scx_exit_info *ei)
@@ -59,5 +140,10 @@ void BPF_STRUCT_OPS(simple_exit, struct scx_exit_info *ei)
SCX_OPS_DEFINE(simple_ops,
.select_cpu = (void *)simple_select_cpu,
.enqueue = (void *)simple_enqueue,
+ .dispatch = (void *)simple_dispatch,
+ .running = (void *)simple_running,
+ .stopping = (void *)simple_stopping,
+ .enable = (void *)simple_enable,
+ .init = (void *)simple_init,
.exit = (void *)simple_exit,
.name = "simple");
diff --git a/tools/sched_ext/scx_simple.c b/tools/sched_ext/scx_simple.c
index acee683a3ec9..b88c058090b6 100644
--- a/tools/sched_ext/scx_simple.c
+++ b/tools/sched_ext/scx_simple.c
@@ -17,8 +17,9 @@ const char help_fmt[] =
"\n"
"See the top-level comment in .bpf.c for more details.\n"
"\n"
-"Usage: %s [-v]\n"
+"Usage: %s [-f] [-v]\n"
"\n"
+" -f Use FIFO scheduling instead of weighted vtime scheduling\n"
" -v Print libbpf debug messages\n"
" -h Display this help and exit\n";

@@ -70,8 +71,11 @@ int main(int argc, char **argv)
restart:
skel = SCX_OPS_OPEN(simple_ops, scx_simple);

- while ((opt = getopt(argc, argv, "vh")) != -1) {
+ while ((opt = getopt(argc, argv, "fvh")) != -1) {
switch (opt) {
+ case 'f':
+ skel->rodata->fifo_sched = true;
+ break;
case 'v':
verbose = true;
break;
--
2.44.0


2024-05-01 15:22:17

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 37/39] sched_ext: Add cpuperf support

This patch adds CPU performance monitoring and scaling support by
integrating into schedutil. The following kfuncs are added:

- scx_bpf_cpuperf_cap(): Query the relative performance capacity of
different CPUs in the system.

- scx_bpf_cpuperf_cur(): Query the current performance level of a CPU
relative to its max performance.

- scx_bpf_cpuperf_set(): Set the current target performance level of a CPU.

This gives the BPF scheduler direct control over CPU performance settings.
The only change on the schedutil side is disabling the frequency-holding
heuristic, as it doesn't apply well to sched_ext schedulers, which may have a
much weaker connection between tasks and their current / last CPU.

A toy implementation of cpuperf is added to scx_qmap as a demonstration of
the feature.
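
As a rough sketch of how a scheduler might drive the interface (the tick
callback and the weight-based policy below are illustrative rather than taken
from the patch):

    void BPF_STRUCT_OPS(example_tick, struct task_struct *p)
    {
            s32 cpu = scx_bpf_task_cpu(p);
            u32 perf;

            /* scale the target with the running task's weight (100 == default) */
            perf = p->scx.weight * SCX_CPUPERF_ONE / 100;
            if (perf > SCX_CPUPERF_ONE)
                    perf = SCX_CPUPERF_ONE;
            scx_bpf_cpuperf_set(cpu, perf);

            /* current level relative to the most performant CPU in the system */
            bpf_printk("cpu%d sys-relative perf=%u", cpu,
                       scx_bpf_cpuperf_cap(cpu) * scx_bpf_cpuperf_cur(cpu) /
                       SCX_CPUPERF_ONE);
    }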

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
---
kernel/sched/cpufreq_schedutil.c | 8 ++
kernel/sched/ext.c | 83 +++++++++++++++++-
kernel/sched/ext.h | 3 +-
kernel/sched/sched.h | 1 +
tools/sched_ext/include/scx/common.bpf.h | 3 +
tools/sched_ext/include/scx/compat.bpf.h | 26 ++++++
tools/sched_ext/scx_qmap.bpf.c | 106 +++++++++++++++++++++++
tools/sched_ext/scx_qmap.c | 8 ++
8 files changed, 234 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 0827864c35ff..12174c0137a5 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -332,6 +332,14 @@ static bool sugov_hold_freq(struct sugov_cpu *sg_cpu)
unsigned long idle_calls;
bool ret;

+ /*
+ * The heuristics in this function are for the fair class. For SCX, the
+ * performance target comes directly from the BPF scheduler. Let's just
+ * follow it.
+ */
+ if (scx_switched_all())
+ return false;
+
/* if capped by uclamp_max, always update to be in compliance */
if (uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)))
return false;
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index fb4849fb7afd..6b7990f56845 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -14,6 +14,8 @@ enum scx_consts {
SCX_EXIT_BT_LEN = 64,
SCX_EXIT_MSG_LEN = 1024,
SCX_EXIT_DUMP_DFL_LEN = 32768,
+
+ SCX_CPUPERF_ONE = SCHED_CAPACITY_SCALE,
};

enum scx_exit_kind {
@@ -3810,7 +3812,7 @@ DEFINE_SCHED_CLASS(ext) = {
.update_curr = update_curr_scx,

#ifdef CONFIG_UCLAMP_TASK
- .uclamp_enabled = 0,
+ .uclamp_enabled = 1,
#endif
};

@@ -4628,7 +4630,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops)
struct scx_task_iter sti;
struct task_struct *p;
unsigned long timeout;
- int i, ret;
+ int i, cpu, ret;

mutex_lock(&scx_ops_enable_mutex);

@@ -4677,6 +4679,9 @@ static int scx_ops_enable(struct sched_ext_ops *ops)

atomic_long_set(&scx_nr_rejected, 0);

+ for_each_possible_cpu(cpu)
+ cpu_rq(cpu)->scx.cpuperf_target = SCX_CPUPERF_ONE;
+
/*
* Keep CPUs stable during enable so that the BPF scheduler can track
* online CPUs by watching ->on/offline_cpu() after ->init().
@@ -6227,6 +6232,77 @@ __bpf_kfunc void scx_bpf_error_bstr(char *fmt, unsigned long long *data,
data__sz);
}

+/**
+ * scx_bpf_cpuperf_cap - Query the maximum relative capacity of a CPU
+ * @cpu: CPU of interest
+ *
+ * Return the maximum relative capacity of @cpu in relation to the most
+ * performant CPU in the system. The return value is in the range [1,
+ * %SCX_CPUPERF_ONE]. See scx_bpf_cpuperf_cur().
+ */
+__bpf_kfunc u32 scx_bpf_cpuperf_cap(s32 cpu)
+{
+ if (ops_cpu_valid(cpu, NULL))
+ return arch_scale_cpu_capacity(cpu);
+ else
+ return SCX_CPUPERF_ONE;
+}
+
+/**
+ * scx_bpf_cpuperf_cur - Query the current relative performance of a CPU
+ * @cpu: CPU of interest
+ *
+ * Return the current relative performance of @cpu in relation to its maximum.
+ * The return value is in the range [1, %SCX_CPUPERF_ONE].
+ *
+ * The current performance level of a CPU in relation to the maximum performance
+ * available in the system can be calculated as follows:
+ *
+ * scx_bpf_cpuperf_cap() * scx_bpf_cpuperf_cur() / %SCX_CPUPERF_ONE
+ *
+ * The result is in the range [1, %SCX_CPUPERF_ONE].
+ */
+__bpf_kfunc u32 scx_bpf_cpuperf_cur(s32 cpu)
+{
+ if (ops_cpu_valid(cpu, NULL))
+ return arch_scale_freq_capacity(cpu);
+ else
+ return SCX_CPUPERF_ONE;
+}
+
+/**
+ * scx_bpf_cpuperf_set - Set the relative performance target of a CPU
+ * @cpu: CPU of interest
+ * @perf: target performance level [0, %SCX_CPUPERF_ONE]
+ * @flags: %SCX_CPUPERF_* flags
+ *
+ * Set the target performance level of @cpu to @perf. @perf is in linear
+ * relative scale between 0 and %SCX_CPUPERF_ONE. This determines how the
+ * schedutil cpufreq governor chooses the target frequency.
+ *
+ * The actual performance level chosen, CPU grouping, and the overhead and
+ * latency of the operations are dependent on the hardware and cpufreq driver in
+ * use. Consult hardware and cpufreq documentation for more information. The
+ * current performance level can be monitored using scx_bpf_cpuperf_cur().
+ */
+__bpf_kfunc void scx_bpf_cpuperf_set(u32 cpu, u32 perf)
+{
+ if (unlikely(perf > SCX_CPUPERF_ONE)) {
+ scx_ops_error("Invalid cpuperf target %u for CPU %d", perf, cpu);
+ return;
+ }
+
+ if (ops_cpu_valid(cpu, NULL)) {
+ struct rq *rq = cpu_rq(cpu);
+
+ rq->scx.cpuperf_target = perf;
+
+ rcu_read_lock_sched_notrace();
+ cpufreq_update_util(cpu_rq(cpu), 0);
+ rcu_read_unlock_sched_notrace();
+ }
+}
+
/**
* scx_bpf_nr_cpu_ids - Return the number of possible CPU IDs
*
@@ -6474,6 +6550,9 @@ BTF_ID_FLAGS(func, bpf_iter_scx_dsq_next, KF_ITER_NEXT | KF_RET_NULL)
BTF_ID_FLAGS(func, bpf_iter_scx_dsq_destroy, KF_ITER_DESTROY)
BTF_ID_FLAGS(func, scx_bpf_exit_bstr, KF_TRUSTED_ARGS)
BTF_ID_FLAGS(func, scx_bpf_error_bstr, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, scx_bpf_cpuperf_cap)
+BTF_ID_FLAGS(func, scx_bpf_cpuperf_cur)
+BTF_ID_FLAGS(func, scx_bpf_cpuperf_set)
BTF_ID_FLAGS(func, scx_bpf_nr_cpu_ids)
BTF_ID_FLAGS(func, scx_bpf_get_possible_cpumask, KF_ACQUIRE)
BTF_ID_FLAGS(func, scx_bpf_get_online_cpumask, KF_ACQUIRE)
diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h
index 6fd646450065..f555395d9783 100644
--- a/kernel/sched/ext.h
+++ b/kernel/sched/ext.h
@@ -46,9 +46,8 @@ void init_sched_ext_class(void);

static inline u32 scx_cpuperf_target(s32 cpu)
{
- /* for now, peg cpus at max performance while enabled */
if (scx_enabled())
- return SCHED_CAPACITY_SCALE;
+ return cpu_rq(cpu)->scx.cpuperf_target;
else
return 0;
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e8ef7309f347..d31db189977a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -732,6 +732,7 @@ struct scx_rq {
u64 extra_enq_flags; /* see move_task_to_local_dsq() */
u32 nr_running;
u32 flags;
+ u32 cpuperf_target; /* [0, SCHED_CAPACITY_SCALE] */
bool cpu_released;
cpumask_var_t cpus_to_kick;
cpumask_var_t cpus_to_kick_if_idle;
diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
index 17c76919e450..20ebb407148e 100644
--- a/tools/sched_ext/include/scx/common.bpf.h
+++ b/tools/sched_ext/include/scx/common.bpf.h
@@ -45,6 +45,9 @@ struct task_struct *bpf_iter_scx_dsq_next(struct bpf_iter_scx_dsq *it) __ksym __
void bpf_iter_scx_dsq_destroy(struct bpf_iter_scx_dsq *it) __ksym __weak;
void scx_bpf_exit_bstr(s64 exit_code, char *fmt, unsigned long long *data, u32 data__sz) __ksym __weak;
void scx_bpf_error_bstr(char *fmt, unsigned long long *data, u32 data_len) __ksym;
+u32 scx_bpf_cpuperf_cap(s32 cpu) __ksym __weak;
+u32 scx_bpf_cpuperf_cur(s32 cpu) __ksym __weak;
+void scx_bpf_cpuperf_set(s32 cpu, u32 perf) __ksym __weak;
u32 scx_bpf_nr_cpu_ids(void) __ksym __weak;
const struct cpumask *scx_bpf_get_possible_cpumask(void) __ksym __weak;
const struct cpumask *scx_bpf_get_online_cpumask(void) __ksym __weak;
diff --git a/tools/sched_ext/include/scx/compat.bpf.h b/tools/sched_ext/include/scx/compat.bpf.h
index c17ef3757b31..8fcba2505ad5 100644
--- a/tools/sched_ext/include/scx/compat.bpf.h
+++ b/tools/sched_ext/include/scx/compat.bpf.h
@@ -56,6 +56,32 @@ static inline void __COMPAT_scx_bpf_switch_all(void)
#define __COMPAT_HAS_CPUMASKS \
bpf_ksym_exists(scx_bpf_nr_cpu_ids)

+/*
+ * cpuperf is new. The following become no-ops on older kernels. Callers can be
+ * updated to call cpuperf kfuncs directly in the future.
+ */
+static inline u32 __COMPAT_scx_bpf_cpuperf_cap(s32 cpu)
+{
+ if (bpf_ksym_exists(scx_bpf_cpuperf_cap))
+ return scx_bpf_cpuperf_cap(cpu);
+ else
+ return 1024;
+}
+
+static inline u32 __COMPAT_scx_bpf_cpuperf_cur(s32 cpu)
+{
+ if (bpf_ksym_exists(scx_bpf_cpuperf_cur))
+ return scx_bpf_cpuperf_cur(cpu);
+ else
+ return 1024;
+}
+
+static inline void __COMPAT_scx_bpf_cpuperf_set(s32 cpu, u32 perf)
+{
+ if (bpf_ksym_exists(scx_bpf_cpuperf_set))
+ return scx_bpf_cpuperf_set(cpu, perf);
+}
+
/*
* Iteration and scx_bpf_consume_task() are new. The following become noop on
* older kernels. The users can switch to bpf_for_each(scx_dsq) and directly
diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
index 924e7e2b8c4c..977c7cff7b34 100644
--- a/tools/sched_ext/scx_qmap.bpf.c
+++ b/tools/sched_ext/scx_qmap.bpf.c
@@ -71,6 +71,18 @@ struct {
},
};

+/*
+ * If enabled, the CPU performance target is set according to the queue index
+ * using the following table.
+ */
+static const u32 qidx_to_cpuperf_target[] = {
+ [0] = SCX_CPUPERF_ONE * 0 / 4,
+ [1] = SCX_CPUPERF_ONE * 1 / 4,
+ [2] = SCX_CPUPERF_ONE * 2 / 4,
+ [3] = SCX_CPUPERF_ONE * 3 / 4,
+ [4] = SCX_CPUPERF_ONE * 4 / 4,
+};
+
/*
* Per-queue sequence numbers to implement core-sched ordering.
*
@@ -98,6 +110,8 @@ struct {
struct cpu_ctx {
u64 dsp_idx; /* dispatch index */
u64 dsp_cnt; /* remaining count */
+ u32 avg_weight;
+ u32 cpuperf_target;
};

struct {
@@ -110,6 +124,8 @@ struct {
/* Statistics */
u64 nr_enqueued, nr_dispatched, nr_reenqueued, nr_dequeued;
u64 nr_core_sched_execed, nr_expedited;
+u32 cpuperf_min, cpuperf_avg, cpuperf_max;
+u32 cpuperf_target_min, cpuperf_target_avg, cpuperf_target_max;

s32 BPF_STRUCT_OPS(qmap_select_cpu, struct task_struct *p,
s32 prev_cpu, u64 wake_flags)
@@ -347,6 +363,29 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
}
}

+void BPF_STRUCT_OPS(qmap_tick, struct task_struct *p)
+{
+ struct cpu_ctx *cpuc;
+ u32 zero = 0;
+ int idx;
+
+ if (!(cpuc = bpf_map_lookup_elem(&cpu_ctx_stor, &zero))) {
+ scx_bpf_error("failed to look up cpu_ctx");
+ return;
+ }
+
+ /*
+ * Use the running avg of weights to select the target cpuperf level.
+ * This is a demonstration of the cpuperf feature rather than a
+ * practical strategy to regulate CPU frequency.
+ */
+ cpuc->avg_weight = cpuc->avg_weight * 3 / 4 + p->scx.weight / 4;
+ idx = weight_to_idx(cpuc->avg_weight);
+ cpuc->cpuperf_target = qidx_to_cpuperf_target[idx];
+
+ scx_bpf_cpuperf_set(scx_bpf_task_cpu(p), cpuc->cpuperf_target);
+}
+
/*
* The distance from the head of the queue scaled by the weight of the queue.
* The lower the number, the older the task and the higher the priority.
@@ -490,6 +529,70 @@ struct {
__type(value, struct monitor_timer);
} central_timer SEC(".maps");

+/*
+ * Print out the min, avg and max performance levels of CPUs every second to
+ * demonstrate the cpuperf interface.
+ */
+static void monitor_cpuperf(void)
+{
+ u32 zero = 0, nr_cpu_ids;
+ u64 cap_sum = 0, cur_sum = 0, cur_min = SCX_CPUPERF_ONE, cur_max = 0;
+ u64 target_sum = 0, target_min = SCX_CPUPERF_ONE, target_max = 0;
+ const struct cpumask *online;
+ int i, nr_online_cpus = 0;
+
+ if (!__COMPAT_HAS_CPUMASKS)
+ return;
+
+ nr_cpu_ids = scx_bpf_nr_cpu_ids();
+ online = scx_bpf_get_online_cpumask();
+
+ bpf_for(i, 0, nr_cpu_ids) {
+ struct cpu_ctx *cpuc;
+ u32 cap, cur;
+
+ if (!bpf_cpumask_test_cpu(i, online))
+ continue;
+ nr_online_cpus++;
+
+ /* collect the capacity and current cpuperf */
+ cap = scx_bpf_cpuperf_cap(i);
+ cur = scx_bpf_cpuperf_cur(i);
+
+ cur_min = cur < cur_min ? cur : cur_min;
+ cur_max = cur > cur_max ? cur : cur_max;
+
+ /*
+ * $cur is relative to $cap. Scale it down accordingly so that
+ * it's in the same scale as other CPUs and $cur_sum/$cap_sum
+ * makes sense.
+ */
+ cur_sum += cur * cap / SCX_CPUPERF_ONE;
+ cap_sum += cap;
+
+ if (!(cpuc = bpf_map_lookup_percpu_elem(&cpu_ctx_stor, &zero, i))) {
+ scx_bpf_error("failed to look up cpu_ctx");
+ goto out;
+ }
+
+ /* collect target */
+ cur = cpuc->cpuperf_target;
+ target_sum += cur;
+ target_min = cur < target_min ? cur : target_min;
+ target_max = cur > target_max ? cur : target_max;
+ }
+
+ cpuperf_min = cur_min;
+ cpuperf_avg = cur_sum * SCX_CPUPERF_ONE / cap_sum;
+ cpuperf_max = cur_max;
+
+ cpuperf_target_min = target_min;
+ cpuperf_target_avg = target_sum / nr_online_cpus;
+ cpuperf_target_max = target_max;
+out:
+ scx_bpf_put_cpumask(online);
+}
+
/*
* Dump the currently queued tasks in the shared DSQ to demonstrate the usage of
* scx_bpf_dsq_nr_queued() and DSQ iterator. Raise the dispatch batch count to
@@ -513,6 +616,8 @@ static void dump_shared_dsq(void)

static int monitor_timerfn(void *map, int *key, struct bpf_timer *timer)
{
+ monitor_cpuperf();
+
if (print_shared_dsq)
dump_shared_dsq();

@@ -555,6 +660,7 @@ SCX_OPS_DEFINE(qmap_ops,
.enqueue = (void *)qmap_enqueue,
.dequeue = (void *)qmap_dequeue,
.dispatch = (void *)qmap_dispatch,
+ .tick = (void *)qmap_tick,
.core_sched_before = (void *)qmap_core_sched_before,
.cpu_release = (void *)qmap_cpu_release,
.init_task = (void *)qmap_init_task,
diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c
index 1b8cd2993ee2..eb221f0a0580 100644
--- a/tools/sched_ext/scx_qmap.c
+++ b/tools/sched_ext/scx_qmap.c
@@ -127,6 +127,14 @@ int main(int argc, char **argv)
nr_enqueued, nr_dispatched, nr_enqueued - nr_dispatched,
skel->bss->nr_reenqueued, skel->bss->nr_dequeued,
skel->bss->nr_core_sched_execed, skel->bss->nr_expedited);
+ if (__COMPAT_has_ksym("scx_bpf_cpuperf_cur"))
+ printf("cpuperf: cur min/avg/max=%u/%u/%u target min/avg/max=%u/%u/%u\n",
+ skel->bss->cpuperf_min,
+ skel->bss->cpuperf_avg,
+ skel->bss->cpuperf_max,
+ skel->bss->cpuperf_target_min,
+ skel->bss->cpuperf_target_avg,
+ skel->bss->cpuperf_target_max);
fflush(stdout);
sleep(1);
}
--
2.44.0


2024-05-01 15:22:44

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 38/39] sched_ext: Documentation: scheduler: Document extensible scheduler class

Add Documentation/scheduler/sched-ext.rst which gives a high-level overview
and pointers to the examples.

v5: - Updated to reflect /sys/kernel interface change. Kconfig options
added.

v4: - README improved, reformatted in markdown and renamed to README.md.

v3: - Added tools/sched_ext/README.

- Dropped _example prefix from scheduler names.

v2: - Apply minor edits suggested by Bagas. Caveats section dropped as all
of them are addressed.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
Cc: Bagas Sanjaya <[email protected]>
---
Documentation/scheduler/index.rst | 1 +
Documentation/scheduler/sched-ext.rst | 307 ++++++++++++++++++++++++++
include/linux/sched/ext.h | 2 +
kernel/Kconfig.preempt | 2 +
kernel/sched/ext.c | 2 +
kernel/sched/ext.h | 2 +
tools/sched_ext/README.md | 270 ++++++++++++++++++++++
7 files changed, 586 insertions(+)
create mode 100644 Documentation/scheduler/sched-ext.rst
create mode 100644 tools/sched_ext/README.md

diff --git a/Documentation/scheduler/index.rst b/Documentation/scheduler/index.rst
index 43bd8a145b7a..0611dc3dda8e 100644
--- a/Documentation/scheduler/index.rst
+++ b/Documentation/scheduler/index.rst
@@ -20,6 +20,7 @@ Scheduler
sched-nice-design
sched-rt-group
sched-stats
+ sched-ext
sched-debug

text_files
diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
new file mode 100644
index 000000000000..b01c4f22f73e
--- /dev/null
+++ b/Documentation/scheduler/sched-ext.rst
@@ -0,0 +1,307 @@
+==========================
+Extensible Scheduler Class
+==========================
+
+sched_ext is a scheduler class whose behavior can be defined by a set of BPF
+programs - the BPF scheduler.
+
+* sched_ext exports a full scheduling interface so that any scheduling
+ algorithm can be implemented on top.
+
+* The BPF scheduler can group CPUs however it sees fit and schedule them
+ together, as tasks aren't tied to specific CPUs at the time of wakeup.
+
+* The BPF scheduler can be turned on and off dynamically anytime.
+
+* The system integrity is maintained no matter what the BPF scheduler does.
+ The default scheduling behavior is restored anytime an error is detected,
+ a runnable task stalls, or on invoking the SysRq key sequence
+ :kbd:`SysRq-S`.
+
+Switching to and from sched_ext
+===============================
+
+``CONFIG_SCHED_CLASS_EXT`` is the config option to enable sched_ext and
+``tools/sched_ext`` contains the example schedulers. The following config
+options should be enabled to use sched_ext:
+
+.. code-block:: none
+
+ CONFIG_BPF=y
+ CONFIG_SCHED_CLASS_EXT=y
+ CONFIG_BPF_SYSCALL=y
+ CONFIG_BPF_JIT=y
+ CONFIG_DEBUG_INFO_BTF=y
+ CONFIG_BPF_JIT_ALWAYS_ON=y
+ CONFIG_BPF_JIT_DEFAULT_ON=y
+ CONFIG_PAHOLE_HAS_SPLIT_BTF=y
+ CONFIG_PAHOLE_HAS_BTF_TAG=y
+
+sched_ext is used only when the BPF scheduler is loaded and running.
+
+If a task explicitly sets its scheduling policy to ``SCHED_EXT``, it will be
+treated as ``SCHED_NORMAL`` and scheduled by CFS until the BPF scheduler is
+loaded. On load, such tasks will be switched to and scheduled by sched_ext.
+
+The BPF scheduler can choose to schedule all normal and lower class tasks by
+calling ``scx_bpf_switch_all()`` from its ``init()`` operation. In this
+case, all ``SCHED_NORMAL``, ``SCHED_BATCH``, ``SCHED_IDLE`` and
+``SCHED_EXT`` tasks are scheduled by sched_ext. In the example schedulers,
+this mode can be selected with the ``-a`` option.
+
+Terminating the sched_ext scheduler program, triggering :kbd:`SysRq-S`, or
+detection of any internal error including stalled runnable tasks aborts the
+BPF scheduler and reverts all tasks back to CFS.
+
+.. code-block:: none
+
+ # make -j16 -C tools/sched_ext
+ # tools/sched_ext/scx_simple
+ local=0 global=3
+ local=5 global=24
+ local=9 global=44
+ local=13 global=56
+ local=17 global=72
+ ^CEXIT: BPF scheduler unregistered
+
+The current status of the BPF scheduler can be determined as follows:
+
+.. code-block:: none
+
+ # cat /sys/kernel/sched_ext/state
+ enabled
+ # cat /sys/kernel/sched_ext/root/ops
+ simple
+
+``tools/sched_ext/scx_show_state.py`` is a drgn script which shows more
+detailed information:
+
+.. code-block:: none
+
+ # tools/sched_ext/scx_show_state.py
+ ops : simple
+ enabled : 1
+ switching_all : 1
+ switched_all : 1
+ enable_state : enabled (2)
+ bypass_depth : 0
+ nr_rejected : 0
+
+If ``CONFIG_SCHED_DEBUG`` is set, whether a given task is on sched_ext can
+be determined as follows:
+
+.. code-block:: none
+
+ # grep ext /proc/self/sched
+ ext.enabled : 1
+
+The Basics
+==========
+
+Userspace can implement an arbitrary BPF scheduler by loading a set of BPF
+programs that implement ``struct sched_ext_ops``. The only mandatory field
+is ``ops.name`` which must be a valid BPF object name. All operations are
+optional. The following modified excerpt is from
+``tools/sched_ext/scx_simple.bpf.c`` showing a minimal global FIFO scheduler.
+
+.. code-block:: c
+
+ /*
+ * Decide which CPU a task should be migrated to before being
+ * enqueued (either at wakeup, fork time, or exec time). If an
+ * idle core is found by the default ops.select_cpu() implementation,
+ * then dispatch the task directly to SCX_DSQ_LOCAL and skip the
+ * ops.enqueue() callback.
+ *
+ * Note that this implementation has exactly the same behavior as the
+ * default ops.select_cpu implementation. The behavior of the scheduler
+ * would be exactly the same if the implementation just didn't define the
+ * simple_select_cpu() struct_ops prog.
+ */
+ s32 BPF_STRUCT_OPS(simple_select_cpu, struct task_struct *p,
+ s32 prev_cpu, u64 wake_flags)
+ {
+ s32 cpu;
+ /* Need to initialize or the BPF verifier will reject the program */
+ bool direct = false;
+
+ cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &direct);
+
+ if (direct)
+ scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
+
+ return cpu;
+ }
+
+ /*
+ * Do a direct dispatch of a task to the global DSQ. This ops.enqueue()
+ * callback will only be invoked if we failed to find a core to dispatch
+ * to in ops.select_cpu() above.
+ *
+ * Note that this implementation has exactly the same behavior as the
+ * default ops.enqueue implementation, which just dispatches the task
+ * to SCX_DSQ_GLOBAL. The behavior of the scheduler would be exactly the same
+ * if the implementation just didn't define the simple_enqueue struct_ops
+ * prog.
+ */
+ void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags)
+ {
+ scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
+ }
+
+ s32 BPF_STRUCT_OPS(simple_init)
+ {
+ /*
+ * All SCHED_OTHER, SCHED_IDLE, and SCHED_BATCH tasks should
+ * use sched_ext.
+ */
+ scx_bpf_switch_all();
+ return 0;
+ }
+
+ void BPF_STRUCT_OPS(simple_exit, struct scx_exit_info *ei)
+ {
+ exit_type = ei->type;
+ }
+
+ SEC(".struct_ops")
+ struct sched_ext_ops simple_ops = {
+ .select_cpu = (void *)simple_select_cpu,
+ .enqueue = (void *)simple_enqueue,
+ .init = (void *)simple_init,
+ .exit = (void *)simple_exit,
+ .name = "simple",
+ };
+
+Dispatch Queues
+---------------
+
+To match the impedance between the scheduler core and the BPF scheduler,
+sched_ext uses DSQs (dispatch queues) which can operate as both a FIFO and a
+priority queue. By default, there is one global FIFO (``SCX_DSQ_GLOBAL``),
+and one local DSQ per CPU (``SCX_DSQ_LOCAL``). The BPF scheduler can manage
+an arbitrary number of DSQs using ``scx_bpf_create_dsq()`` and
+``scx_bpf_destroy_dsq()``.
+
+A CPU always executes a task from its local DSQ. A task is "dispatched" to a
+DSQ. A non-local DSQ is "consumed" to transfer a task to the consuming CPU's
+local DSQ.
+
+When a CPU is looking for the next task to run, if the local DSQ is not
+empty, the first task is picked. Otherwise, the CPU tries to consume the
+global DSQ. If that doesn't yield a runnable task either, ``ops.dispatch()``
+is invoked.
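+
+As a rough sketch (not taken from the in-tree examples; the DSQ ID
+``MY_DSQ_ID`` and all function names are chosen arbitrarily for illustration),
+the basic plumbing is to create a custom DSQ in ``ops.init()``, dispatch to it
+from ``ops.enqueue()`` and consume it from ``ops.dispatch()`` when a CPU runs
+out of work:
+
+.. code-block:: c
+
+   /* custom DSQ IDs must be smaller than 2^63 */
+   #define MY_DSQ_ID 0
+
+   s32 BPF_STRUCT_OPS_SLEEPABLE(sketch_init)
+   {
+           /* DSQs must be created before use; -1 means no NUMA preference */
+           return scx_bpf_create_dsq(MY_DSQ_ID, -1);
+   }
+
+   void BPF_STRUCT_OPS(sketch_enqueue, struct task_struct *p, u64 enq_flags)
+   {
+           /* queue @p on the custom DSQ in FIFO order */
+           scx_bpf_dispatch(p, MY_DSQ_ID, SCX_SLICE_DFL, enq_flags);
+   }
+
+   void BPF_STRUCT_OPS(sketch_dispatch, s32 cpu, struct task_struct *prev)
+   {
+           /* move the first task on MY_DSQ_ID to this CPU's local DSQ */
+           scx_bpf_consume(MY_DSQ_ID);
+   }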
+
+Scheduling Cycle
+----------------
+
+The following briefly shows how a waking task is scheduled and executed.
+
+1. When a task is waking up, ``ops.select_cpu()`` is the first operation
+ invoked. This serves two purposes: it provides a CPU selection
+ optimization hint and it wakes up the selected CPU if it is idle.
+
+ The CPU selected by ``ops.select_cpu()`` is an optimization hint and not
+ binding. The actual decision is made at the last step of scheduling.
+ However, there is a small performance gain if the CPU
+ ``ops.select_cpu()`` returns matches the CPU the task eventually runs on.
+
+ A side-effect of selecting a CPU is waking it up from idle. While a BPF
+ scheduler can wake up any CPU using the ``scx_bpf_kick_cpu()`` helper,
+ using ``ops.select_cpu()`` judiciously can be simpler and more efficient.
+
+ A task can be immediately dispatched to a DSQ from ``ops.select_cpu()`` by
+ calling ``scx_bpf_dispatch()``. If the task is dispatched to
+ ``SCX_DSQ_LOCAL`` from ``ops.select_cpu()``, it will be dispatched to the
+ local DSQ of whichever CPU is returned from ``ops.select_cpu()``.
+ Additionally, dispatching directly from ``ops.select_cpu()`` will cause the
+ ``ops.enqueue()`` callback to be skipped.
+
+ Note that the scheduler core will ignore an invalid CPU selection, for
+ example, if it's outside the allowed cpumask of the task.
+
+2. Once the target CPU is selected, ``ops.enqueue()`` is invoked (unless the
+ task was dispatched directly from ``ops.select_cpu()``). ``ops.enqueue()``
+ can make one of the following decisions:
+
+ * Immediately dispatch the task to either the global or local DSQ by
+ calling ``scx_bpf_dispatch()`` with ``SCX_DSQ_GLOBAL`` or
+ ``SCX_DSQ_LOCAL``, respectively.
+
+ * Immediately dispatch the task to a custom DSQ by calling
+ ``scx_bpf_dispatch()`` with a DSQ ID which is smaller than 2^63.
+
+ * Queue the task on the BPF side.
+
+3. When a CPU is ready to schedule, it first looks at its local DSQ. If
+ empty, it then looks at the global DSQ. If there still isn't a task to
+ run, ``ops.dispatch()`` is invoked which can use the following two
+ functions to populate the local DSQ.
+
+ * ``scx_bpf_dispatch()`` dispatches a task to a DSQ. Any target DSQ can
+ be used - ``SCX_DSQ_LOCAL``, ``SCX_DSQ_LOCAL_ON | cpu``,
+ ``SCX_DSQ_GLOBAL`` or a custom DSQ. While ``scx_bpf_dispatch()``
+ currently can't be called with BPF locks held, this is being worked on
+ and will be supported. ``scx_bpf_dispatch()`` schedules the dispatch
+ rather than performing it immediately. There can be up to
+ ``ops.dispatch_max_batch`` pending tasks.
+
+ * ``scx_bpf_consume()`` transfers a task from the specified non-local DSQ
+ to the dispatching DSQ. This function cannot be called with any BPF
+ locks held. ``scx_bpf_consume()`` flushes the pending dispatched tasks
+ before trying to consume the specified DSQ.
+
+4. After ``ops.dispatch()`` returns, if there are tasks in the local DSQ,
+ the CPU runs the first one. If empty, the following steps are taken:
+
+ * Try to consume the global DSQ. If successful, run the task.
+
+ * If ``ops.dispatch()`` has dispatched any tasks, retry #3.
+
+ * If the previous task is an SCX task and still runnable, keep executing
+ it (see ``SCX_OPS_ENQ_LAST``).
+
+ * Go idle.
+
+Note that the BPF scheduler can always choose to dispatch tasks immediately
+in ``ops.enqueue()`` as illustrated in the above simple example. If only the
+built-in DSQs are used, there is no need to implement ``ops.dispatch()`` as
+a task is never queued on the BPF scheduler and both the local and global
+DSQs are consumed automatically.
+
+``scx_bpf_dispatch()`` queues the task on the FIFO of the target DSQ. Use
+``scx_bpf_dispatch_vtime()`` for the priority queue. Internal DSQs such as
+``SCX_DSQ_LOCAL`` and ``SCX_DSQ_GLOBAL`` do not support priority-queue
+dispatching, and must be dispatched to with ``scx_bpf_dispatch()``. See the
+function documentation and usage in ``tools/sched_ext/scx_simple.bpf.c`` for
+more information.
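+
+As another illustrative sketch (not from the in-tree examples, and assuming
+the scheduler maintains ``p->scx.dsq_vtime`` elsewhere, e.g. advancing it in
+``ops.stopping()`` in proportion to the used slice divided by the task
+weight), a weighted vtime scheduler could enqueue to the custom ``MY_DSQ_ID``
+from the earlier sketch as follows:
+
+.. code-block:: c
+
+   void BPF_STRUCT_OPS(vtime_sketch_enqueue, struct task_struct *p,
+                       u64 enq_flags)
+   {
+           /* order the custom DSQ by vtime instead of FIFO */
+           scx_bpf_dispatch_vtime(p, MY_DSQ_ID, SCX_SLICE_DFL,
+                                  p->scx.dsq_vtime, enq_flags);
+   }
+
+Consuming such a DSQ with ``scx_bpf_consume()`` then hands out the task with
+the smallest vtime first.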
+
+Where to Look
+=============
+
+* ``include/linux/sched/ext.h`` defines the core data structures, ops table
+ and constants.
+
+* ``kernel/sched/ext.c`` contains sched_ext core implementation and helpers.
+ The functions prefixed with ``scx_bpf_`` can be called from the BPF
+ scheduler.
+
+* ``tools/sched_ext/`` hosts example BPF scheduler implementations.
+
+ * ``scx_simple[.bpf].c``: Minimal global FIFO scheduler example using a
+ custom DSQ.
+
+ * ``scx_qmap[.bpf].c``: A multi-level FIFO scheduler supporting five
+ levels of priority implemented with ``BPF_MAP_TYPE_QUEUE``.
+
+ABI Instability
+===============
+
+The APIs provided by sched_ext to BPF scheduler programs have no stability
+guarantees. This includes the ops table callbacks and constants defined in
+``include/linux/sched/ext.h``, as well as the ``scx_bpf_`` kfuncs defined in
+``kernel/sched/ext.c``.
+
+While we will attempt to provide a relatively stable API surface when
+possible, these interfaces are subject to change without warning between kernel
+versions.
diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index 32cc5f439983..364ca827a45e 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -1,5 +1,7 @@
/* SPDX-License-Identifier: GPL-2.0 */
/*
+ * BPF extensible scheduler class: Documentation/scheduler/sched-ext.rst
+ *
* Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
* Copyright (c) 2022 Tejun Heo <[email protected]>
* Copyright (c) 2022 David Vernet <[email protected]>
diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index e12a057ead7b..bae49b743834 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -154,3 +154,5 @@ config SCHED_CLASS_EXT
wish to implement scheduling policies. The struct_ops structure
exported by sched_ext is struct sched_ext_ops, and is conceptually
similar to struct sched_class.
+
+ See Documentation/scheduler/sched-ext.rst for more details.
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 6b7990f56845..b09f6b154ed6 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1,5 +1,7 @@
/* SPDX-License-Identifier: GPL-2.0 */
/*
+ * BPF extensible scheduler class: Documentation/scheduler/sched-ext.rst
+ *
* Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
* Copyright (c) 2022 Tejun Heo <[email protected]>
* Copyright (c) 2022 David Vernet <[email protected]>
diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h
index f555395d9783..97064e53f299 100644
--- a/kernel/sched/ext.h
+++ b/kernel/sched/ext.h
@@ -1,5 +1,7 @@
/* SPDX-License-Identifier: GPL-2.0 */
/*
+ * BPF extensible scheduler class: Documentation/scheduler/sched-ext.rst
+ *
* Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
* Copyright (c) 2022 Tejun Heo <[email protected]>
* Copyright (c) 2022 David Vernet <[email protected]>
diff --git a/tools/sched_ext/README.md b/tools/sched_ext/README.md
new file mode 100644
index 000000000000..16a42e4060f6
--- /dev/null
+++ b/tools/sched_ext/README.md
@@ -0,0 +1,270 @@
+SCHED_EXT EXAMPLE SCHEDULERS
+============================
+
+# Introduction
+
+This directory contains a number of example sched_ext schedulers. These
+schedulers are meant to provide examples of different types of schedulers
+that can be built using sched_ext, and illustrate how various features of
+sched_ext can be used.
+
+Some of the examples are performant, production-ready schedulers. That is, for
+the correct workload and with the correct tuning, they may be deployed in a
+production environment with acceptable or possibly even improved performance.
+Others are just examples that, in practice, would not provide acceptable
+performance (though they could be improved to get there).
+
+This README describes these example schedulers, including the types of
+workloads or scenarios they're designed to accommodate and whether or not
+they're production ready. For more details on any of these schedulers, please
+see the header comment in their .bpf.c file.
+
+
+# Compiling the examples
+
+There are a few toolchain dependencies for compiling the example schedulers.
+
+## Toolchain dependencies
+
+1. clang >= 16.0.0
+
+The schedulers are BPF programs, and therefore must be compiled with clang. gcc
+is actively working on adding a BPF backend compiler as well, but it is still
+missing some features, such as BTF type tags, which are necessary for using
+kptrs.
+
+2. pahole >= 1.25
+
+You may need pahole in order to generate BTF from DWARF.
+
+3. rust >= 1.70.0
+
+The Rust schedulers use features present in the rust toolchain >= 1.70.0. You
+should be able to use the stable build from rustup, but if that doesn't
+work, try using the rustup nightly build.
+
+There are other requirements as well, such as make, but these are the main /
+non-trivial ones.
+
+## Compiling the kernel
+
+In order to run a sched_ext scheduler, you'll have to run a kernel compiled
+with the patches in this repository, and with a minimum set of necessary
+Kconfig options:
+
+```
+CONFIG_BPF=y
+CONFIG_SCHED_CLASS_EXT=y
+CONFIG_BPF_SYSCALL=y
+CONFIG_BPF_JIT=y
+CONFIG_DEBUG_INFO_BTF=y
+```
+
+It's also recommended that you include the following Kconfig options:
+
+```
+CONFIG_BPF_JIT_ALWAYS_ON=y
+CONFIG_BPF_JIT_DEFAULT_ON=y
+CONFIG_PAHOLE_HAS_SPLIT_BTF=y
+CONFIG_PAHOLE_HAS_BTF_TAG=y
+```
+
+There is a `Kconfig` file in this directory whose contents you can append to
+your local `.config` file, as long as there are no conflicts with any existing
+options in the file.
+
+## Getting a vmlinux.h file
+
+You may notice that most of the example schedulers include a "vmlinux.h" file.
+This is a large, auto-generated header file that contains all of the types
+defined in some vmlinux binary that was compiled with
+[BTF](https://docs.kernel.org/bpf/btf.html) (i.e. with the BTF-related Kconfig
+options specified above).
+
+The header file is created using `bpftool`, by passing it a vmlinux binary
+compiled with BTF as follows:
+
+```bash
+$ bpftool btf dump file /path/to/vmlinux format c > vmlinux.h
+```
+
+`bpftool` analyzes all of the BTF encodings in the binary, and produces a
+header file that can be included by BPF programs to access those types. For
+example, using vmlinux.h allows a scheduler to access fields defined directly
+in vmlinux as follows:
+
+```c
+#include "vmlinux.h"
+// vmlinux.h is also implicitly included by scx_common.bpf.h.
+#include "scx_common.bpf.h"
+
+/*
+ * vmlinux.h provides definitions for struct task_struct and
+ * struct scx_enable_args.
+ */
+void BPF_STRUCT_OPS(example_enable, struct task_struct *p,
+ struct scx_enable_args *args)
+{
+ bpf_printk("Task %s enabled in example scheduler", p->comm);
+}
+
+// vmlinux.h provides the definition for struct sched_ext_ops.
+SEC(".struct_ops.link")
+struct sched_ext_ops example_ops = {
+ .enable = (void *)example_enable,
+ .name = "example",
+};
+```
+
+The scheduler build system will generate this vmlinux.h file as part of the
+scheduler build pipeline. It looks for a vmlinux file in the following
+dependency order:
+
+1. If the O= environment variable is defined, at `$O/vmlinux`
+2. If the KBUILD_OUTPUT= environment variable is defined, at
+ `$KBUILD_OUTPUT/vmlinux`
+3. At `../../vmlinux` (i.e. at the root of the kernel tree where you're
+ compiling the schedulers)
+4. `/sys/kernel/btf/vmlinux`
+5. `/boot/vmlinux-$(uname -r)`
+
+In other words, if you have compiled a kernel in your local repo, its vmlinux
+file will be used to generate vmlinux.h. Otherwise, it will be the vmlinux of
+the kernel you're currently running on. This means that if you're running on a
+kernel with sched_ext support, you may not need to compile a local kernel at
+all.
+
+### Aside on CO-RE
+
+One of the cooler features of BPF is that it supports
+[CO-RE](https://nakryiko.com/posts/bpf-core-reference-guide/) (Compile Once Run
+Everywhere). This feature allows you to reference fields inside of structs with
+types defined internal to the kernel, and not have to recompile if you load the
+BPF program on a different kernel with the field at a different offset. In our
+example above, we print out a task name with `p->comm`. CO-RE would perform
+relocations for that access when the program is loaded to ensure that it's
+referencing the correct offset for the currently running kernel.
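+
+As a hypothetical sketch (the field, program and callback names below are only
+for illustration), libbpf's CO-RE helpers can also probe whether a field
+exists on the running kernel before reading it:
+
+```c
+#include "vmlinux.h"
+#include <bpf/bpf_core_read.h>
+#include "scx_common.bpf.h"
+
+void BPF_STRUCT_OPS(core_sketch_enable, struct task_struct *p,
+		    struct scx_enable_args *args)
+{
+	/*
+	 * The check below is resolved at load time against the running
+	 * kernel's BTF, so the same object loads even if the field is absent.
+	 */
+	if (bpf_core_field_exists(p->migration_disabled))
+		bpf_printk("migration_disabled=%d",
+			   BPF_CORE_READ(p, migration_disabled));
+}
+```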
+
+## Compiling the schedulers
+
+Once you have your toolchain set up and a vmlinux that can be used to generate
+a full vmlinux.h file, you can compile the schedulers using `make`:
+
+```bash
+$ make -j$(nproc)
+```
+
+# Example schedulers
+
+This directory contains the following example schedulers. These schedulers are
+for testing and demonstrating different aspects of sched_ext. While some may be
+useful in limited scenarios, they are not intended to be practical.
+
+For more scheduler implementations, tools and documentation, visit
+https://github.com/sched-ext/scx.
+
+## scx_simple
+
+A simple scheduler that provides an example of a minimal sched_ext scheduler.
+scx_simple can be run in either global weighted vtime mode, or FIFO mode.
+
+Though very simple, in limited scenarios, this scheduler can perform reasonably
+well on single-socket systems with a unified L3 cache.
+
+## scx_qmap
+
+Another simple, yet slightly more complex scheduler that provides an example of
+a basic weighted FIFO queuing policy. It also provides examples of some common
+useful BPF features, such as sleepable per-task storage allocation in the
+`ops.init_task()` callback, and using the `BPF_MAP_TYPE_QUEUE` map type to
+enqueue tasks. It also illustrates how core-sched support could be implemented.
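+
+As a loose sketch of the queuing part (simplified and not the actual scx_qmap
+code; helper names are illustrative), a `BPF_MAP_TYPE_QUEUE` can be declared
+and driven roughly like this:
+
+```c
+struct {
+	__uint(type, BPF_MAP_TYPE_QUEUE);
+	__uint(max_entries, 4096);
+	__type(value, u32);		/* stores pids */
+} queue0 SEC(".maps");
+
+/* called from ops.enqueue(): remember the task by pid */
+static void queue_pid(struct task_struct *p)
+{
+	u32 pid = p->pid;
+
+	if (bpf_map_push_elem(&queue0, &pid, 0))
+		scx_bpf_error("queue0 overflow");
+}
+
+/* called from ops.dispatch(): pop a pid, look the task up and dispatch it */
+static void dispatch_one(void)
+{
+	struct task_struct *p;
+	u32 pid;
+
+	if (bpf_map_pop_elem(&queue0, &pid))
+		return;
+
+	p = bpf_task_from_pid(pid);
+	if (!p)
+		return;		/* the task may have exited in the meantime */
+
+	scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, 0);
+	bpf_task_release(p);
+}
+```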
+
+## scx_central
+
+A "central" scheduler where scheduling decisions are made from a single CPU.
+This scheduler illustrates how scheduling decisions can be dispatched from a
+single CPU, allowing other cores to run with infinite slices, without timer
+ticks, and without having to incur the overhead of making scheduling decisions.
+
+The approach demonstrated by this scheduler may be useful for any workload that
+benefits from minimizing scheduling overhead and timer ticks. An example of
+where this could be particularly useful is running VMs, where running with
+infinite slices and no timer ticks allows the VM to avoid unnecessary expensive
+vmexits.
+
+## scx_flatcg
+
+A flattened cgroup hierarchy scheduler. This scheduler implements hierarchical
+weight-based cgroup CPU control by flattening the cgroup hierarchy into a single
+layer, compounding the active weight share at each level. The effect of this
+is a much more performant CPU controller, which does not need to descend down
+cgroup trees in order to properly compute a cgroup's share.
+
+Similar to scx_simple, in limited scenarios, this scheduler can perform
+reasonably well on single-socket systems with a unified L3 cache and show
+significantly lowered hierarchical scheduling overhead.
+
+
+# Troubleshooting
+
+There are a number of common issues that you may run into when building the
+schedulers. We'll go over some of the common ones here.
+
+## Build Failures
+
+### Old version of clang
+
+```
+error: static assertion failed due to requirement 'SCX_DSQ_FLAG_BUILTIN': bpftool generated vmlinux.h is missing high bits for 64bit enums, upgrade clang and pahole
+ _Static_assert(SCX_DSQ_FLAG_BUILTIN,
+ ^~~~~~~~~~~~~~~~~~~~
+1 error generated.
+```
+
+This means you built the kernel or the schedulers with an older version of
+clang than what's supported (i.e. older than 16.0.0). To remediate this:
+
+1. `which clang` to make sure you're using a sufficiently new version of clang.
+
+2. `make fullclean` in the root path of the repository.
+
+3. Rebuild the kernel, and then your example schedulers.
+
+The schedulers are also cleaned if you invoke `make mrproper` in the root
+directory of the tree.
+
+### Stale kernel build / incomplete vmlinux.h file
+
+As described above, you'll need a `vmlinux.h` file that was generated from a
+vmlinux built with BTF, and with sched_ext support enabled. If you don't,
+you'll see errors such as the following which indicate that a type being
+referenced in a scheduler is unknown:
+
+```
+/path/to/sched_ext/tools/sched_ext/user_exit_info.h:25:23: note: forward declaration of 'struct scx_exit_info'
+
+const struct scx_exit_info *ei)
+
+^
+```
+
+In order to resolve this, please follow the steps above in
+[Getting a vmlinux.h file](#getting-a-vmlinuxh-file) in order to ensure your
+schedulers are using a vmlinux.h file that includes the requisite types.
+
+## Misc
+
+### llvm: [OFF]
+
+You may see the following output when building the schedulers:
+
+```
+Auto-detecting system features:
+... clang-bpf-co-re: [ on ]
+... llvm: [ OFF ]
+... libcap: [ on ]
+... libbfd: [ on ]
+```
+
+Seeing `llvm: [ OFF ]` here is not an issue. You can safely ignore it.
--
2.44.0


2024-05-01 15:26:57

by Tejun Heo

Subject: [PATCH 34/39] sched_ext: Implement core-sched support

The core-sched support is composed of the following parts:

- task_struct->scx.core_sched_at is added. This is a timestamp which can be
used to order tasks. Depending on whether the BPF scheduler implements
custom ordering, it tracks either global FIFO ordering of all tasks or
local-DSQ ordering within the dispatched tasks on a CPU.

- prio_less() is updated to call scx_prio_less() when comparing SCX tasks.
scx_prio_less() calls ops.core_sched_before() if available or uses the
core_sched_at timestamp. For global FIFO ordering, the BPF scheduler
doesn't need to do anything. Otherwise, it should implement
ops.core_sched_before() which reflects the ordering.

- When core-sched is enabled, balance_scx() balances all SMT siblings so
that they all have tasks dispatched if necessary before pick_task_scx() is
called. pick_task_scx() picks between the current task and the first
dispatched task on the local DSQ based on availability and the
core_sched_at timestamps. Note that FIFO ordering is expected among the
already dispatched tasks whether running or on the local DSQ, so this path
always compares core_sched_at instead of calling into
ops.core_sched_before().

qmap_core_sched_before() is added to scx_qmap. It scales the
distances from the heads of the queues to compare the tasks across different
priority queues and seems to behave as expected.

v3: Fixed build error when !CONFIG_SCHED_SMT reported by Andrea Righi.

v2: Sched core added the const qualifiers to prio_less task arguments.
Explicitly drop them for ops.core_sched_before() task arguments. BPF
enforces access control through the verifier, so the qualifier isn't
actually operative and only gets in the way when interacting with
various helpers.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Reviewed-by: Josh Don <[email protected]>
Cc: Andrea Righi <[email protected]>
---
include/linux/sched/ext.h | 3 +
kernel/Kconfig.preempt | 2 +-
kernel/sched/core.c | 10 +-
kernel/sched/ext.c | 250 +++++++++++++++++++++++++++++++--
kernel/sched/ext.h | 5 +
tools/sched_ext/scx_qmap.bpf.c | 87 +++++++++++-
tools/sched_ext/scx_qmap.c | 5 +-
7 files changed, 344 insertions(+), 18 deletions(-)

diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index 1dc0182fb1c8..d89b4d907b26 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -129,6 +129,9 @@ struct sched_ext_entity {
struct list_head runnable_node; /* rq->scx.runnable_list */
unsigned long runnable_at;

+#ifdef CONFIG_SCHED_CORE
+ u64 core_sched_at; /* see scx_prio_less() */
+#endif
u64 ddsp_dsq_id;
u64 ddsp_enq_flags;

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 0afcda19bc50..e12a057ead7b 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -135,7 +135,7 @@ config SCHED_CORE

config SCHED_CLASS_EXT
bool "Extensible Scheduling Class"
- depends on BPF_SYSCALL && BPF_JIT && !SCHED_CORE
+ depends on BPF_SYSCALL && BPF_JIT
help
This option enables a new scheduler class sched_ext (SCX), which
allows scheduling policies to be implemented as BPF programs to
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d940c17dfe8a..f3d210301e9f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -168,7 +168,10 @@ static inline int __task_prio(const struct task_struct *p)
if (p->sched_class == &idle_sched_class)
return MAX_RT_PRIO + NICE_WIDTH; /* 140 */

- return MAX_RT_PRIO + MAX_NICE; /* 120, squash fair */
+ if (task_on_scx(p))
+ return MAX_RT_PRIO + MAX_NICE + 1; /* 120, squash ext */
+
+ return MAX_RT_PRIO + MAX_NICE; /* 119, squash fair */
}

/*
@@ -197,6 +200,11 @@ static inline bool prio_less(const struct task_struct *a,
if (pa == MAX_RT_PRIO + MAX_NICE) /* fair */
return cfs_prio_less(a, b, in_fi);

+#ifdef CONFIG_SCHED_CLASS_EXT
+ if (pa == MAX_RT_PRIO + MAX_NICE + 1) /* ext */
+ return scx_prio_less(a, b, in_fi);
+#endif
+
return false;
}

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 619ae7e814be..83a56b5bb72b 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -348,6 +348,24 @@ struct sched_ext_ops {
*/
bool (*yield)(struct task_struct *from, struct task_struct *to);

+ /**
+ * core_sched_before - Task ordering for core-sched
+ * @a: task A
+ * @b: task B
+ *
+ * Used by core-sched to determine the ordering between two tasks. See
+ * Documentation/admin-guide/hw-vuln/core-scheduling.rst for details on
+ * core-sched.
+ *
+ * Both @a and @b are runnable and may or may not currently be queued on
+ * the BPF scheduler. Should return %true if @a should run before @b.
+ * %false if there's no required ordering or @b should run before @a.
+ *
+ * If not specified, the default is ordering them according to when they
+ * became runnable.
+ */
+ bool (*core_sched_before)(struct task_struct *a, struct task_struct *b);
+
/**
* set_weight - Set task weight
* @p: task to set weight for
@@ -672,6 +690,14 @@ enum scx_enq_flags {
enum scx_deq_flags {
/* expose select DEQUEUE_* flags as enums */
SCX_DEQ_SLEEP = DEQUEUE_SLEEP,
+
+ /* high 32bits are SCX specific */
+
+ /*
+ * The generic core-sched layer decided to execute the task even though
+ * it hasn't been dispatched yet. Dequeue from the BPF side.
+ */
+ SCX_DEQ_CORE_SCHED_EXEC = 1LLU << 32,
};

enum scx_pick_idle_cpu_flags {
@@ -1278,6 +1304,49 @@ static int ops_sanitize_err(const char *ops_name, s32 err)
return -EPROTO;
}

+/**
+ * touch_core_sched - Update timestamp used for core-sched task ordering
+ * @rq: rq to read clock from, must be locked
+ * @p: task to update the timestamp for
+ *
+ * Update @p->scx.core_sched_at timestamp. This is used by scx_prio_less() to
+ * implement global or local-DSQ FIFO ordering for core-sched. Should be called
+ * when a task becomes runnable and its turn on the CPU ends (e.g. slice
+ * exhaustion).
+ */
+static void touch_core_sched(struct rq *rq, struct task_struct *p)
+{
+#ifdef CONFIG_SCHED_CORE
+ /*
+ * It's okay to update the timestamp spuriously. Use
+ * sched_core_disabled() which is cheaper than enabled().
+ */
+ if (!sched_core_disabled())
+ p->scx.core_sched_at = rq_clock_task(rq);
+#endif
+}
+
+/**
+ * touch_core_sched_dispatch - Update core-sched timestamp on dispatch
+ * @rq: rq to read clock from, must be locked
+ * @p: task being dispatched
+ *
+ * If the BPF scheduler implements custom core-sched ordering via
+ * ops.core_sched_before(), @p->scx.core_sched_at is used to implement FIFO
+ * ordering within each local DSQ. This function is called from dispatch paths
+ * and updates @p->scx.core_sched_at if custom core-sched ordering is in effect.
+ */
+static void touch_core_sched_dispatch(struct rq *rq, struct task_struct *p)
+{
+ lockdep_assert_rq_held(rq);
+ assert_clock_updated(rq);
+
+#ifdef CONFIG_SCHED_CORE
+ if (SCX_HAS_OP(core_sched_before))
+ touch_core_sched(rq, p);
+#endif
+}
+
static void update_curr_scx(struct rq *rq)
{
struct task_struct *curr = rq->curr;
@@ -1293,8 +1362,11 @@ static void update_curr_scx(struct rq *rq)
account_group_exec_runtime(curr, delta_exec);
cgroup_account_cputime(curr, delta_exec);

- if (curr->scx.slice != SCX_SLICE_INF)
+ if (curr->scx.slice != SCX_SLICE_INF) {
curr->scx.slice -= min(curr->scx.slice, delta_exec);
+ if (!curr->scx.slice)
+ touch_core_sched(rq, curr);
+ }
}

static void dsq_mod_nr(struct scx_dispatch_q *dsq, s32 delta)
@@ -1487,6 +1559,8 @@ static void direct_dispatch(struct task_struct *p, u64 enq_flags)
{
struct scx_dispatch_q *dsq;

+ touch_core_sched_dispatch(task_rq(p), p);
+
enq_flags |= (p->scx.ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS);
dsq = find_dsq_for_dispatch(task_rq(p), p->scx.ddsp_dsq_id, p);
dispatch_enqueue(dsq, p, enq_flags);
@@ -1572,12 +1646,19 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
return;

local:
+ /*
+ * For task-ordering, slice refill must be treated as implying the end
+ * of the current slice. Otherwise, the longer @p stays on the CPU, the
+ * higher priority it becomes from scx_prio_less()'s POV.
+ */
+ touch_core_sched(rq, p);
p->scx.slice = SCX_SLICE_DFL;
local_norefill:
dispatch_enqueue(&rq->scx.local_dsq, p, enq_flags);
return;

global:
+ touch_core_sched(rq, p); /* see the comment in local: */
p->scx.slice = SCX_SLICE_DFL;
dispatch_enqueue(&scx_dsq_global, p, enq_flags);
}
@@ -1641,6 +1722,9 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags
if (SCX_HAS_OP(runnable))
SCX_CALL_OP_TASK(SCX_KF_REST, runnable, p, enq_flags);

+ if (enq_flags & SCX_ENQ_WAKEUP)
+ touch_core_sched(rq, p);
+
do_enqueue_task(rq, p, enq_flags, sticky_cpu);
}

@@ -2131,6 +2215,7 @@ static void finish_dispatch(struct rq *rq, struct rq_flags *rf,
struct scx_dispatch_q *dsq;
unsigned long opss;

+ touch_core_sched_dispatch(rq, p);
retry:
/*
* No need for _acquire here. @p is accessed only after a successful
@@ -2208,8 +2293,8 @@ static void flush_dispatch_buf(struct rq *rq, struct rq_flags *rf)
dspc->buf_cursor = 0;
}

-static int balance_scx(struct rq *rq, struct task_struct *prev,
- struct rq_flags *rf)
+static int balance_one(struct rq *rq, struct task_struct *prev,
+ struct rq_flags *rf, bool local)
{
struct scx_rq *scx_rq = &rq->scx;
struct scx_dsp_ctx *dspc = this_cpu_ptr(&scx_dsp_ctx);
@@ -2235,7 +2320,7 @@ static int balance_scx(struct rq *rq, struct task_struct *prev,
}

if (prev_on_scx) {
- WARN_ON_ONCE(prev->scx.flags & SCX_TASK_BAL_KEEP);
+ WARN_ON_ONCE(local && (prev->scx.flags & SCX_TASK_BAL_KEEP));
update_curr_scx(rq);

/*
@@ -2247,10 +2332,16 @@ static int balance_scx(struct rq *rq, struct task_struct *prev,
*
* See scx_ops_disable_workfn() for the explanation on the
* bypassing test.
+ *
+ * When balancing a remote CPU for core-sched, there won't be a
+ * following put_prev_task_scx() call and we don't own
+ * %SCX_TASK_BAL_KEEP. Instead, pick_task_scx() will test the
+ * same conditions later and pick @rq->curr accordingly.
*/
if ((prev->scx.flags & SCX_TASK_QUEUED) &&
prev->scx.slice && !scx_ops_bypassing()) {
- prev->scx.flags |= SCX_TASK_BAL_KEEP;
+ if (local)
+ prev->scx.flags |= SCX_TASK_BAL_KEEP;
goto has_tasks;
}
}
@@ -2312,10 +2403,56 @@ static int balance_scx(struct rq *rq, struct task_struct *prev,
return has_tasks;
}

+static int balance_scx(struct rq *rq, struct task_struct *prev,
+ struct rq_flags *rf)
+{
+ int ret;
+
+ ret = balance_one(rq, prev, rf, true);
+
+#ifdef CONFIG_SCHED_SMT
+ /*
+ * When core-sched is enabled, this ops.balance() call will be followed
+ * by put_prev_task_scx() and pick_task_scx() on this CPU and pick_task_scx()
+ * on the SMT siblings. Balance the siblings too.
+ */
+ if (sched_core_enabled(rq)) {
+ const struct cpumask *smt_mask = cpu_smt_mask(cpu_of(rq));
+ int scpu;
+
+ for_each_cpu_andnot(scpu, smt_mask, cpumask_of(cpu_of(rq))) {
+ struct rq *srq = cpu_rq(scpu);
+ struct rq_flags srf;
+ struct task_struct *sprev = srq->curr;
+
+ /*
+ * While core-scheduling, rq lock is shared among
+ * siblings but the debug annotations and rq clock
+ * aren't. Do pinning dance to transfer the ownership.
+ */
+ WARN_ON_ONCE(__rq_lockp(rq) != __rq_lockp(srq));
+ rq_unpin_lock(rq, rf);
+ rq_pin_lock(srq, &srf);
+
+ update_rq_clock(srq);
+ balance_one(srq, sprev, &srf, false);
+
+ rq_unpin_lock(srq, &srf);
+ rq_repin_lock(rq, rf);
+ }
+ }
+#endif
+ return ret;
+}
+
static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first)
{
if (p->scx.flags & SCX_TASK_QUEUED) {
- WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE);
+ /*
+ * Core-sched might decide to execute @p before it is
+ * dispatched. Call ops_dequeue() to notify the BPF scheduler.
+ */
+ ops_dequeue(p, SCX_DEQ_CORE_SCHED_EXEC);
dispatch_dequeue(&rq->scx, p);
}

@@ -2406,7 +2543,8 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p)
/*
* If @p has slice left and balance_scx() didn't tag it for
* keeping, @p is getting preempted by a higher priority
- * scheduler class. Leave it at the head of the local DSQ.
+ * scheduler class or core-sched forcing a different task. Leave
+ * it at the head of the local DSQ.
*/
if (p->scx.slice && !scx_ops_bypassing()) {
dispatch_enqueue(&rq->scx.local_dsq, p, SCX_ENQ_HEAD);
@@ -2463,6 +2601,84 @@ static struct task_struct *pick_next_task_scx(struct rq *rq)
return p;
}

+#ifdef CONFIG_SCHED_CORE
+/**
+ * scx_prio_less - Task ordering for core-sched
+ * @a: task A
+ * @b: task B
+ *
+ * Core-sched is implemented as an additional scheduling layer on top of the
+ * usual sched_class'es and needs to find out the expected task ordering. For
+ * SCX, core-sched calls this function to interrogate the task ordering.
+ *
+ * Unless overridden by ops.core_sched_before(), @p->scx.core_sched_at is used
+ * to implement the default task ordering. The older the timestamp, the higher
+ * priority the task - the global FIFO ordering matching the default scheduling
+ * behavior.
+ *
+ * When ops.core_sched_before() is enabled, @p->scx.core_sched_at is used to
+ * implement FIFO ordering within each local DSQ. See pick_task_scx().
+ */
+bool scx_prio_less(const struct task_struct *a, const struct task_struct *b,
+ bool in_fi)
+{
+ /*
+ * The const qualifiers are dropped from task_struct pointers when
+ * calling ops.core_sched_before(). Accesses are controlled by the
+ * verifier.
+ */
+ if (SCX_HAS_OP(core_sched_before) && !scx_ops_bypassing())
+ return SCX_CALL_OP_2TASKS_RET(SCX_KF_REST, core_sched_before,
+ (struct task_struct *)a,
+ (struct task_struct *)b);
+ else
+ return time_after64(a->scx.core_sched_at, b->scx.core_sched_at);
+}
+
+/**
+ * pick_task_scx - Pick a candidate task for core-sched
+ * @rq: rq to pick the candidate task from
+ *
+ * Core-sched calls this function on each SMT sibling to determine the next
+ * tasks to run on the SMT siblings. balance_one() has been called on all
+ * siblings and put_prev_task_scx() has been called only for the current CPU.
+ *
+ * As put_prev_task_scx() hasn't been called on remote CPUs, we can't just look
+ * at the first task in the local dsq. @rq->curr has to be considered explicitly
+ * to mimic %SCX_TASK_BAL_KEEP.
+ */
+static struct task_struct *pick_task_scx(struct rq *rq)
+{
+ struct task_struct *curr = rq->curr;
+ struct task_struct *first = first_local_task(rq);
+
+ if (curr->scx.flags & SCX_TASK_QUEUED) {
+ /* is curr the only runnable task? */
+ if (!first)
+ return curr;
+
+ /*
+ * Does curr trump first? We can always go by core_sched_at for
+ * this comparison as it represents global FIFO ordering when
+ * the default core-sched ordering is used and local-DSQ FIFO
+ * ordering otherwise.
+ *
+ * We can have a task with an earlier timestamp on the DSQ. For
+ * example, when a current task is preempted by a sibling
+ * picking a different cookie, the task would be requeued at the
+ * head of the local DSQ with an earlier timestamp than the
+ * core-sched picked next task. Besides, the BPF scheduler may
+ * dispatch any tasks to the local DSQ anytime.
+ */
+ if (curr->scx.slice && time_before64(curr->scx.core_sched_at,
+ first->scx.core_sched_at))
+ return curr;
+ }
+
+ return first; /* this may be %NULL */
+}
+#endif /* CONFIG_SCHED_CORE */
+
static enum scx_cpu_preempt_reason
preempt_reason_from_class(const struct sched_class *class)
{
@@ -2862,13 +3078,15 @@ static void task_tick_scx(struct rq *rq, struct task_struct *curr, int queued)
update_curr_scx(rq);

/*
- * While bypassing, always resched as we can't trust the slice
- * management.
+ * While disabling, always resched and refresh core-sched timestamp as
+ * we can't trust the slice management or ops.core_sched_before().
*/
- if (scx_ops_bypassing())
+ if (scx_ops_bypassing()) {
curr->scx.slice = 0;
- else if (SCX_HAS_OP(tick))
+ touch_core_sched(rq, curr);
+ } else if (SCX_HAS_OP(tick)) {
SCX_CALL_OP(SCX_KF_REST, tick, curr);
+ }

if (!curr->scx.slice)
resched_curr(rq);
@@ -3419,6 +3637,10 @@ DEFINE_SCHED_CLASS(ext) = {
.rq_offline = rq_offline_scx,
#endif

+#ifdef CONFIG_SCHED_CORE
+ .pick_task = pick_task_scx,
+#endif
+
.task_tick = task_tick_scx,

.switching_to = switching_to_scx,
@@ -3757,12 +3979,14 @@ bool task_should_scx(struct task_struct *p)
*
* c. balance_scx() never sets %SCX_TASK_BAL_KEEP as the slice value can't be
* trusted. Whenever a tick triggers, the running task is rotated to the tail
- * of the queue.
+ * of the queue with core_sched_at touched.
*
* d. pick_next_task() suppresses zero slice warning.
*
* e. scx_bpf_kick_cpu() is disabled to avoid irq_work malfunction during PM
* operations.
+ *
+ * f. scx_prio_less() reverts to the default core_sched_at order.
*/
static void scx_ops_bypass(bool bypass)
{
@@ -4767,6 +4991,7 @@ static void running_stub(struct task_struct *p) {}
static void stopping_stub(struct task_struct *p, bool runnable) {}
static void quiescent_stub(struct task_struct *p, u64 deq_flags) {}
static bool yield_stub(struct task_struct *from, struct task_struct *to) { return false; }
+static bool core_sched_before_stub(struct task_struct *a, struct task_struct *b) { return false; }
static void set_weight_stub(struct task_struct *p, u32 weight) {}
static void set_cpumask_stub(struct task_struct *p, const struct cpumask *mask) {}
static void update_idle_stub(s32 cpu, bool idle) {}
@@ -4799,6 +5024,7 @@ static struct sched_ext_ops __bpf_ops_sched_ext_ops = {
.stopping = stopping_stub,
.quiescent = quiescent_stub,
.yield = yield_stub,
+ .core_sched_before = core_sched_before_stub,
.set_weight = set_weight_stub,
.set_cpumask = set_cpumask_stub,
.update_idle = update_idle_stub,
diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h
index 10f4717839c0..6fd646450065 100644
--- a/kernel/sched/ext.h
+++ b/kernel/sched/ext.h
@@ -77,6 +77,11 @@ static inline const struct sched_class *next_active_class(const struct sched_cla
for_active_class_range(class, (prev_class) > &ext_sched_class ? \
&ext_sched_class : (prev_class), (end_class))

+#ifdef CONFIG_SCHED_CORE
+bool scx_prio_less(const struct task_struct *a, const struct task_struct *b,
+ bool in_fi);
+#endif
+
#else /* CONFIG_SCHED_CLASS_EXT */

#define scx_enabled() false
diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
index c2edc080d7e5..a442031309c0 100644
--- a/tools/sched_ext/scx_qmap.bpf.c
+++ b/tools/sched_ext/scx_qmap.bpf.c
@@ -13,6 +13,7 @@
* - Sleepable per-task storage allocation using ops.prep_enable().
* - Using ops.cpu_release() to handle a higher priority scheduling class taking
* the CPU away.
+ * - Core-sched support.
*
* This scheduler is primarily for demonstration and testing of sched_ext
* features and unlikely to be useful for actual workloads.
@@ -67,9 +68,21 @@ struct {
},
};

+/*
+ * Per-queue sequence numbers to implement core-sched ordering.
+ *
+ * Tail seq is assigned to each queued task and incremented. Head seq tracks the
+ * sequence number of the latest dispatched task. The distance between a
+ * task's seq and the associated queue's head seq is called the queue distance
+ * and used when comparing two tasks for ordering. See qmap_core_sched_before().
+ */
+static u64 core_sched_head_seqs[5];
+static u64 core_sched_tail_seqs[5];
+
/* Per-task scheduling context */
struct task_ctx {
bool force_local; /* Dispatch directly to local_dsq */
+ u64 core_sched_seq;
};

struct {
@@ -93,6 +106,7 @@ struct {

/* Statistics */
u64 nr_enqueued, nr_dispatched, nr_reenqueued, nr_dequeued;
+u64 nr_core_sched_execed;

s32 BPF_STRUCT_OPS(qmap_select_cpu, struct task_struct *p,
s32 prev_cpu, u64 wake_flags)
@@ -159,8 +173,18 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
return;
}

- /* Is select_cpu() is telling us to enqueue locally? */
- if (tctx->force_local) {
+ /*
+ * All enqueued tasks must have their core_sched_seq updated for correct
+ * core-sched ordering, which is why %SCX_OPS_ENQ_LAST is specified in
+ * qmap_ops.flags.
+ */
+ tctx->core_sched_seq = core_sched_tail_seqs[idx]++;
+
+ /*
+ * If qmap_select_cpu() is telling us to or this is the last runnable
+ * task on the CPU, enqueue locally.
+ */
+ if (tctx->force_local || (enq_flags & SCX_ENQ_LAST)) {
tctx->force_local = false;
scx_bpf_dispatch(p, SCX_DSQ_LOCAL, slice_ns, enq_flags);
return;
@@ -204,6 +228,19 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
void BPF_STRUCT_OPS(qmap_dequeue, struct task_struct *p, u64 deq_flags)
{
__sync_fetch_and_add(&nr_dequeued, 1);
+ if (deq_flags & SCX_DEQ_CORE_SCHED_EXEC)
+ __sync_fetch_and_add(&nr_core_sched_execed, 1);
+}
+
+static void update_core_sched_head_seq(struct task_struct *p)
+{
+ struct task_ctx *tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
+ int idx = weight_to_idx(p->scx.weight);
+
+ if (tctx)
+ core_sched_head_seqs[idx] = tctx->core_sched_seq;
+ else
+ scx_bpf_error("task_ctx lookup failed");
}

void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
@@ -258,6 +295,7 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
if (!p)
continue;

+ update_core_sched_head_seq(p);
__sync_fetch_and_add(&nr_dispatched, 1);
scx_bpf_dispatch(p, SHARED_DSQ, slice_ns, 0);
bpf_task_release(p);
@@ -275,6 +313,49 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
}
}

+/*
+ * The distance from the head of the queue scaled by the weight of the queue.
+ * The lower the number, the older the task and the higher the priority.
+ */
+static s64 task_qdist(struct task_struct *p)
+{
+ int idx = weight_to_idx(p->scx.weight);
+ struct task_ctx *tctx;
+ s64 qdist;
+
+ tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
+ if (!tctx) {
+ scx_bpf_error("task_ctx lookup failed");
+ return 0;
+ }
+
+ qdist = tctx->core_sched_seq - core_sched_head_seqs[idx];
+
+ /*
+ * As queue index increments, the priority doubles. The queue w/ index 3
+ * is dispatched twice as frequently as index 2. Reflect the difference by
+ * scaling qdists accordingly. Note that the shift amount needs to be
+ * flipped depending on the sign to avoid flipping priority direction.
+ */
+ if (qdist >= 0)
+ return qdist << (4 - idx);
+ else
+ return qdist << idx;
+}
+
+/*
+ * This is called to determine the task ordering when core-sched is picking
+ * tasks to execute on SMT siblings and should encode about the same ordering as
+ * the regular scheduling path. Use the priority-scaled distances from the head
+ * of the queues to compare the two tasks which should be consistent with the
+ * dispatch path behavior.
+ */
+bool BPF_STRUCT_OPS(qmap_core_sched_before,
+ struct task_struct *a, struct task_struct *b)
+{
+ return task_qdist(a) > task_qdist(b);
+}
+
void BPF_STRUCT_OPS(qmap_cpu_release, s32 cpu, struct scx_cpu_release_args *args)
{
u32 cnt;
@@ -384,11 +465,13 @@ SCX_OPS_DEFINE(qmap_ops,
.enqueue = (void *)qmap_enqueue,
.dequeue = (void *)qmap_dequeue,
.dispatch = (void *)qmap_dispatch,
+ .core_sched_before = (void *)qmap_core_sched_before,
.cpu_release = (void *)qmap_cpu_release,
.init_task = (void *)qmap_init_task,
.cpu_online = (void *)qmap_cpu_online,
.cpu_offline = (void *)qmap_cpu_offline,
.init = (void *)qmap_init,
.exit = (void *)qmap_exit,
+ .flags = SCX_OPS_ENQ_LAST,
.timeout_ms = 5000U,
.name = "qmap");
diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c
index e82f58b5c131..a106ba099e5e 100644
--- a/tools/sched_ext/scx_qmap.c
+++ b/tools/sched_ext/scx_qmap.c
@@ -109,9 +109,10 @@ int main(int argc, char **argv)
long nr_enqueued = skel->bss->nr_enqueued;
long nr_dispatched = skel->bss->nr_dispatched;

- printf("stats : enq=%lu dsp=%lu delta=%ld reenq=%"PRIu64" deq=%"PRIu64"\n",
+ printf("stats : enq=%lu dsp=%lu delta=%ld reenq=%"PRIu64" deq=%"PRIu64" core=%"PRIu64"\n",
nr_enqueued, nr_dispatched, nr_enqueued - nr_dispatched,
- skel->bss->nr_reenqueued, skel->bss->nr_dequeued);
+ skel->bss->nr_reenqueued, skel->bss->nr_dequeued,
+ skel->bss->nr_core_sched_execed);
fflush(stdout);
sleep(1);
}
--
2.44.0


2024-05-01 15:28:31

by Tejun Heo

Subject: [PATCH 39/39] sched_ext: Add selftests

From: David Vernet <[email protected]>

Add basic selftests.

Signed-off-by: David Vernet <[email protected]>
Acked-by: Tejun Heo <[email protected]>
---
tools/testing/selftests/sched_ext/.gitignore | 6 +
tools/testing/selftests/sched_ext/Makefile | 216 ++++++++++++++++++
tools/testing/selftests/sched_ext/config | 9 +
.../sched_ext/ddsp_bogus_dsq_fail.bpf.c | 42 ++++
.../selftests/sched_ext/ddsp_bogus_dsq_fail.c | 57 +++++
.../sched_ext/ddsp_vtimelocal_fail.bpf.c | 39 ++++
.../sched_ext/ddsp_vtimelocal_fail.c | 56 +++++
.../sched_ext/enq_last_no_enq_fails.bpf.c | 21 ++
.../sched_ext/enq_last_no_enq_fails.c | 60 +++++
.../sched_ext/enq_select_cpu_fails.bpf.c | 43 ++++
.../sched_ext/enq_select_cpu_fails.c | 61 +++++
tools/testing/selftests/sched_ext/exit.bpf.c | 84 +++++++
tools/testing/selftests/sched_ext/exit.c | 55 +++++
tools/testing/selftests/sched_ext/exit_test.h | 20 ++
.../testing/selftests/sched_ext/hotplug.bpf.c | 55 +++++
tools/testing/selftests/sched_ext/hotplug.c | 168 ++++++++++++++
.../selftests/sched_ext/hotplug_test.h | 15 ++
.../sched_ext/init_enable_count.bpf.c | 53 +++++
.../selftests/sched_ext/init_enable_count.c | 166 ++++++++++++++
.../testing/selftests/sched_ext/maximal.bpf.c | 164 +++++++++++++
tools/testing/selftests/sched_ext/maximal.c | 51 +++++
.../selftests/sched_ext/maybe_null.bpf.c | 26 +++
.../testing/selftests/sched_ext/maybe_null.c | 40 ++++
.../selftests/sched_ext/maybe_null_fail.bpf.c | 25 ++
.../testing/selftests/sched_ext/minimal.bpf.c | 21 ++
tools/testing/selftests/sched_ext/minimal.c | 58 +++++
.../selftests/sched_ext/prog_run.bpf.c | 32 +++
tools/testing/selftests/sched_ext/prog_run.c | 78 +++++++
.../testing/selftests/sched_ext/reload_loop.c | 75 ++++++
tools/testing/selftests/sched_ext/runner.c | 201 ++++++++++++++++
tools/testing/selftests/sched_ext/scx_test.h | 131 +++++++++++
.../selftests/sched_ext/select_cpu_dfl.bpf.c | 40 ++++
.../selftests/sched_ext/select_cpu_dfl.c | 72 ++++++
.../sched_ext/select_cpu_dfl_nodispatch.bpf.c | 89 ++++++++
.../sched_ext/select_cpu_dfl_nodispatch.c | 72 ++++++
.../sched_ext/select_cpu_dispatch.bpf.c | 41 ++++
.../selftests/sched_ext/select_cpu_dispatch.c | 70 ++++++
.../select_cpu_dispatch_bad_dsq.bpf.c | 37 +++
.../sched_ext/select_cpu_dispatch_bad_dsq.c | 56 +++++
.../select_cpu_dispatch_dbl_dsp.bpf.c | 38 +++
.../sched_ext/select_cpu_dispatch_dbl_dsp.c | 56 +++++
.../sched_ext/select_cpu_vtime.bpf.c | 92 ++++++++
.../selftests/sched_ext/select_cpu_vtime.c | 59 +++++
.../selftests/sched_ext/test_example.c | 49 ++++
tools/testing/selftests/sched_ext/util.c | 71 ++++++
tools/testing/selftests/sched_ext/util.h | 13 ++
46 files changed, 2983 insertions(+)
create mode 100644 tools/testing/selftests/sched_ext/.gitignore
create mode 100644 tools/testing/selftests/sched_ext/Makefile
create mode 100644 tools/testing/selftests/sched_ext/config
create mode 100644 tools/testing/selftests/sched_ext/ddsp_bogus_dsq_fail.bpf.c
create mode 100644 tools/testing/selftests/sched_ext/ddsp_bogus_dsq_fail.c
create mode 100644 tools/testing/selftests/sched_ext/ddsp_vtimelocal_fail.bpf.c
create mode 100644 tools/testing/selftests/sched_ext/ddsp_vtimelocal_fail.c
create mode 100644 tools/testing/selftests/sched_ext/enq_last_no_enq_fails.bpf.c
create mode 100644 tools/testing/selftests/sched_ext/enq_last_no_enq_fails.c
create mode 100644 tools/testing/selftests/sched_ext/enq_select_cpu_fails.bpf.c
create mode 100644 tools/testing/selftests/sched_ext/enq_select_cpu_fails.c
create mode 100644 tools/testing/selftests/sched_ext/exit.bpf.c
create mode 100644 tools/testing/selftests/sched_ext/exit.c
create mode 100644 tools/testing/selftests/sched_ext/exit_test.h
create mode 100644 tools/testing/selftests/sched_ext/hotplug.bpf.c
create mode 100644 tools/testing/selftests/sched_ext/hotplug.c
create mode 100644 tools/testing/selftests/sched_ext/hotplug_test.h
create mode 100644 tools/testing/selftests/sched_ext/init_enable_count.bpf.c
create mode 100644 tools/testing/selftests/sched_ext/init_enable_count.c
create mode 100644 tools/testing/selftests/sched_ext/maximal.bpf.c
create mode 100644 tools/testing/selftests/sched_ext/maximal.c
create mode 100644 tools/testing/selftests/sched_ext/maybe_null.bpf.c
create mode 100644 tools/testing/selftests/sched_ext/maybe_null.c
create mode 100644 tools/testing/selftests/sched_ext/maybe_null_fail.bpf.c
create mode 100644 tools/testing/selftests/sched_ext/minimal.bpf.c
create mode 100644 tools/testing/selftests/sched_ext/minimal.c
create mode 100644 tools/testing/selftests/sched_ext/prog_run.bpf.c
create mode 100644 tools/testing/selftests/sched_ext/prog_run.c
create mode 100644 tools/testing/selftests/sched_ext/reload_loop.c
create mode 100644 tools/testing/selftests/sched_ext/runner.c
create mode 100644 tools/testing/selftests/sched_ext/scx_test.h
create mode 100644 tools/testing/selftests/sched_ext/select_cpu_dfl.bpf.c
create mode 100644 tools/testing/selftests/sched_ext/select_cpu_dfl.c
create mode 100644 tools/testing/selftests/sched_ext/select_cpu_dfl_nodispatch.bpf.c
create mode 100644 tools/testing/selftests/sched_ext/select_cpu_dfl_nodispatch.c
create mode 100644 tools/testing/selftests/sched_ext/select_cpu_dispatch.bpf.c
create mode 100644 tools/testing/selftests/sched_ext/select_cpu_dispatch.c
create mode 100644 tools/testing/selftests/sched_ext/select_cpu_dispatch_bad_dsq.bpf.c
create mode 100644 tools/testing/selftests/sched_ext/select_cpu_dispatch_bad_dsq.c
create mode 100644 tools/testing/selftests/sched_ext/select_cpu_dispatch_dbl_dsp.bpf.c
create mode 100644 tools/testing/selftests/sched_ext/select_cpu_dispatch_dbl_dsp.c
create mode 100644 tools/testing/selftests/sched_ext/select_cpu_vtime.bpf.c
create mode 100644 tools/testing/selftests/sched_ext/select_cpu_vtime.c
create mode 100644 tools/testing/selftests/sched_ext/test_example.c
create mode 100644 tools/testing/selftests/sched_ext/util.c
create mode 100644 tools/testing/selftests/sched_ext/util.h

diff --git a/tools/testing/selftests/sched_ext/.gitignore b/tools/testing/selftests/sched_ext/.gitignore
new file mode 100644
index 000000000000..ae5491a114c0
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/.gitignore
@@ -0,0 +1,6 @@
+*
+!*.c
+!*.h
+!Makefile
+!.gitignore
+!config
diff --git a/tools/testing/selftests/sched_ext/Makefile b/tools/testing/selftests/sched_ext/Makefile
new file mode 100644
index 000000000000..a95fa2a1adad
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/Makefile
@@ -0,0 +1,216 @@
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+include ../../../build/Build.include
+include ../../../scripts/Makefile.arch
+include ../../../scripts/Makefile.include
+include ../lib.mk
+
+ifneq ($(LLVM),)
+ifneq ($(filter %/,$(LLVM)),)
+LLVM_PREFIX := $(LLVM)
+else ifneq ($(filter -%,$(LLVM)),)
+LLVM_SUFFIX := $(LLVM)
+endif
+
+CC := $(LLVM_PREFIX)clang$(LLVM_SUFFIX) $(CLANG_FLAGS) -fintegrated-as
+else
+CC := gcc
+endif # LLVM
+
+ifneq ($(CROSS_COMPILE),)
+$(error CROSS_COMPILE not supported for scx selftests)
+endif # CROSS_COMPILE
+
+CURDIR := $(abspath .)
+REPOROOT := $(abspath ../../../..)
+TOOLSDIR := $(REPOROOT)/tools
+LIBDIR := $(TOOLSDIR)/lib
+BPFDIR := $(LIBDIR)/bpf
+TOOLSINCDIR := $(TOOLSDIR)/include
+BPFTOOLDIR := $(TOOLSDIR)/bpf/bpftool
+APIDIR := $(TOOLSINCDIR)/uapi
+GENDIR := $(REPOROOT)/include/generated
+GENHDR := $(GENDIR)/autoconf.h
+SCXTOOLSDIR := $(TOOLSDIR)/sched_ext
+SCXTOOLSINCDIR := $(TOOLSDIR)/sched_ext/include
+
+OUTPUT_DIR := $(CURDIR)/build
+OBJ_DIR := $(OUTPUT_DIR)/obj
+INCLUDE_DIR := $(OUTPUT_DIR)/include
+BPFOBJ_DIR := $(OBJ_DIR)/libbpf
+SCXOBJ_DIR := $(OBJ_DIR)/sched_ext
+BPFOBJ := $(BPFOBJ_DIR)/libbpf.a
+LIBBPF_OUTPUT := $(OBJ_DIR)/libbpf/libbpf.a
+DEFAULT_BPFTOOL := $(OUTPUT_DIR)/sbin/bpftool
+HOST_BUILD_DIR := $(OBJ_DIR)
+HOST_OUTPUT_DIR := $(OUTPUT_DIR)
+
+VMLINUX_BTF_PATHS ?= ../../../../vmlinux \
+ /sys/kernel/btf/vmlinux \
+ /boot/vmlinux-$(shell uname -r)
+VMLINUX_BTF ?= $(abspath $(firstword $(wildcard $(VMLINUX_BTF_PATHS))))
+ifeq ($(VMLINUX_BTF),)
+$(error Cannot find a vmlinux for VMLINUX_BTF at any of "$(VMLINUX_BTF_PATHS)")
+endif
+
+BPFTOOL ?= $(DEFAULT_BPFTOOL)
+
+ifneq ($(wildcard $(GENHDR)),)
+ GENFLAGS := -DHAVE_GENHDR
+endif
+
+CFLAGS += -g -O2 -rdynamic -pthread -Wall -Werror $(GENFLAGS) \
+ -I$(INCLUDE_DIR) -I$(GENDIR) -I$(LIBDIR) \
+ -I$(TOOLSINCDIR) -I$(APIDIR) -I$(CURDIR)/include -I$(SCXTOOLSINCDIR)
+
+# Silence some warnings when compiled with clang
+ifneq ($(LLVM),)
+CFLAGS += -Wno-unused-command-line-argument
+endif
+
+LDFLAGS = -lelf -lz -lpthread -lzstd
+
+IS_LITTLE_ENDIAN = $(shell $(CC) -dM -E - </dev/null | \
+ grep 'define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__')
+
+# Get Clang's default includes on this system, as opposed to those seen by
+# '-target bpf'. This fixes "missing" files on some architectures/distros,
+# such as asm/byteorder.h, asm/socket.h, asm/sockios.h, sys/cdefs.h etc.
+#
+# Use '-idirafter': Don't interfere with include mechanics except where the
+# build would have failed anyways.
+define get_sys_includes
+$(shell $(1) -v -E - </dev/null 2>&1 \
+ | sed -n '/<...> search starts here:/,/End of search list./{ s| \(/.*\)|-idirafter \1|p }') \
+$(shell $(1) -dM -E - </dev/null | grep '__riscv_xlen ' | awk '{printf("-D__riscv_xlen=%d -D__BITS_PER_LONG=%d", $$3, $$3)}')
+endef
+
+BPF_CFLAGS = -g -D__TARGET_ARCH_$(SRCARCH) \
+ $(if $(IS_LITTLE_ENDIAN),-mlittle-endian,-mbig-endian) \
+ -I$(CURDIR)/include -I$(CURDIR)/include/bpf-compat \
+ -I$(INCLUDE_DIR) -I$(APIDIR) -I$(SCXTOOLSINCDIR) \
+ -I$(REPOROOT)/include \
+ $(call get_sys_includes,$(CLANG)) \
+ -Wall -Wno-compare-distinct-pointer-types \
+ -Wno-incompatible-function-pointer-types \
+ -O2 -mcpu=v3
+
+# sort removes libbpf duplicates when not cross-building
+MAKE_DIRS := $(sort $(OBJ_DIR)/libbpf $(OBJ_DIR)/libbpf \
+ $(OBJ_DIR)/bpftool $(OBJ_DIR)/resolve_btfids \
+ $(INCLUDE_DIR) $(SCXOBJ_DIR))
+
+$(MAKE_DIRS):
+ $(call msg,MKDIR,,$@)
+ $(Q)mkdir -p $@
+
+$(BPFOBJ): $(wildcard $(BPFDIR)/*.[ch] $(BPFDIR)/Makefile) \
+ $(APIDIR)/linux/bpf.h \
+ | $(OBJ_DIR)/libbpf
+ $(Q)$(MAKE) $(submake_extras) -C $(BPFDIR) OUTPUT=$(OBJ_DIR)/libbpf/ \
+ EXTRA_CFLAGS='-g -O0 -fPIC' \
+ DESTDIR=$(OUTPUT_DIR) prefix= all install_headers
+
+$(DEFAULT_BPFTOOL): $(wildcard $(BPFTOOLDIR)/*.[ch] $(BPFTOOLDIR)/Makefile) \
+ $(LIBBPF_OUTPUT) | $(OBJ_DIR)/bpftool
+ $(Q)$(MAKE) $(submake_extras) -C $(BPFTOOLDIR) \
+ ARCH= CROSS_COMPILE= CC=$(HOSTCC) LD=$(HOSTLD) \
+ EXTRA_CFLAGS='-g -O0' \
+ OUTPUT=$(OBJ_DIR)/bpftool/ \
+ LIBBPF_OUTPUT=$(OBJ_DIR)/libbpf/ \
+ LIBBPF_DESTDIR=$(OUTPUT_DIR)/ \
+ prefix= DESTDIR=$(OUTPUT_DIR)/ install-bin
+
+$(INCLUDE_DIR)/vmlinux.h: $(VMLINUX_BTF) $(BPFTOOL) | $(INCLUDE_DIR)
+ifeq ($(VMLINUX_H),)
+ $(call msg,GEN,,$@)
+ $(Q)$(BPFTOOL) btf dump file $(VMLINUX_BTF) format c > $@
+else
+ $(call msg,CP,,$@)
+ $(Q)cp "$(VMLINUX_H)" $@
+endif
+
+$(SCXOBJ_DIR)/%.bpf.o: %.bpf.c $(INCLUDE_DIR)/vmlinux.h | $(BPFOBJ) $(SCXOBJ_DIR)
+ $(call msg,CLNG-BPF,,$(notdir $@))
+ $(Q)$(CLANG) $(BPF_CFLAGS) -target bpf -c $< -o $@
+
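+# The BPF object is passed through 'bpftool gen object' several times and the
+# last two passes are diff'ed to verify that repeated linking converges to a
+# stable result before the skeleton is generated. This follows the same
+# pattern used by the BPF selftests Makefile.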
+$(INCLUDE_DIR)/%.bpf.skel.h: $(SCXOBJ_DIR)/%.bpf.o $(INCLUDE_DIR)/vmlinux.h $(BPFTOOL) | $(INCLUDE_DIR)
+ $(eval sched=$(notdir $@))
+ $(call msg,GEN-SKEL,,$(sched))
+ $(Q)$(BPFTOOL) gen object $(<:.o=.linked1.o) $<
+ $(Q)$(BPFTOOL) gen object $(<:.o=.linked2.o) $(<:.o=.linked1.o)
+ $(Q)$(BPFTOOL) gen object $(<:.o=.linked3.o) $(<:.o=.linked2.o)
+ $(Q)diff $(<:.o=.linked2.o) $(<:.o=.linked3.o)
+ $(Q)$(BPFTOOL) gen skeleton $(<:.o=.linked3.o) name $(subst .bpf.skel.h,,$(sched)) > $@
+ $(Q)$(BPFTOOL) gen subskeleton $(<:.o=.linked3.o) name $(subst .bpf.skel.h,,$(sched)) > $(@:.skel.h=.subskel.h)
+
+################
+# C schedulers #
+################
+
+override define CLEAN
+ rm -rf $(OUTPUT_DIR)
+ rm -f *.o *.bpf.o *.bpf.skel.h *.bpf.subskel.h
+ rm -f $(TEST_GEN_PROGS)
+ rm -f runner
+endef
+
+# Every testcase takes all of the BPF progs as dependencies by default. This
+# allows testcases to load any BPF scheduler, which is useful for testcases
+# that don't need their own prog to run their test.
+all_test_bpfprogs := $(foreach prog,$(wildcard *.bpf.c),$(INCLUDE_DIR)/$(patsubst %.c,%.skel.h,$(prog)))
+
+auto-test-targets := \
+ enq_last_no_enq_fails \
+ enq_select_cpu_fails \
+ ddsp_bogus_dsq_fail \
+ ddsp_vtimelocal_fail \
+ exit \
+ hotplug \
+ init_enable_count \
+ maximal \
+ maybe_null \
+ minimal \
+ prog_run \
+ reload_loop \
+ select_cpu_dfl \
+ select_cpu_dfl_nodispatch \
+ select_cpu_dispatch \
+ select_cpu_dispatch_bad_dsq \
+ select_cpu_dispatch_dbl_dsp \
+ select_cpu_vtime \
+ test_example \
+
+testcase-targets := $(addsuffix .o,$(addprefix $(SCXOBJ_DIR)/,$(auto-test-targets)))
+
+$(SCXOBJ_DIR)/runner.o: runner.c | $(SCXOBJ_DIR)
+ $(CC) $(CFLAGS) -c $< -o $@
+
+# Create all of the test targets' object files, whose testcases will be
+# registered into the runner via ELF constructors.
+#
+# Note that we must do double expansion here in order to support conditionally
+# compiling BPF object files only if one is present, as the wildcard Make
+# function doesn't support using implicit rules otherwise.
+$(testcase-targets): $(SCXOBJ_DIR)/%.o: %.c $(SCXOBJ_DIR)/runner.o $(all_test_bpfprogs) | $(SCXOBJ_DIR)
+ $(eval test=$(patsubst %.o,%.c,$(notdir $@)))
+ $(CC) $(CFLAGS) -c $< -o $@ $(SCXOBJ_DIR)/runner.o
+
+$(SCXOBJ_DIR)/util.o: util.c | $(SCXOBJ_DIR)
+ $(CC) $(CFLAGS) -c $< -o $@
+
+runner: $(SCXOBJ_DIR)/runner.o $(SCXOBJ_DIR)/util.o $(BPFOBJ) $(testcase-targets)
+ @echo "$(testcase-targets)"
+ $(CC) $(CFLAGS) -o $@ $^ $(LDFLAGS)
+
+TEST_GEN_PROGS := runner
+
+all: runner
+
+.PHONY: all clean help
+
+.DEFAULT_GOAL := all
+
+.DELETE_ON_ERROR:
+
+.SECONDARY:
diff --git a/tools/testing/selftests/sched_ext/config b/tools/testing/selftests/sched_ext/config
new file mode 100644
index 000000000000..0de9b4ee249d
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/config
@@ -0,0 +1,9 @@
+CONFIG_SCHED_DEBUG=y
+CONFIG_SCHED_CLASS_EXT=y
+CONFIG_CGROUPS=y
+CONFIG_CGROUP_SCHED=y
+CONFIG_EXT_GROUP_SCHED=y
+CONFIG_BPF=y
+CONFIG_BPF_SYSCALL=y
+CONFIG_DEBUG_INFO=y
+CONFIG_DEBUG_INFO_BTF=y
diff --git a/tools/testing/selftests/sched_ext/ddsp_bogus_dsq_fail.bpf.c b/tools/testing/selftests/sched_ext/ddsp_bogus_dsq_fail.bpf.c
new file mode 100644
index 000000000000..e97ad41d354a
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/ddsp_bogus_dsq_fail.bpf.c
@@ -0,0 +1,42 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2024 David Vernet <[email protected]>
+ * Copyright (c) 2024 Tejun Heo <[email protected]>
+ */
+#include <scx/common.bpf.h>
+
+char _license[] SEC("license") = "GPL";
+
+UEI_DEFINE(uei);
+
+s32 BPF_STRUCT_OPS(ddsp_bogus_dsq_fail_select_cpu, struct task_struct *p,
+ s32 prev_cpu, u64 wake_flags)
+{
+ s32 cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
+
+ if (cpu >= 0) {
+ /*
+		 * Dispatching to a bogus DSQ should make the kernel fall back
+		 * to the builtin global DSQ and exit the scheduler gracefully
+		 * with an error.
+ */
+ scx_bpf_dispatch_vtime(p, 0xcafef00d, SCX_SLICE_DFL,
+ p->scx.dsq_vtime, 0);
+ return cpu;
+ }
+
+ return prev_cpu;
+}
+
+void BPF_STRUCT_OPS(ddsp_bogus_dsq_fail_exit, struct scx_exit_info *ei)
+{
+ UEI_RECORD(uei, ei);
+}
+
+SEC(".struct_ops.link")
+struct sched_ext_ops ddsp_bogus_dsq_fail_ops = {
+ .select_cpu = ddsp_bogus_dsq_fail_select_cpu,
+ .exit = ddsp_bogus_dsq_fail_exit,
+ .name = "ddsp_bogus_dsq_fail",
+ .timeout_ms = 1000U,
+};
diff --git a/tools/testing/selftests/sched_ext/ddsp_bogus_dsq_fail.c b/tools/testing/selftests/sched_ext/ddsp_bogus_dsq_fail.c
new file mode 100644
index 000000000000..e65d22f23f3b
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/ddsp_bogus_dsq_fail.c
@@ -0,0 +1,57 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2024 David Vernet <[email protected]>
+ * Copyright (c) 2024 Tejun Heo <[email protected]>
+ */
+#include <bpf/bpf.h>
+#include <scx/common.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include "ddsp_bogus_dsq_fail.bpf.skel.h"
+#include "scx_test.h"
+
+static enum scx_test_status setup(void **ctx)
+{
+ struct ddsp_bogus_dsq_fail *skel;
+
+ skel = ddsp_bogus_dsq_fail__open_and_load();
+ SCX_FAIL_IF(!skel, "Failed to open and load skel");
+ *ctx = skel;
+
+ return SCX_TEST_PASS;
+}
+
+static enum scx_test_status run(void *ctx)
+{
+ struct ddsp_bogus_dsq_fail *skel = ctx;
+ struct bpf_link *link;
+
+ link = bpf_map__attach_struct_ops(skel->maps.ddsp_bogus_dsq_fail_ops);
+ SCX_FAIL_IF(!link, "Failed to attach struct_ops");
+
+ sleep(1);
+
+ SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_ERROR));
+ bpf_link__destroy(link);
+
+ return SCX_TEST_PASS;
+}
+
+static void cleanup(void *ctx)
+{
+ struct ddsp_bogus_dsq_fail *skel = ctx;
+
+ ddsp_bogus_dsq_fail__destroy(skel);
+}
+
+struct scx_test ddsp_bogus_dsq_fail = {
+ .name = "ddsp_bogus_dsq_fail",
+ .description = "Verify we gracefully fail, and fall back to using a "
+ "built-in DSQ, if we do a direct dispatch to an invalid"
+ " DSQ in ops.select_cpu()",
+ .setup = setup,
+ .run = run,
+ .cleanup = cleanup,
+};
+REGISTER_SCX_TEST(&ddsp_bogus_dsq_fail)
diff --git a/tools/testing/selftests/sched_ext/ddsp_vtimelocal_fail.bpf.c b/tools/testing/selftests/sched_ext/ddsp_vtimelocal_fail.bpf.c
new file mode 100644
index 000000000000..dde7e7dafbfb
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/ddsp_vtimelocal_fail.bpf.c
@@ -0,0 +1,39 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2024 David Vernet <[email protected]>
+ * Copyright (c) 2024 Tejun Heo <[email protected]>
+ */
+#include <scx/common.bpf.h>
+
+char _license[] SEC("license") = "GPL";
+
+UEI_DEFINE(uei);
+
+s32 BPF_STRUCT_OPS(ddsp_vtimelocal_fail_select_cpu, struct task_struct *p,
+ s32 prev_cpu, u64 wake_flags)
+{
+ s32 cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
+
+ if (cpu >= 0) {
+ /* Shouldn't be allowed to vtime dispatch to a builtin DSQ. */
+ scx_bpf_dispatch_vtime(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL,
+ p->scx.dsq_vtime, 0);
+ return cpu;
+ }
+
+ return prev_cpu;
+}
+
+void BPF_STRUCT_OPS(ddsp_vtimelocal_fail_exit, struct scx_exit_info *ei)
+{
+ UEI_RECORD(uei, ei);
+}
+
+SEC(".struct_ops.link")
+struct sched_ext_ops ddsp_vtimelocal_fail_ops = {
+ .select_cpu = ddsp_vtimelocal_fail_select_cpu,
+ .exit = ddsp_vtimelocal_fail_exit,
+ .name = "ddsp_vtimelocal_fail",
+ .timeout_ms = 1000U,
+};
diff --git a/tools/testing/selftests/sched_ext/ddsp_vtimelocal_fail.c b/tools/testing/selftests/sched_ext/ddsp_vtimelocal_fail.c
new file mode 100644
index 000000000000..abafee587cd6
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/ddsp_vtimelocal_fail.c
@@ -0,0 +1,56 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2024 David Vernet <[email protected]>
+ * Copyright (c) 2024 Tejun Heo <[email protected]>
+ */
+#include <bpf/bpf.h>
+#include <scx/common.h>
+#include <unistd.h>
+#include "ddsp_vtimelocal_fail.bpf.skel.h"
+#include "scx_test.h"
+
+static enum scx_test_status setup(void **ctx)
+{
+ struct ddsp_vtimelocal_fail *skel;
+
+ skel = ddsp_vtimelocal_fail__open_and_load();
+ SCX_FAIL_IF(!skel, "Failed to open and load skel");
+ *ctx = skel;
+
+ return SCX_TEST_PASS;
+}
+
+static enum scx_test_status run(void *ctx)
+{
+ struct ddsp_vtimelocal_fail *skel = ctx;
+ struct bpf_link *link;
+
+ link = bpf_map__attach_struct_ops(skel->maps.ddsp_vtimelocal_fail_ops);
+ SCX_FAIL_IF(!link, "Failed to attach struct_ops");
+
+ sleep(1);
+
+ SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_ERROR));
+ bpf_link__destroy(link);
+
+ return SCX_TEST_PASS;
+}
+
+static void cleanup(void *ctx)
+{
+ struct ddsp_vtimelocal_fail *skel = ctx;
+
+ ddsp_vtimelocal_fail__destroy(skel);
+}
+
+struct scx_test ddsp_vtimelocal_fail = {
+ .name = "ddsp_vtimelocal_fail",
+ .description = "Verify we gracefully fail, and fall back to using a "
+ "built-in DSQ, if we do a direct vtime dispatch to a "
+		       "built-in DSQ from ops.select_cpu()",
+ .setup = setup,
+ .run = run,
+ .cleanup = cleanup,
+};
+REGISTER_SCX_TEST(&ddsp_vtimelocal_fail)
diff --git a/tools/testing/selftests/sched_ext/enq_last_no_enq_fails.bpf.c b/tools/testing/selftests/sched_ext/enq_last_no_enq_fails.bpf.c
new file mode 100644
index 000000000000..b0b99531d5d5
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/enq_last_no_enq_fails.bpf.c
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * A scheduler that verifies that loading fails if the SCX_OPS_ENQ_LAST flag
+ * is specified without defining ops.enqueue().
+ *
+ * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2023 David Vernet <[email protected]>
+ * Copyright (c) 2023 Tejun Heo <[email protected]>
+ */
+
+#include <scx/common.bpf.h>
+
+char _license[] SEC("license") = "GPL";
+
+SEC(".struct_ops.link")
+struct sched_ext_ops enq_last_no_enq_fails_ops = {
+ .name = "enq_last_no_enq_fails",
+ /* Need to define ops.enqueue() with SCX_OPS_ENQ_LAST */
+ .flags = SCX_OPS_ENQ_LAST,
+ .timeout_ms = 1000U,
+};
diff --git a/tools/testing/selftests/sched_ext/enq_last_no_enq_fails.c b/tools/testing/selftests/sched_ext/enq_last_no_enq_fails.c
new file mode 100644
index 000000000000..2a3eda5e2c0b
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/enq_last_no_enq_fails.c
@@ -0,0 +1,60 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2023 David Vernet <[email protected]>
+ * Copyright (c) 2023 Tejun Heo <[email protected]>
+ */
+#include <bpf/bpf.h>
+#include <scx/common.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include "enq_last_no_enq_fails.bpf.skel.h"
+#include "scx_test.h"
+
+static enum scx_test_status setup(void **ctx)
+{
+ struct enq_last_no_enq_fails *skel;
+
+ skel = enq_last_no_enq_fails__open_and_load();
+ if (!skel) {
+ SCX_ERR("Failed to open and load skel");
+ return SCX_TEST_FAIL;
+ }
+ *ctx = skel;
+
+ return SCX_TEST_PASS;
+}
+
+static enum scx_test_status run(void *ctx)
+{
+ struct enq_last_no_enq_fails *skel = ctx;
+ struct bpf_link *link;
+
+ link = bpf_map__attach_struct_ops(skel->maps.enq_last_no_enq_fails_ops);
+ if (link) {
+		SCX_ERR("Incorrectly succeeded in attaching scheduler");
+ return SCX_TEST_FAIL;
+ }
+
+ bpf_link__destroy(link);
+
+ return SCX_TEST_PASS;
+}
+
+static void cleanup(void *ctx)
+{
+ struct enq_last_no_enq_fails *skel = ctx;
+
+ enq_last_no_enq_fails__destroy(skel);
+}
+
+struct scx_test enq_last_no_enq_fails = {
+ .name = "enq_last_no_enq_fails",
+ .description = "Verify we fail to load a scheduler if we specify "
+ "the SCX_OPS_ENQ_LAST flag without defining "
+ "ops.enqueue()",
+ .setup = setup,
+ .run = run,
+ .cleanup = cleanup,
+};
+REGISTER_SCX_TEST(&enq_last_no_enq_fails)
diff --git a/tools/testing/selftests/sched_ext/enq_select_cpu_fails.bpf.c b/tools/testing/selftests/sched_ext/enq_select_cpu_fails.bpf.c
new file mode 100644
index 000000000000..b3dfc1033cd6
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/enq_select_cpu_fails.bpf.c
@@ -0,0 +1,43 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2023 David Vernet <[email protected]>
+ * Copyright (c) 2023 Tejun Heo <[email protected]>
+ */
+
+#include <scx/common.bpf.h>
+
+char _license[] SEC("license") = "GPL";
+
+/* Manually specify the signature until the kfunc is added to the scx repo. */
+s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags,
+ bool *found) __ksym;
+
+s32 BPF_STRUCT_OPS(enq_select_cpu_fails_select_cpu, struct task_struct *p,
+ s32 prev_cpu, u64 wake_flags)
+{
+ return prev_cpu;
+}
+
+void BPF_STRUCT_OPS(enq_select_cpu_fails_enqueue, struct task_struct *p,
+ u64 enq_flags)
+{
+ /*
+ * Need to initialize the variable or the verifier will fail to load.
+ * Improving these semantics is actively being worked on.
+ */
+ bool found = false;
+
+ /* Can only call from ops.select_cpu() */
+ scx_bpf_select_cpu_dfl(p, 0, 0, &found);
+
+ scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
+}
+
+SEC(".struct_ops.link")
+struct sched_ext_ops enq_select_cpu_fails_ops = {
+ .select_cpu = enq_select_cpu_fails_select_cpu,
+ .enqueue = enq_select_cpu_fails_enqueue,
+ .name = "enq_select_cpu_fails",
+ .timeout_ms = 1000U,
+};
diff --git a/tools/testing/selftests/sched_ext/enq_select_cpu_fails.c b/tools/testing/selftests/sched_ext/enq_select_cpu_fails.c
new file mode 100644
index 000000000000..dd1350e5f002
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/enq_select_cpu_fails.c
@@ -0,0 +1,61 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2023 David Vernet <[email protected]>
+ * Copyright (c) 2023 Tejun Heo <[email protected]>
+ */
+#include <bpf/bpf.h>
+#include <scx/common.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include "enq_select_cpu_fails.bpf.skel.h"
+#include "scx_test.h"
+
+static enum scx_test_status setup(void **ctx)
+{
+ struct enq_select_cpu_fails *skel;
+
+ skel = enq_select_cpu_fails__open_and_load();
+ if (!skel) {
+ SCX_ERR("Failed to open and load skel");
+ return SCX_TEST_FAIL;
+ }
+ *ctx = skel;
+
+ return SCX_TEST_PASS;
+}
+
+static enum scx_test_status run(void *ctx)
+{
+ struct enq_select_cpu_fails *skel = ctx;
+ struct bpf_link *link;
+
+ link = bpf_map__attach_struct_ops(skel->maps.enq_select_cpu_fails_ops);
+ if (!link) {
+ SCX_ERR("Failed to attach scheduler");
+ return SCX_TEST_FAIL;
+ }
+
+ sleep(1);
+
+ bpf_link__destroy(link);
+
+ return SCX_TEST_PASS;
+}
+
+static void cleanup(void *ctx)
+{
+ struct enq_select_cpu_fails *skel = ctx;
+
+ enq_select_cpu_fails__destroy(skel);
+}
+
+struct scx_test enq_select_cpu_fails = {
+ .name = "enq_select_cpu_fails",
+ .description = "Verify we fail to call scx_bpf_select_cpu_dfl() "
+ "from ops.enqueue()",
+ .setup = setup,
+ .run = run,
+ .cleanup = cleanup,
+};
+REGISTER_SCX_TEST(&enq_select_cpu_fails)
diff --git a/tools/testing/selftests/sched_ext/exit.bpf.c b/tools/testing/selftests/sched_ext/exit.bpf.c
new file mode 100644
index 000000000000..ae12ddaac921
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/exit.bpf.c
@@ -0,0 +1,84 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2024 David Vernet <[email protected]>
+ */
+
+#include <scx/common.bpf.h>
+
+char _license[] SEC("license") = "GPL";
+
+#include "exit_test.h"
+
+const volatile int exit_point;
+UEI_DEFINE(uei);
+
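+/*
+ * Exit via scx_bpf_exit() from the requested callback. The exit point is used
+ * as both the exit code and the message so that userspace can verify exactly
+ * where the exit was triggered.
+ */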
+#define EXIT_CLEANLY() scx_bpf_exit(exit_point, "%d", exit_point)
+
+s32 BPF_STRUCT_OPS(exit_select_cpu, struct task_struct *p,
+ s32 prev_cpu, u64 wake_flags)
+{
+ bool found;
+
+ if (exit_point == EXIT_SELECT_CPU)
+ EXIT_CLEANLY();
+
+ return scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &found);
+}
+
+void BPF_STRUCT_OPS(exit_enqueue, struct task_struct *p, u64 enq_flags)
+{
+ if (exit_point == EXIT_ENQUEUE)
+ EXIT_CLEANLY();
+
+ scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
+}
+
+void BPF_STRUCT_OPS(exit_dispatch, s32 cpu, struct task_struct *p)
+{
+ if (exit_point == EXIT_DISPATCH)
+ EXIT_CLEANLY();
+
+ scx_bpf_consume(SCX_DSQ_GLOBAL);
+}
+
+void BPF_STRUCT_OPS(exit_enable, struct task_struct *p)
+{
+ if (exit_point == EXIT_ENABLE)
+ EXIT_CLEANLY();
+}
+
+s32 BPF_STRUCT_OPS(exit_init_task, struct task_struct *p,
+ struct scx_init_task_args *args)
+{
+ if (exit_point == EXIT_INIT_TASK)
+ EXIT_CLEANLY();
+
+ return 0;
+}
+
+void BPF_STRUCT_OPS(exit_exit, struct scx_exit_info *ei)
+{
+ UEI_RECORD(uei, ei);
+}
+
+s32 BPF_STRUCT_OPS_SLEEPABLE(exit_init)
+{
+ if (exit_point == EXIT_INIT)
+ EXIT_CLEANLY();
+
+ return 0;
+}
+
+SEC(".struct_ops.link")
+struct sched_ext_ops exit_ops = {
+ .select_cpu = exit_select_cpu,
+ .enqueue = exit_enqueue,
+ .dispatch = exit_dispatch,
+ .init_task = exit_init_task,
+ .enable = exit_enable,
+ .exit = exit_exit,
+ .init = exit_init,
+ .name = "exit",
+ .timeout_ms = 1000U,
+};
diff --git a/tools/testing/selftests/sched_ext/exit.c b/tools/testing/selftests/sched_ext/exit.c
new file mode 100644
index 000000000000..31bcd06e21cd
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/exit.c
@@ -0,0 +1,55 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2024 David Vernet <[email protected]>
+ */
+#include <bpf/bpf.h>
+#include <sched.h>
+#include <scx/common.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include "exit.bpf.skel.h"
+#include "scx_test.h"
+
+#include "exit_test.h"
+
+static enum scx_test_status run(void *ctx)
+{
+ enum exit_test_case tc;
+
+ for (tc = 0; tc < NUM_EXITS; tc++) {
+ struct exit *skel;
+ struct bpf_link *link;
+ char buf[16];
+
+ skel = exit__open();
+ skel->rodata->exit_point = tc;
+ exit__load(skel);
+ link = bpf_map__attach_struct_ops(skel->maps.exit_ops);
+ if (!link) {
+ SCX_ERR("Failed to attach scheduler");
+ exit__destroy(skel);
+ return SCX_TEST_FAIL;
+ }
+
+ /* Assumes uei.kind is written last */
+ while (skel->data->uei.kind == EXIT_KIND(SCX_EXIT_NONE))
+ sched_yield();
+
+ SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_UNREG_BPF));
+ SCX_EQ(skel->data->uei.exit_code, tc);
+ sprintf(buf, "%d", tc);
+ SCX_ASSERT(!strcmp(skel->data->uei.msg, buf));
+ bpf_link__destroy(link);
+ exit__destroy(skel);
+ }
+
+ return SCX_TEST_PASS;
+}
+
+struct scx_test exit_test = {
+ .name = "exit",
+ .description = "Verify we can cleanly exit a scheduler in multiple places",
+ .run = run,
+};
+REGISTER_SCX_TEST(&exit_test)
diff --git a/tools/testing/selftests/sched_ext/exit_test.h b/tools/testing/selftests/sched_ext/exit_test.h
new file mode 100644
index 000000000000..94f0268b9cb8
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/exit_test.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2024 David Vernet <[email protected]>
+ */
+
+#ifndef __EXIT_TEST_H__
+#define __EXIT_TEST_H__
+
+enum exit_test_case {
+ EXIT_SELECT_CPU,
+ EXIT_ENQUEUE,
+ EXIT_DISPATCH,
+ EXIT_ENABLE,
+ EXIT_INIT_TASK,
+ EXIT_INIT,
+ NUM_EXITS,
+};
+
+#endif // # __EXIT_TEST_H__
diff --git a/tools/testing/selftests/sched_ext/hotplug.bpf.c b/tools/testing/selftests/sched_ext/hotplug.bpf.c
new file mode 100644
index 000000000000..1f5f91f4f66a
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/hotplug.bpf.c
@@ -0,0 +1,55 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2024 David Vernet <[email protected]>
+ */
+
+#include <scx/common.bpf.h>
+
+char _license[] SEC("license") = "GPL";
+
+#include "hotplug_test.h"
+
+UEI_DEFINE(uei);
+
+void BPF_STRUCT_OPS(hotplug_exit, struct scx_exit_info *ei)
+{
+ UEI_RECORD(uei, ei);
+}
+
+static void exit_from_hotplug(s32 cpu, bool onlining)
+{
+ s64 code = SCX_ECODE_ACT_RESTART | HOTPLUG_EXIT_RSN;
+
+ if (onlining)
+ code |= HOTPLUG_ONLINING;
+
+ scx_bpf_exit(code, "hotplug event detected (%d going %s)", cpu,
+ onlining ? "online" : "offline");
+}
+
+void BPF_STRUCT_OPS(hotplug_cpu_online, s32 cpu)
+{
+ exit_from_hotplug(cpu, true);
+}
+
+void BPF_STRUCT_OPS(hotplug_cpu_offline, s32 cpu)
+{
+ exit_from_hotplug(cpu, false);
+}
+
+SEC(".struct_ops.link")
+struct sched_ext_ops hotplug_cb_ops = {
+ .cpu_online = hotplug_cpu_online,
+ .cpu_offline = hotplug_cpu_offline,
+ .exit = hotplug_exit,
+ .name = "hotplug_cbs",
+ .timeout_ms = 1000U,
+};
+
+SEC(".struct_ops.link")
+struct sched_ext_ops hotplug_nocb_ops = {
+ .exit = hotplug_exit,
+ .name = "hotplug_nocbs",
+ .timeout_ms = 1000U,
+};
diff --git a/tools/testing/selftests/sched_ext/hotplug.c b/tools/testing/selftests/sched_ext/hotplug.c
new file mode 100644
index 000000000000..87bf220b1bce
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/hotplug.c
@@ -0,0 +1,168 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2024 David Vernet <[email protected]>
+ */
+#include <bpf/bpf.h>
+#include <sched.h>
+#include <scx/common.h>
+#include <sched.h>
+#include <sys/wait.h>
+#include <unistd.h>
+
+#include "hotplug_test.h"
+#include "hotplug.bpf.skel.h"
+#include "scx_test.h"
+#include "util.h"
+
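+/*
+ * The test hotplugs CPU 1 specifically; setup() skips the test if that CPU is
+ * already offline.
+ */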
+const char *online_path = "/sys/devices/system/cpu/cpu1/online";
+
+static bool is_cpu_online(void)
+{
+ return file_read_long(online_path) > 0;
+}
+
+static void toggle_online_status(bool online)
+{
+ long val = online ? 1 : 0;
+ int ret;
+
+ ret = file_write_long(online_path, val);
+ if (ret != 0)
+ fprintf(stderr, "Failed to bring CPU %s (%s)",
+ online ? "online" : "offline", strerror(errno));
+}
+
+static enum scx_test_status setup(void **ctx)
+{
+ if (!is_cpu_online())
+ return SCX_TEST_SKIP;
+
+ return SCX_TEST_PASS;
+}
+
+static enum scx_test_status test_hotplug(bool onlining, bool cbs_defined)
+{
+ struct hotplug *skel;
+ struct bpf_link *link;
+ long kind, code;
+
+ SCX_ASSERT(is_cpu_online());
+
+ skel = hotplug__open_and_load();
+ SCX_ASSERT(skel);
+
+ /* Testing the offline -> online path, so go offline before starting */
+ if (onlining)
+ toggle_online_status(0);
+
+ if (cbs_defined) {
+ kind = SCX_KIND_VAL(SCX_EXIT_UNREG_BPF);
+ code = SCX_ECODE_VAL(SCX_ECODE_ACT_RESTART) | HOTPLUG_EXIT_RSN;
+ if (onlining)
+ code |= HOTPLUG_ONLINING;
+ } else {
+ kind = SCX_KIND_VAL(SCX_EXIT_UNREG_KERN);
+ code = SCX_ECODE_VAL(SCX_ECODE_ACT_RESTART) |
+ SCX_ECODE_VAL(SCX_ECODE_RSN_HOTPLUG);
+ }
+
+ if (cbs_defined)
+ link = bpf_map__attach_struct_ops(skel->maps.hotplug_cb_ops);
+ else
+ link = bpf_map__attach_struct_ops(skel->maps.hotplug_nocb_ops);
+
+ if (!link) {
+ SCX_ERR("Failed to attach scheduler");
+ hotplug__destroy(skel);
+ return SCX_TEST_FAIL;
+ }
+
+ toggle_online_status(onlining ? 1 : 0);
+
+ while (!UEI_EXITED(skel, uei))
+ sched_yield();
+
+ SCX_EQ(skel->data->uei.kind, kind);
+ SCX_EQ(UEI_REPORT(skel, uei), code);
+
+ if (!onlining)
+ toggle_online_status(1);
+
+ bpf_link__destroy(link);
+ hotplug__destroy(skel);
+
+ return SCX_TEST_PASS;
+}
+
+static enum scx_test_status test_hotplug_attach(void)
+{
+ struct hotplug *skel;
+ struct bpf_link *link;
+ enum scx_test_status status = SCX_TEST_PASS;
+ long kind, code;
+
+ SCX_ASSERT(is_cpu_online());
+ SCX_ASSERT(scx_hotplug_seq() > 0);
+
+ skel = SCX_OPS_OPEN(hotplug_nocb_ops, hotplug);
+ SCX_ASSERT(skel);
+
+ SCX_OPS_LOAD(skel, hotplug_nocb_ops, hotplug, uei);
+
+ /*
+ * Take the CPU offline to increment the global hotplug seq, which
+ * should cause attach to fail due to us setting the hotplug seq above
+ */
+ toggle_online_status(0);
+ link = bpf_map__attach_struct_ops(skel->maps.hotplug_nocb_ops);
+
+ toggle_online_status(1);
+
+ SCX_ASSERT(link);
+ while (!UEI_EXITED(skel, uei))
+ sched_yield();
+
+ kind = SCX_KIND_VAL(SCX_EXIT_UNREG_KERN);
+ code = SCX_ECODE_VAL(SCX_ECODE_ACT_RESTART) |
+ SCX_ECODE_VAL(SCX_ECODE_RSN_HOTPLUG);
+ SCX_EQ(skel->data->uei.kind, kind);
+ SCX_EQ(UEI_REPORT(skel, uei), code);
+
+ bpf_link__destroy(link);
+ hotplug__destroy(skel);
+
+ return status;
+}
+
+static enum scx_test_status run(void *ctx)
+{
+
+#define HP_TEST(__onlining, __cbs_defined) ({ \
+ if (test_hotplug(__onlining, __cbs_defined) != SCX_TEST_PASS) \
+ return SCX_TEST_FAIL; \
+})
+
+ HP_TEST(true, true);
+ HP_TEST(false, true);
+ HP_TEST(true, false);
+ HP_TEST(false, false);
+
+#undef HP_TEST
+
+ return test_hotplug_attach();
+}
+
+static void cleanup(void *ctx)
+{
+ toggle_online_status(1);
+}
+
+struct scx_test hotplug_test = {
+ .name = "hotplug",
+ .description = "Verify hotplug behavior",
+ .setup = setup,
+ .run = run,
+ .cleanup = cleanup,
+};
+REGISTER_SCX_TEST(&hotplug_test)
diff --git a/tools/testing/selftests/sched_ext/hotplug_test.h b/tools/testing/selftests/sched_ext/hotplug_test.h
new file mode 100644
index 000000000000..73d236f90787
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/hotplug_test.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2024 David Vernet <[email protected]>
+ */
+
+#ifndef __HOTPLUG_TEST_H__
+#define __HOTPLUG_TEST_H__
+
+enum hotplug_test_flags {
+ HOTPLUG_EXIT_RSN = 1LLU << 0,
+ HOTPLUG_ONLINING = 1LLU << 1,
+};
+
+#endif // # __HOTPLUG_TEST_H__
diff --git a/tools/testing/selftests/sched_ext/init_enable_count.bpf.c b/tools/testing/selftests/sched_ext/init_enable_count.bpf.c
new file mode 100644
index 000000000000..47ea89a626c3
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/init_enable_count.bpf.c
@@ -0,0 +1,53 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * A scheduler that verifies that we do proper counting of init, enable, etc.
+ * callbacks.
+ *
+ * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2023 David Vernet <[email protected]>
+ * Copyright (c) 2023 Tejun Heo <[email protected]>
+ */
+
+#include <scx/common.bpf.h>
+
+char _license[] SEC("license") = "GPL";
+
+u64 init_task_cnt, exit_task_cnt, enable_cnt, disable_cnt;
+u64 init_fork_cnt, init_transition_cnt;
+
+s32 BPF_STRUCT_OPS_SLEEPABLE(cnt_init_task, struct task_struct *p,
+ struct scx_init_task_args *args)
+{
+ __sync_fetch_and_add(&init_task_cnt, 1);
+
+ if (args->fork)
+ __sync_fetch_and_add(&init_fork_cnt, 1);
+ else
+ __sync_fetch_and_add(&init_transition_cnt, 1);
+
+ return 0;
+}
+
+void BPF_STRUCT_OPS(cnt_exit_task, struct task_struct *p)
+{
+ __sync_fetch_and_add(&exit_task_cnt, 1);
+}
+
+void BPF_STRUCT_OPS(cnt_enable, struct task_struct *p)
+{
+ __sync_fetch_and_add(&enable_cnt, 1);
+}
+
+void BPF_STRUCT_OPS(cnt_disable, struct task_struct *p)
+{
+ __sync_fetch_and_add(&disable_cnt, 1);
+}
+
+SEC(".struct_ops.link")
+struct sched_ext_ops init_enable_count_ops = {
+ .init_task = cnt_init_task,
+ .exit_task = cnt_exit_task,
+ .enable = cnt_enable,
+ .disable = cnt_disable,
+ .name = "init_enable_count",
+};
diff --git a/tools/testing/selftests/sched_ext/init_enable_count.c b/tools/testing/selftests/sched_ext/init_enable_count.c
new file mode 100644
index 000000000000..ef9da0a50846
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/init_enable_count.c
@@ -0,0 +1,166 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2023 David Vernet <[email protected]>
+ * Copyright (c) 2023 Tejun Heo <[email protected]>
+ */
+#include <stdio.h>
+#include <unistd.h>
+#include <sched.h>
+#include <bpf/bpf.h>
+#include <scx/common.h>
+#include <sys/wait.h>
+#include "scx_test.h"
+#include "init_enable_count.bpf.skel.h"
+
+#define SCHED_EXT 7
+
+static struct init_enable_count *
+open_load_prog(bool global)
+{
+ struct init_enable_count *skel;
+
+ skel = init_enable_count__open();
+ SCX_BUG_ON(!skel, "Failed to open skel");
+
+ if (!global)
+ skel->struct_ops.init_enable_count_ops->flags |= __COMPAT_SCX_OPS_SWITCH_PARTIAL;
+
+ SCX_BUG_ON(init_enable_count__load(skel), "Failed to load skel");
+
+ return skel;
+}
+
+static enum scx_test_status run_test(bool global)
+{
+ struct init_enable_count *skel;
+ struct bpf_link *link;
+ const u32 num_children = 5, num_pre_forks = 1024;
+ int ret, i, status;
+ struct sched_param param = {};
+ pid_t pids[num_pre_forks];
+
+ skel = open_load_prog(global);
+
+ /*
+ * Fork a bunch of children before we attach the scheduler so that we
+ * ensure (at least in practical terms) that there are more tasks that
+ * transition from SCHED_OTHER -> SCHED_EXT than there are tasks that
+ * take the fork() path either below or in other processes.
+ */
+ for (i = 0; i < num_pre_forks; i++) {
+ pids[i] = fork();
+ SCX_FAIL_IF(pids[i] < 0, "Failed to fork child");
+ if (pids[i] == 0) {
+ sleep(1);
+ exit(0);
+ }
+ }
+
+ link = bpf_map__attach_struct_ops(skel->maps.init_enable_count_ops);
+ SCX_FAIL_IF(!link, "Failed to attach struct_ops");
+
+ for (i = 0; i < num_pre_forks; i++) {
+ SCX_FAIL_IF(waitpid(pids[i], &status, 0) != pids[i],
+ "Failed to wait for pre-forked child\n");
+
+ SCX_FAIL_IF(status != 0, "Pre-forked child %d exited with status %d\n", i,
+ status);
+ }
+
+ bpf_link__destroy(link);
+ SCX_GE(skel->bss->init_task_cnt, num_pre_forks);
+ SCX_GE(skel->bss->exit_task_cnt, num_pre_forks);
+
+ link = bpf_map__attach_struct_ops(skel->maps.init_enable_count_ops);
+ SCX_FAIL_IF(!link, "Failed to attach struct_ops");
+
+ /* SCHED_EXT children */
+ for (i = 0; i < num_children; i++) {
+ pids[i] = fork();
+ SCX_FAIL_IF(pids[i] < 0, "Failed to fork child");
+
+ if (pids[i] == 0) {
+ ret = sched_setscheduler(0, SCHED_EXT, &param);
+ SCX_BUG_ON(ret, "Failed to set sched to sched_ext");
+
+ /*
+ * Reset to SCHED_OTHER for half of them. Counts for
+ * everything should still be the same regardless, as
+ * ops.disable() is invoked even if a task is still on
+ * SCHED_EXT before it exits.
+ */
+ if (i % 2 == 0) {
+ ret = sched_setscheduler(0, SCHED_OTHER, &param);
+ SCX_BUG_ON(ret, "Failed to reset sched to normal");
+ }
+ exit(0);
+ }
+ }
+ for (i = 0; i < num_children; i++) {
+ SCX_FAIL_IF(waitpid(pids[i], &status, 0) != pids[i],
+ "Failed to wait for SCX child\n");
+
+ SCX_FAIL_IF(status != 0, "SCX child %d exited with status %d\n", i,
+ status);
+ }
+
+ /* SCHED_OTHER children */
+ for (i = 0; i < num_children; i++) {
+ pids[i] = fork();
+ if (pids[i] == 0)
+ exit(0);
+ }
+
+ for (i = 0; i < num_children; i++) {
+ SCX_FAIL_IF(waitpid(pids[i], &status, 0) != pids[i],
+ "Failed to wait for normal child\n");
+
+ SCX_FAIL_IF(status != 0, "Normal child %d exited with status %d\n", i,
+ status);
+ }
+
+ bpf_link__destroy(link);
+
+ SCX_GE(skel->bss->init_task_cnt, 2 * num_children);
+ SCX_GE(skel->bss->exit_task_cnt, 2 * num_children);
+
+ if (global) {
+ SCX_GE(skel->bss->enable_cnt, 2 * num_children);
+ SCX_GE(skel->bss->disable_cnt, 2 * num_children);
+ } else {
+ SCX_EQ(skel->bss->enable_cnt, num_children);
+ SCX_EQ(skel->bss->disable_cnt, num_children);
+ }
+ /*
+ * We forked a ton of tasks before we attached the scheduler above, so
+ * this should be fine. Technically it could be flaky if a ton of forks
+ * are happening at the same time in other processes, but that should
+ * be exceedingly unlikely.
+ */
+ SCX_GT(skel->bss->init_transition_cnt, skel->bss->init_fork_cnt);
+ SCX_GE(skel->bss->init_fork_cnt, 2 * num_children);
+
+ init_enable_count__destroy(skel);
+
+ return SCX_TEST_PASS;
+}
+
+static enum scx_test_status run(void *ctx)
+{
+ enum scx_test_status status;
+
+ status = run_test(true);
+ if (status != SCX_TEST_PASS)
+ return status;
+
+ return run_test(false);
+}
+
+struct scx_test init_enable_count = {
+ .name = "init_enable_count",
+	.description = "Verify we correctly count the init, enable, etc. "
+		       "callbacks",
+ .run = run,
+};
+REGISTER_SCX_TEST(&init_enable_count)
diff --git a/tools/testing/selftests/sched_ext/maximal.bpf.c b/tools/testing/selftests/sched_ext/maximal.bpf.c
new file mode 100644
index 000000000000..00bfa9cb95d3
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/maximal.bpf.c
@@ -0,0 +1,164 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * A scheduler with every callback defined.
+ *
+ * This scheduler defines every struct sched_ext_ops callback and is used to
+ * verify that such a scheduler can be successfully loaded and attached.
+ *
+ * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2024 David Vernet <[email protected]>
+ */
+
+#include <scx/common.bpf.h>
+
+char _license[] SEC("license") = "GPL";
+
+s32 BPF_STRUCT_OPS(maximal_select_cpu, struct task_struct *p, s32 prev_cpu,
+ u64 wake_flags)
+{
+ return prev_cpu;
+}
+
+void BPF_STRUCT_OPS(maximal_enqueue, struct task_struct *p, u64 enq_flags)
+{
+ scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
+}
+
+void BPF_STRUCT_OPS(maximal_dequeue, struct task_struct *p, u64 deq_flags)
+{}
+
+void BPF_STRUCT_OPS(maximal_dispatch, s32 cpu, struct task_struct *prev)
+{
+ scx_bpf_consume(SCX_DSQ_GLOBAL);
+}
+
+void BPF_STRUCT_OPS(maximal_runnable, struct task_struct *p, u64 enq_flags)
+{}
+
+void BPF_STRUCT_OPS(maximal_running, struct task_struct *p)
+{}
+
+void BPF_STRUCT_OPS(maximal_stopping, struct task_struct *p, bool runnable)
+{}
+
+void BPF_STRUCT_OPS(maximal_quiescent, struct task_struct *p, u64 deq_flags)
+{}
+
+bool BPF_STRUCT_OPS(maximal_yield, struct task_struct *from,
+ struct task_struct *to)
+{
+ return false;
+}
+
+bool BPF_STRUCT_OPS(maximal_core_sched_before, struct task_struct *a,
+ struct task_struct *b)
+{
+ return false;
+}
+
+void BPF_STRUCT_OPS(maximal_set_weight, struct task_struct *p, u32 weight)
+{}
+
+void BPF_STRUCT_OPS(maximal_set_cpumask, struct task_struct *p,
+ const struct cpumask *cpumask)
+{}
+
+void BPF_STRUCT_OPS(maximal_update_idle, s32 cpu, bool idle)
+{}
+
+void BPF_STRUCT_OPS(maximal_cpu_acquire, s32 cpu,
+ struct scx_cpu_acquire_args *args)
+{}
+
+void BPF_STRUCT_OPS(maximal_cpu_release, s32 cpu,
+ struct scx_cpu_release_args *args)
+{}
+
+void BPF_STRUCT_OPS(maximal_cpu_online, s32 cpu)
+{}
+
+void BPF_STRUCT_OPS(maximal_cpu_offline, s32 cpu)
+{}
+
+s32 BPF_STRUCT_OPS(maximal_init_task, struct task_struct *p,
+ struct scx_init_task_args *args)
+{
+ return 0;
+}
+
+void BPF_STRUCT_OPS(maximal_enable, struct task_struct *p)
+{}
+
+void BPF_STRUCT_OPS(maximal_exit_task, struct task_struct *p,
+ struct scx_exit_task_args *args)
+{}
+
+void BPF_STRUCT_OPS(maximal_disable, struct task_struct *p)
+{}
+
+s32 BPF_STRUCT_OPS(maximal_cgroup_init, struct cgroup *cgrp,
+ struct scx_cgroup_init_args *args)
+{
+ return 0;
+}
+
+void BPF_STRUCT_OPS(maximal_cgroup_exit, struct cgroup *cgrp)
+{}
+
+s32 BPF_STRUCT_OPS(maximal_cgroup_prep_move, struct task_struct *p,
+ struct cgroup *from, struct cgroup *to)
+{
+ return 0;
+}
+
+void BPF_STRUCT_OPS(maximal_cgroup_move, struct task_struct *p,
+ struct cgroup *from, struct cgroup *to)
+{}
+
+void BPF_STRUCT_OPS(maximal_cgroup_cancel_move, struct task_struct *p,
+ struct cgroup *from, struct cgroup *to)
+{}
+
+void BPF_STRUCT_OPS(maximal_cgroup_set_weight, struct cgroup *cgrp, u32 weight)
+{}
+
+s32 BPF_STRUCT_OPS_SLEEPABLE(maximal_init)
+{
+ return 0;
+}
+
+void BPF_STRUCT_OPS(maximal_exit, struct scx_exit_info *info)
+{}
+
+SEC(".struct_ops.link")
+struct sched_ext_ops maximal_ops = {
+ .select_cpu = maximal_select_cpu,
+ .enqueue = maximal_enqueue,
+ .dequeue = maximal_dequeue,
+ .dispatch = maximal_dispatch,
+ .runnable = maximal_runnable,
+ .running = maximal_running,
+ .stopping = maximal_stopping,
+ .quiescent = maximal_quiescent,
+ .yield = maximal_yield,
+ .core_sched_before = maximal_core_sched_before,
+ .set_weight = maximal_set_weight,
+ .set_cpumask = maximal_set_cpumask,
+ .update_idle = maximal_update_idle,
+ .cpu_acquire = maximal_cpu_acquire,
+ .cpu_release = maximal_cpu_release,
+ .cpu_online = maximal_cpu_online,
+ .cpu_offline = maximal_cpu_offline,
+ .init_task = maximal_init_task,
+ .enable = maximal_enable,
+ .exit_task = maximal_exit_task,
+ .disable = maximal_disable,
+ .cgroup_init = maximal_cgroup_init,
+ .cgroup_exit = maximal_cgroup_exit,
+ .cgroup_prep_move = maximal_cgroup_prep_move,
+ .cgroup_move = maximal_cgroup_move,
+ .cgroup_cancel_move = maximal_cgroup_cancel_move,
+ .cgroup_set_weight = maximal_cgroup_set_weight,
+ .init = maximal_init,
+ .exit = maximal_exit,
+ .name = "maximal",
+};
diff --git a/tools/testing/selftests/sched_ext/maximal.c b/tools/testing/selftests/sched_ext/maximal.c
new file mode 100644
index 000000000000..f38fc973c380
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/maximal.c
@@ -0,0 +1,51 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2024 David Vernet <[email protected]>
+ */
+#include <bpf/bpf.h>
+#include <scx/common.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include "maximal.bpf.skel.h"
+#include "scx_test.h"
+
+static enum scx_test_status setup(void **ctx)
+{
+ struct maximal *skel;
+
+ skel = maximal__open_and_load();
+ SCX_FAIL_IF(!skel, "Failed to open and load skel");
+ *ctx = skel;
+
+ return SCX_TEST_PASS;
+}
+
+static enum scx_test_status run(void *ctx)
+{
+ struct maximal *skel = ctx;
+ struct bpf_link *link;
+
+ link = bpf_map__attach_struct_ops(skel->maps.maximal_ops);
+ SCX_FAIL_IF(!link, "Failed to attach scheduler");
+
+ bpf_link__destroy(link);
+
+ return SCX_TEST_PASS;
+}
+
+static void cleanup(void *ctx)
+{
+ struct maximal *skel = ctx;
+
+ maximal__destroy(skel);
+}
+
+struct scx_test maximal = {
+ .name = "maximal",
+ .description = "Verify we can load a scheduler with every callback defined",
+ .setup = setup,
+ .run = run,
+ .cleanup = cleanup,
+};
+REGISTER_SCX_TEST(&maximal)
diff --git a/tools/testing/selftests/sched_ext/maybe_null.bpf.c b/tools/testing/selftests/sched_ext/maybe_null.bpf.c
new file mode 100644
index 000000000000..ad5e694226bb
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/maybe_null.bpf.c
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
+ */
+
+#include <scx/common.bpf.h>
+
+char _license[] SEC("license") = "GPL";
+
+u64 vtime_test;
+
+void BPF_STRUCT_OPS(maybe_null_running, struct task_struct *p)
+{}
+
+void BPF_STRUCT_OPS(maybe_null_success_dispatch, s32 cpu, struct task_struct *p)
+{
+ if (p != NULL)
+ vtime_test = p->scx.dsq_vtime;
+}
+
+SEC(".struct_ops.link")
+struct sched_ext_ops maybe_null_success = {
+ .dispatch = maybe_null_success_dispatch,
+ .enable = maybe_null_running,
+ .name = "minimal",
+};
diff --git a/tools/testing/selftests/sched_ext/maybe_null.c b/tools/testing/selftests/sched_ext/maybe_null.c
new file mode 100644
index 000000000000..3f26b784f9c5
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/maybe_null.c
@@ -0,0 +1,40 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
+ */
+#include <bpf/bpf.h>
+#include <scx/common.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include "maybe_null.bpf.skel.h"
+#include "maybe_null_fail.bpf.skel.h"
+#include "scx_test.h"
+
+static enum scx_test_status run(void *ctx)
+{
+ struct maybe_null *skel;
+ struct maybe_null_fail *fail_skel;
+
+ skel = maybe_null__open_and_load();
+ if (!skel) {
+ SCX_ERR("Failed to open and load maybe_null skel");
+ return SCX_TEST_FAIL;
+ }
+ maybe_null__destroy(skel);
+
+ fail_skel = maybe_null_fail__open_and_load();
+ if (fail_skel) {
+ maybe_null_fail__destroy(fail_skel);
+		SCX_ERR("Should have failed to open and load maybe_null_fail skel");
+ return SCX_TEST_FAIL;
+ }
+
+ return SCX_TEST_PASS;
+}
+
+struct scx_test maybe_null = {
+ .name = "maybe_null",
+	.description = "Verify that PTR_MAYBE_NULL works for .dispatch",
+ .run = run,
+};
+REGISTER_SCX_TEST(&maybe_null)
diff --git a/tools/testing/selftests/sched_ext/maybe_null_fail.bpf.c b/tools/testing/selftests/sched_ext/maybe_null_fail.bpf.c
new file mode 100644
index 000000000000..1607fe07bead
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/maybe_null_fail.bpf.c
@@ -0,0 +1,25 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
+ */
+
+#include <scx/common.bpf.h>
+
+char _license[] SEC("license") = "GPL";
+
+u64 vtime_test;
+
+void BPF_STRUCT_OPS(maybe_null_running, struct task_struct *p)
+{}
+
+void BPF_STRUCT_OPS(maybe_null_fail_dispatch, s32 cpu, struct task_struct *p)
+{
+ vtime_test = p->scx.dsq_vtime;
+}
+
+SEC(".struct_ops.link")
+struct sched_ext_ops maybe_null_fail = {
+ .dispatch = maybe_null_fail_dispatch,
+ .enable = maybe_null_running,
+ .name = "minimal",
+};
diff --git a/tools/testing/selftests/sched_ext/minimal.bpf.c b/tools/testing/selftests/sched_ext/minimal.bpf.c
new file mode 100644
index 000000000000..6a7eccef0104
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/minimal.bpf.c
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * A completely minimal scheduler.
+ *
+ * This scheduler defines the absolute minimal set of struct sched_ext_ops
+ * fields: its name. It should _not_ fail to be loaded, and can be used to
+ * exercise the default scheduling paths in ext.c.
+ *
+ * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2023 David Vernet <[email protected]>
+ * Copyright (c) 2023 Tejun Heo <[email protected]>
+ */
+
+#include <scx/common.bpf.h>
+
+char _license[] SEC("license") = "GPL";
+
+SEC(".struct_ops.link")
+struct sched_ext_ops minimal_ops = {
+ .name = "minimal",
+};
diff --git a/tools/testing/selftests/sched_ext/minimal.c b/tools/testing/selftests/sched_ext/minimal.c
new file mode 100644
index 000000000000..6c5db8ebbf8a
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/minimal.c
@@ -0,0 +1,58 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2023 David Vernet <[email protected]>
+ * Copyright (c) 2023 Tejun Heo <[email protected]>
+ */
+#include <bpf/bpf.h>
+#include <scx/common.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include "minimal.bpf.skel.h"
+#include "scx_test.h"
+
+static enum scx_test_status setup(void **ctx)
+{
+ struct minimal *skel;
+
+ skel = minimal__open_and_load();
+ if (!skel) {
+ SCX_ERR("Failed to open and load skel");
+ return SCX_TEST_FAIL;
+ }
+ *ctx = skel;
+
+ return SCX_TEST_PASS;
+}
+
+static enum scx_test_status run(void *ctx)
+{
+ struct minimal *skel = ctx;
+ struct bpf_link *link;
+
+ link = bpf_map__attach_struct_ops(skel->maps.minimal_ops);
+ if (!link) {
+ SCX_ERR("Failed to attach scheduler");
+ return SCX_TEST_FAIL;
+ }
+
+ bpf_link__destroy(link);
+
+ return SCX_TEST_PASS;
+}
+
+static void cleanup(void *ctx)
+{
+ struct minimal *skel = ctx;
+
+ minimal__destroy(skel);
+}
+
+struct scx_test minimal = {
+ .name = "minimal",
+ .description = "Verify we can load a fully minimal scheduler",
+ .setup = setup,
+ .run = run,
+ .cleanup = cleanup,
+};
+REGISTER_SCX_TEST(&minimal)
diff --git a/tools/testing/selftests/sched_ext/prog_run.bpf.c b/tools/testing/selftests/sched_ext/prog_run.bpf.c
new file mode 100644
index 000000000000..fd2c8f12af16
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/prog_run.bpf.c
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * A scheduler that validates that we can invoke sched_ext kfuncs in
+ * BPF_PROG_TYPE_SYSCALL programs.
+ *
+ * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2024 David Vernet <[email protected]>
+ */
+
+#include <scx/common.bpf.h>
+
+UEI_DEFINE(uei);
+
+char _license[] SEC("license") = "GPL";
+
+SEC("syscall")
+int BPF_PROG(prog_run_syscall)
+{
+ scx_bpf_exit(0xdeadbeef, "Exited from PROG_RUN");
+ return 0;
+}
+
+void BPF_STRUCT_OPS(prog_run_exit, struct scx_exit_info *ei)
+{
+ UEI_RECORD(uei, ei);
+}
+
+SEC(".struct_ops.link")
+struct sched_ext_ops prog_run_ops = {
+ .exit = prog_run_exit,
+ .name = "prog_run",
+};
diff --git a/tools/testing/selftests/sched_ext/prog_run.c b/tools/testing/selftests/sched_ext/prog_run.c
new file mode 100644
index 000000000000..3cd57ef8daaa
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/prog_run.c
@@ -0,0 +1,78 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2024 David Vernet <[email protected]>
+ */
+#include <bpf/bpf.h>
+#include <sched.h>
+#include <scx/common.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include "prog_run.bpf.skel.h"
+#include "scx_test.h"
+
+static enum scx_test_status setup(void **ctx)
+{
+ struct prog_run *skel;
+
+ skel = prog_run__open_and_load();
+ if (!skel) {
+ SCX_ERR("Failed to open and load skel");
+ return SCX_TEST_FAIL;
+ }
+ *ctx = skel;
+
+ return SCX_TEST_PASS;
+}
+
+static enum scx_test_status run(void *ctx)
+{
+ struct prog_run *skel = ctx;
+ struct bpf_link *link;
+ int prog_fd, err = 0;
+
+ prog_fd = bpf_program__fd(skel->progs.prog_run_syscall);
+ if (prog_fd < 0) {
+ SCX_ERR("Failed to get BPF_PROG_RUN prog");
+ return SCX_TEST_FAIL;
+ }
+
+ LIBBPF_OPTS(bpf_test_run_opts, topts);
+
+ link = bpf_map__attach_struct_ops(skel->maps.prog_run_ops);
+ if (!link) {
+ SCX_ERR("Failed to attach scheduler");
+ close(prog_fd);
+ return SCX_TEST_FAIL;
+ }
+
+ err = bpf_prog_test_run_opts(prog_fd, &topts);
+ SCX_EQ(err, 0);
+
+ /* Assumes uei.kind is written last */
+ while (skel->data->uei.kind == EXIT_KIND(SCX_EXIT_NONE))
+ sched_yield();
+
+ SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_UNREG_BPF));
+ SCX_EQ(skel->data->uei.exit_code, 0xdeadbeef);
+ close(prog_fd);
+ bpf_link__destroy(link);
+
+ return SCX_TEST_PASS;
+}
+
+static void cleanup(void *ctx)
+{
+ struct prog_run *skel = ctx;
+
+ prog_run__destroy(skel);
+}
+
+struct scx_test prog_run = {
+ .name = "prog_run",
+ .description = "Verify we can call into a scheduler with BPF_PROG_RUN, and invoke kfuncs",
+ .setup = setup,
+ .run = run,
+ .cleanup = cleanup,
+};
+REGISTER_SCX_TEST(&prog_run)
diff --git a/tools/testing/selftests/sched_ext/reload_loop.c b/tools/testing/selftests/sched_ext/reload_loop.c
new file mode 100644
index 000000000000..5cfba2d6e056
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/reload_loop.c
@@ -0,0 +1,75 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2024 David Vernet <[email protected]>
+ */
+#include <bpf/bpf.h>
+#include <pthread.h>
+#include <scx/common.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include "maximal.bpf.skel.h"
+#include "scx_test.h"
+
+static struct maximal *skel;
+static pthread_t threads[2];
+
+bool force_exit = false;
+
+static enum scx_test_status setup(void **ctx)
+{
+ skel = maximal__open_and_load();
+ if (!skel) {
+ SCX_ERR("Failed to open and load skel");
+ return SCX_TEST_FAIL;
+ }
+
+ return SCX_TEST_PASS;
+}
+
+static void *do_reload_loop(void *arg)
+{
+ u32 i;
+
+ for (i = 0; i < 1024 && !force_exit; i++) {
+ struct bpf_link *link;
+
+ link = bpf_map__attach_struct_ops(skel->maps.maximal_ops);
+ if (link)
+ bpf_link__destroy(link);
+ }
+
+ return NULL;
+}
+
+static enum scx_test_status run(void *ctx)
+{
+ int err;
+ void *ret;
+
+ err = pthread_create(&threads[0], NULL, do_reload_loop, NULL);
+ SCX_FAIL_IF(err, "Failed to create thread 0");
+
+ err = pthread_create(&threads[1], NULL, do_reload_loop, NULL);
+ SCX_FAIL_IF(err, "Failed to create thread 1");
+
+ SCX_FAIL_IF(pthread_join(threads[0], &ret), "thread 0 failed");
+ SCX_FAIL_IF(pthread_join(threads[1], &ret), "thread 1 failed");
+
+ return SCX_TEST_PASS;
+}
+
+static void cleanup(void *ctx)
+{
+ force_exit = true;
+ maximal__destroy(skel);
+}
+
+struct scx_test reload_loop = {
+ .name = "reload_loop",
+ .description = "Stress test loading and unloading schedulers repeatedly in a tight loop",
+ .setup = setup,
+ .run = run,
+ .cleanup = cleanup,
+};
+REGISTER_SCX_TEST(&reload_loop)
diff --git a/tools/testing/selftests/sched_ext/runner.c b/tools/testing/selftests/sched_ext/runner.c
new file mode 100644
index 000000000000..eab48c7ff309
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/runner.c
@@ -0,0 +1,201 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2024 David Vernet <[email protected]>
+ * Copyright (c) 2024 Tejun Heo <[email protected]>
+ */
+#include <stdio.h>
+#include <unistd.h>
+#include <signal.h>
+#include <libgen.h>
+#include <bpf/bpf.h>
+#include "scx_test.h"
+
+const char help_fmt[] =
+"The runner for sched_ext tests.\n"
+"\n"
+"The runner is statically linked against all testcases, and runs them all serially.\n"
+"The testcases must run serially, as only a single host-wide sched_ext\n"
+"scheduler may be loaded at any given time.\n"
+"\n"
+"Usage: %s [-t TEST] [-h]\n"
+"\n"
+" -t TEST Only run tests whose name includes this string\n"
+" -s Include print output for skipped tests\n"
+" -q Don't print the test descriptions during run\n"
+" -h Display this help and exit\n";
+
+static volatile int exit_req;
+static bool quiet, print_skipped;
+
+#define MAX_SCX_TESTS 2048
+
+static struct scx_test __scx_tests[MAX_SCX_TESTS];
+static unsigned __scx_num_tests = 0;
+
+static void sigint_handler(int simple)
+{
+ exit_req = 1;
+}
+
+static void print_test_preamble(const struct scx_test *test, bool quiet)
+{
+ printf("===== START =====\n");
+ printf("TEST: %s\n", test->name);
+ if (!quiet)
+ printf("DESCRIPTION: %s\n", test->description);
+ printf("OUTPUT:\n");
+}
+
+static const char *status_to_result(enum scx_test_status status)
+{
+ switch (status) {
+ case SCX_TEST_PASS:
+ case SCX_TEST_SKIP:
+ return "ok";
+ case SCX_TEST_FAIL:
+ return "not ok";
+ default:
+ return "<UNKNOWN>";
+ }
+}
+
+static void print_test_result(const struct scx_test *test,
+ enum scx_test_status status,
+ unsigned int testnum)
+{
+ const char *result = status_to_result(status);
+ const char *directive = status == SCX_TEST_SKIP ? "SKIP " : "";
+
+ printf("%s %u %s # %s\n", result, testnum, test->name, directive);
+ printf("===== END =====\n");
+}
+
+static bool should_skip_test(const struct scx_test *test, const char *filter)
+{
+ return !strstr(test->name, filter);
+}
+
+static enum scx_test_status run_test(const struct scx_test *test)
+{
+ enum scx_test_status status;
+ void *context = NULL;
+
+ if (test->setup) {
+ status = test->setup(&context);
+ if (status != SCX_TEST_PASS)
+ return status;
+ }
+
+ status = test->run(context);
+
+ if (test->cleanup)
+ test->cleanup(context);
+
+ return status;
+}
+
+static bool test_valid(const struct scx_test *test)
+{
+ if (!test) {
+ fprintf(stderr, "NULL test detected\n");
+ return false;
+ }
+
+ if (!test->name) {
+ fprintf(stderr,
+ "Test with no name found. Must specify test name.\n");
+ return false;
+ }
+
+ if (!test->description) {
+ fprintf(stderr, "Test %s requires description.\n", test->name);
+ return false;
+ }
+
+ if (!test->run) {
+ fprintf(stderr, "Test %s has no run() callback\n", test->name);
+ return false;
+ }
+
+ return true;
+}
+
+int main(int argc, char **argv)
+{
+ const char *filter = NULL;
+ unsigned testnum = 0, i;
+ unsigned passed = 0, skipped = 0, failed = 0;
+ int opt;
+
+ signal(SIGINT, sigint_handler);
+ signal(SIGTERM, sigint_handler);
+
+ libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
+
+ while ((opt = getopt(argc, argv, "qst:h")) != -1) {
+ switch (opt) {
+ case 'q':
+ quiet = true;
+ break;
+ case 's':
+ print_skipped = true;
+ break;
+ case 't':
+ filter = optarg;
+ break;
+ default:
+ fprintf(stderr, help_fmt, basename(argv[0]));
+ return opt != 'h';
+ }
+ }
+
+ for (i = 0; i < __scx_num_tests; i++) {
+ enum scx_test_status status;
+ struct scx_test *test = &__scx_tests[i];
+
+ if (filter && should_skip_test(test, filter)) {
+ /*
+ * Printing the skipped tests and their preambles can
+ * add a lot of noise to the runner output. Printing
+ * this is only really useful for CI, so let's skip it
+ * by default.
+ */
+ if (print_skipped) {
+ print_test_preamble(test, quiet);
+ print_test_result(test, SCX_TEST_SKIP, ++testnum);
+ }
+ continue;
+ }
+
+ print_test_preamble(test, quiet);
+ status = run_test(test);
+ print_test_result(test, status, ++testnum);
+ switch (status) {
+ case SCX_TEST_PASS:
+ passed++;
+ break;
+ case SCX_TEST_SKIP:
+ skipped++;
+ break;
+ case SCX_TEST_FAIL:
+ failed++;
+ break;
+ }
+ }
+ printf("\n\n=============================\n\n");
+ printf("RESULTS:\n\n");
+ printf("PASSED: %u\n", passed);
+ printf("SKIPPED: %u\n", skipped);
+ printf("FAILED: %u\n", failed);
+
+ return 0;
+}
+
+void scx_test_register(struct scx_test *test)
+{
+ SCX_BUG_ON(!test_valid(test), "Invalid test found");
+ SCX_BUG_ON(__scx_num_tests >= MAX_SCX_TESTS, "Maximum tests exceeded");
+
+ __scx_tests[__scx_num_tests++] = *test;
+}
diff --git a/tools/testing/selftests/sched_ext/scx_test.h b/tools/testing/selftests/sched_ext/scx_test.h
new file mode 100644
index 000000000000..90b8d6915bb7
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/scx_test.h
@@ -0,0 +1,131 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2023 Tejun Heo <[email protected]>
+ * Copyright (c) 2023 David Vernet <[email protected]>
+ */
+
+#ifndef __SCX_TEST_H__
+#define __SCX_TEST_H__
+
+#include <errno.h>
+#include <scx/common.h>
+#include <scx/compat.h>
+
+enum scx_test_status {
+ SCX_TEST_PASS = 0,
+ SCX_TEST_SKIP,
+ SCX_TEST_FAIL,
+};
+
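+/*
+ * Resolve an scx_exit_kind value by name at runtime via the compat helpers
+ * (which read it from the kernel's BTF), defaulting to zero if the entry
+ * doesn't exist, so the tests don't depend on the numeric enum values. See
+ * __COMPAT_ENUM_OR_ZERO() in scx/compat.h.
+ */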
+#define EXIT_KIND(__ent) __COMPAT_ENUM_OR_ZERO("scx_exit_kind", #__ent)
+
+struct scx_test {
+ /**
+ * name - The name of the testcase.
+ */
+ const char *name;
+
+ /**
+ * description - A description of your testcase: what it tests and is
+ * meant to validate.
+ */
+ const char *description;
+
+ /*
+ * setup - Setup the test.
+ * @ctx: A pointer to a context object that will be passed to run and
+ * cleanup.
+ *
+ * An optional callback that allows a testcase to perform setup for its
+ * run. A test may return SCX_TEST_SKIP to skip the run.
+ */
+ enum scx_test_status (*setup)(void **ctx);
+
+ /*
+ * run - Run the test.
+ * @ctx: Context set in the setup() callback. If @ctx was not set in
+ * setup(), it is NULL.
+ *
+ * The main test. Callers should return one of:
+ *
+ * - SCX_TEST_PASS: Test passed
+ * - SCX_TEST_SKIP: Test should be skipped
+ * - SCX_TEST_FAIL: Test failed
+ *
+ * This callback must be defined.
+ */
+ enum scx_test_status (*run)(void *ctx);
+
+ /*
+ * cleanup - Perform cleanup following the test
+ * @ctx: Context set in the setup() callback. If @ctx was not set in
+ * setup(), it is NULL.
+ *
+ * An optional callback that allows a test to perform cleanup after
+ * being run. This callback is run even if the run() callback returns
+ * SCX_TEST_SKIP or SCX_TEST_FAIL. It is not run if setup() returns
+ * SCX_TEST_SKIP or SCX_TEST_FAIL.
+ */
+ void (*cleanup)(void *ctx);
+};
+
+void scx_test_register(struct scx_test *test);
+
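+/*
+ * Each testcase registers itself with the runner from an ELF constructor, so
+ * simply linking a testcase object into the runner is enough to have it run.
+ */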
+#define REGISTER_SCX_TEST(__test) \
+ __attribute__((constructor)) \
+ static void ___scxregister##__LINE__(void) \
+ { \
+ scx_test_register(__test); \
+ }
+
+#define SCX_ERR(__fmt, ...) \
+ do { \
+ fprintf(stderr, "ERR: %s:%d\n", __FILE__, __LINE__); \
+ fprintf(stderr, __fmt"\n", ##__VA_ARGS__); \
+ } while (0)
+
+#define SCX_FAIL(__fmt, ...) \
+ do { \
+ SCX_ERR(__fmt, ##__VA_ARGS__); \
+ return SCX_TEST_FAIL; \
+ } while (0)
+
+#define SCX_FAIL_IF(__cond, __fmt, ...) \
+ do { \
+ if (__cond) \
+ SCX_FAIL(__fmt, ##__VA_ARGS__); \
+ } while (0)
+
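+/*
+ * The comparison helpers below expand to an early 'return SCX_TEST_FAIL' on
+ * mismatch, so they may only be used in functions returning enum
+ * scx_test_status (i.e. the setup() and run() callbacks).
+ */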
+#define SCX_GT(_x, _y) SCX_FAIL_IF((_x) <= (_y), "Expected %s > %s (%lu > %lu)", \
+ #_x, #_y, (u64)(_x), (u64)(_y))
+#define SCX_GE(_x, _y) SCX_FAIL_IF((_x) < (_y), "Expected %s >= %s (%lu >= %lu)", \
+ #_x, #_y, (u64)(_x), (u64)(_y))
+#define SCX_LT(_x, _y) SCX_FAIL_IF((_x) >= (_y), "Expected %s < %s (%lu < %lu)", \
+ #_x, #_y, (u64)(_x), (u64)(_y))
+#define SCX_LE(_x, _y) SCX_FAIL_IF((_x) > (_y), "Expected %s <= %s (%lu <= %lu)", \
+ #_x, #_y, (u64)(_x), (u64)(_y))
+#define SCX_EQ(_x, _y) SCX_FAIL_IF((_x) != (_y), "Expected %s == %s (%lu == %lu)", \
+ #_x, #_y, (u64)(_x), (u64)(_y))
+#define SCX_ASSERT(_x) SCX_FAIL_IF(!(_x), "Expected %s to be true (%lu)", \
+ #_x, (u64)(_x))
+
+#define SCX_ECODE_VAL(__ecode) ({ \
+ u64 __val = 0; \
+ bool __found = false; \
+ \
+ __found = __COMPAT_read_enum("scx_exit_code", #__ecode, &__val); \
+ SCX_ASSERT(__found); \
+ (s64)__val; \
+})
+
+#define SCX_KIND_VAL(__kind) ({ \
+ u64 __val = 0; \
+ bool __found = false; \
+ \
+ __found = __COMPAT_read_enum("scx_exit_kind", #__kind, &__val); \
+ SCX_ASSERT(__found); \
+ __val; \
+})
+
+#endif // __SCX_TEST_H__
diff --git a/tools/testing/selftests/sched_ext/select_cpu_dfl.bpf.c b/tools/testing/selftests/sched_ext/select_cpu_dfl.bpf.c
new file mode 100644
index 000000000000..2ed2991afafe
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/select_cpu_dfl.bpf.c
@@ -0,0 +1,40 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * A scheduler that validates the behavior of direct dispatching with a default
+ * select_cpu implementation.
+ *
+ * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2023 David Vernet <[email protected]>
+ * Copyright (c) 2023 Tejun Heo <[email protected]>
+ */
+
+#include <scx/common.bpf.h>
+
+char _license[] SEC("license") = "GPL";
+
+bool saw_local = false;
+
+static bool task_is_test(const struct task_struct *p)
+{
+ return !bpf_strncmp(p->comm, 9, "select_cpu");
+}
+
+void BPF_STRUCT_OPS(select_cpu_dfl_enqueue, struct task_struct *p,
+ u64 enq_flags)
+{
+ const struct cpumask *idle_mask = scx_bpf_get_idle_cpumask();
+
+ if (task_is_test(p) &&
+ bpf_cpumask_test_cpu(scx_bpf_task_cpu(p), idle_mask)) {
+ saw_local = true;
+ }
+ scx_bpf_put_idle_cpumask(idle_mask);
+
+ scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
+}
+
+SEC(".struct_ops.link")
+struct sched_ext_ops select_cpu_dfl_ops = {
+ .enqueue = select_cpu_dfl_enqueue,
+ .name = "select_cpu_dfl",
+};
diff --git a/tools/testing/selftests/sched_ext/select_cpu_dfl.c b/tools/testing/selftests/sched_ext/select_cpu_dfl.c
new file mode 100644
index 000000000000..a53a40c2d2f0
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/select_cpu_dfl.c
@@ -0,0 +1,72 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2023 David Vernet <[email protected]>
+ * Copyright (c) 2023 Tejun Heo <[email protected]>
+ */
+#include <bpf/bpf.h>
+#include <scx/common.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include "select_cpu_dfl.bpf.skel.h"
+#include "scx_test.h"
+
+#define NUM_CHILDREN 1028
+
+static enum scx_test_status setup(void **ctx)
+{
+ struct select_cpu_dfl *skel;
+
+ skel = select_cpu_dfl__open_and_load();
+ SCX_FAIL_IF(!skel, "Failed to open and load skel");
+ *ctx = skel;
+
+ return SCX_TEST_PASS;
+}
+
+static enum scx_test_status run(void *ctx)
+{
+ struct select_cpu_dfl *skel = ctx;
+ struct bpf_link *link;
+ pid_t pids[NUM_CHILDREN];
+ int i, status;
+
+ link = bpf_map__attach_struct_ops(skel->maps.select_cpu_dfl_ops);
+ SCX_FAIL_IF(!link, "Failed to attach scheduler");
+
+ for (i = 0; i < NUM_CHILDREN; i++) {
+ pids[i] = fork();
+ if (pids[i] == 0) {
+ sleep(1);
+ exit(0);
+ }
+ }
+
+ for (i = 0; i < NUM_CHILDREN; i++) {
+ SCX_EQ(waitpid(pids[i], &status, 0), pids[i]);
+ SCX_EQ(status, 0);
+ }
+
+ SCX_ASSERT(!skel->bss->saw_local);
+
+ bpf_link__destroy(link);
+
+ return SCX_TEST_PASS;
+}
+
+static void cleanup(void *ctx)
+{
+ struct select_cpu_dfl *skel = ctx;
+
+ select_cpu_dfl__destroy(skel);
+}
+
+struct scx_test select_cpu_dfl = {
+ .name = "select_cpu_dfl",
+ .description = "Verify the default ops.select_cpu() dispatches tasks "
+ "when idles cores are found, and skips ops.enqueue()",
+ .setup = setup,
+ .run = run,
+ .cleanup = cleanup,
+};
+REGISTER_SCX_TEST(&select_cpu_dfl)
diff --git a/tools/testing/selftests/sched_ext/select_cpu_dfl_nodispatch.bpf.c b/tools/testing/selftests/sched_ext/select_cpu_dfl_nodispatch.bpf.c
new file mode 100644
index 000000000000..4bb5abb2d369
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/select_cpu_dfl_nodispatch.bpf.c
@@ -0,0 +1,89 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * A scheduler that validates the behavior of direct dispatching with a default
+ * select_cpu implementation, and with the SCX_OPS_ENQ_DFL_NO_DISPATCH ops flag
+ * specified.
+ *
+ * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2023 David Vernet <[email protected]>
+ * Copyright (c) 2023 Tejun Heo <[email protected]>
+ */
+
+#include <scx/common.bpf.h>
+
+char _license[] SEC("license") = "GPL";
+
+bool saw_local = false;
+
+/* Per-task scheduling context */
+struct task_ctx {
+ bool force_local; /* CPU changed by ops.select_cpu() */
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
+ __uint(map_flags, BPF_F_NO_PREALLOC);
+ __type(key, int);
+ __type(value, struct task_ctx);
+} task_ctx_stor SEC(".maps");
+
+/* Manually specify the signature until the kfunc is added to the scx repo. */
+s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags,
+ bool *found) __ksym;
+
+s32 BPF_STRUCT_OPS(select_cpu_dfl_nodispatch_select_cpu, struct task_struct *p,
+ s32 prev_cpu, u64 wake_flags)
+{
+ struct task_ctx *tctx;
+ s32 cpu;
+
+ tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
+ if (!tctx) {
+ scx_bpf_error("task_ctx lookup failed");
+ return -ESRCH;
+ }
+
+ cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags,
+ &tctx->force_local);
+
+ return cpu;
+}
+
+void BPF_STRUCT_OPS(select_cpu_dfl_nodispatch_enqueue, struct task_struct *p,
+ u64 enq_flags)
+{
+ u64 dsq_id = SCX_DSQ_GLOBAL;
+ struct task_ctx *tctx;
+
+ tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
+ if (!tctx) {
+ scx_bpf_error("task_ctx lookup failed");
+ return;
+ }
+
+ if (tctx->force_local) {
+ dsq_id = SCX_DSQ_LOCAL;
+ tctx->force_local = false;
+ saw_local = true;
+ }
+
+ scx_bpf_dispatch(p, dsq_id, SCX_SLICE_DFL, enq_flags);
+}
+
+s32 BPF_STRUCT_OPS(select_cpu_dfl_nodispatch_init_task,
+ struct task_struct *p, struct scx_init_task_args *args)
+{
+ if (bpf_task_storage_get(&task_ctx_stor, p, 0,
+ BPF_LOCAL_STORAGE_GET_F_CREATE))
+ return 0;
+ else
+ return -ENOMEM;
+}
+
+SEC(".struct_ops.link")
+struct sched_ext_ops select_cpu_dfl_nodispatch_ops = {
+ .select_cpu = select_cpu_dfl_nodispatch_select_cpu,
+ .enqueue = select_cpu_dfl_nodispatch_enqueue,
+ .init_task = select_cpu_dfl_nodispatch_init_task,
+ .name = "select_cpu_dfl_nodispatch",
+};
diff --git a/tools/testing/selftests/sched_ext/select_cpu_dfl_nodispatch.c b/tools/testing/selftests/sched_ext/select_cpu_dfl_nodispatch.c
new file mode 100644
index 000000000000..1d85bf4bf3a3
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/select_cpu_dfl_nodispatch.c
@@ -0,0 +1,72 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2023 David Vernet <[email protected]>
+ * Copyright (c) 2023 Tejun Heo <[email protected]>
+ */
+#include <bpf/bpf.h>
+#include <scx/common.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include "select_cpu_dfl_nodispatch.bpf.skel.h"
+#include "scx_test.h"
+
+#define NUM_CHILDREN 1028
+
+static enum scx_test_status setup(void **ctx)
+{
+ struct select_cpu_dfl_nodispatch *skel;
+
+ skel = select_cpu_dfl_nodispatch__open_and_load();
+ SCX_FAIL_IF(!skel, "Failed to open and load skel");
+ *ctx = skel;
+
+ return SCX_TEST_PASS;
+}
+
+static enum scx_test_status run(void *ctx)
+{
+ struct select_cpu_dfl_nodispatch *skel = ctx;
+ struct bpf_link *link;
+ pid_t pids[NUM_CHILDREN];
+ int i, status;
+
+ link = bpf_map__attach_struct_ops(skel->maps.select_cpu_dfl_nodispatch_ops);
+ SCX_FAIL_IF(!link, "Failed to attach scheduler");
+
+ for (i = 0; i < NUM_CHILDREN; i++) {
+ pids[i] = fork();
+ if (pids[i] == 0) {
+ sleep(1);
+ exit(0);
+ }
+ }
+
+ for (i = 0; i < NUM_CHILDREN; i++) {
+ SCX_EQ(waitpid(pids[i], &status, 0), pids[i]);
+ SCX_EQ(status, 0);
+ }
+
+ SCX_ASSERT(skel->bss->saw_local);
+
+ bpf_link__destroy(link);
+
+ return SCX_TEST_PASS;
+}
+
+static void cleanup(void *ctx)
+{
+ struct select_cpu_dfl_nodispatch *skel = ctx;
+
+ select_cpu_dfl_nodispatch__destroy(skel);
+}
+
+struct scx_test select_cpu_dfl_nodispatch = {
+ .name = "select_cpu_dfl_nodispatch",
+ .description = "Verify behavior of scx_bpf_select_cpu_dfl() in "
+ "ops.select_cpu()",
+ .setup = setup,
+ .run = run,
+ .cleanup = cleanup,
+};
+REGISTER_SCX_TEST(&select_cpu_dfl_nodispatch)
diff --git a/tools/testing/selftests/sched_ext/select_cpu_dispatch.bpf.c b/tools/testing/selftests/sched_ext/select_cpu_dispatch.bpf.c
new file mode 100644
index 000000000000..f0b96a4a04b2
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/select_cpu_dispatch.bpf.c
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * A scheduler that validates the behavior of direct dispatching from a
+ * custom ops.select_cpu() implementation.
+ *
+ * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2023 David Vernet <[email protected]>
+ * Copyright (c) 2023 Tejun Heo <[email protected]>
+ */
+
+#include <scx/common.bpf.h>
+
+char _license[] SEC("license") = "GPL";
+
+s32 BPF_STRUCT_OPS(select_cpu_dispatch_select_cpu, struct task_struct *p,
+ s32 prev_cpu, u64 wake_flags)
+{
+ u64 dsq_id = SCX_DSQ_LOCAL;
+ s32 cpu = prev_cpu;
+
+ if (scx_bpf_test_and_clear_cpu_idle(cpu))
+ goto dispatch;
+
+ cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
+ if (cpu >= 0)
+ goto dispatch;
+
+ dsq_id = SCX_DSQ_GLOBAL;
+ cpu = prev_cpu;
+
+dispatch:
+ scx_bpf_dispatch(p, dsq_id, SCX_SLICE_DFL, 0);
+ return cpu;
+}
+
+SEC(".struct_ops.link")
+struct sched_ext_ops select_cpu_dispatch_ops = {
+ .select_cpu = select_cpu_dispatch_select_cpu,
+ .name = "select_cpu_dispatch",
+ .timeout_ms = 1000U,
+};
diff --git a/tools/testing/selftests/sched_ext/select_cpu_dispatch.c b/tools/testing/selftests/sched_ext/select_cpu_dispatch.c
new file mode 100644
index 000000000000..0309ca8785b3
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/select_cpu_dispatch.c
@@ -0,0 +1,70 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2023 David Vernet <[email protected]>
+ * Copyright (c) 2023 Tejun Heo <[email protected]>
+ */
+#include <bpf/bpf.h>
+#include <scx/common.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include "select_cpu_dispatch.bpf.skel.h"
+#include "scx_test.h"
+
+#define NUM_CHILDREN 1028
+
+static enum scx_test_status setup(void **ctx)
+{
+ struct select_cpu_dispatch *skel;
+
+ skel = select_cpu_dispatch__open_and_load();
+ SCX_FAIL_IF(!skel, "Failed to open and load skel");
+ *ctx = skel;
+
+ return SCX_TEST_PASS;
+}
+
+static enum scx_test_status run(void *ctx)
+{
+ struct select_cpu_dispatch *skel = ctx;
+ struct bpf_link *link;
+ pid_t pids[NUM_CHILDREN];
+ int i, status;
+
+ link = bpf_map__attach_struct_ops(skel->maps.select_cpu_dispatch_ops);
+ SCX_FAIL_IF(!link, "Failed to attach scheduler");
+
+ for (i = 0; i < NUM_CHILDREN; i++) {
+ pids[i] = fork();
+ if (pids[i] == 0) {
+ sleep(1);
+ exit(0);
+ }
+ }
+
+ for (i = 0; i < NUM_CHILDREN; i++) {
+ SCX_EQ(waitpid(pids[i], &status, 0), pids[i]);
+ SCX_EQ(status, 0);
+ }
+
+ bpf_link__destroy(link);
+
+ return SCX_TEST_PASS;
+}
+
+static void cleanup(void *ctx)
+{
+ struct select_cpu_dispatch *skel = ctx;
+
+ select_cpu_dispatch__destroy(skel);
+}
+
+struct scx_test select_cpu_dispatch = {
+ .name = "select_cpu_dispatch",
+ .description = "Test direct dispatching to built-in DSQs from "
+ "ops.select_cpu()",
+ .setup = setup,
+ .run = run,
+ .cleanup = cleanup,
+};
+REGISTER_SCX_TEST(&select_cpu_dispatch)
diff --git a/tools/testing/selftests/sched_ext/select_cpu_dispatch_bad_dsq.bpf.c b/tools/testing/selftests/sched_ext/select_cpu_dispatch_bad_dsq.bpf.c
new file mode 100644
index 000000000000..7b42ddce0f56
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/select_cpu_dispatch_bad_dsq.bpf.c
@@ -0,0 +1,37 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * A scheduler that validates that dispatching to an invalid DSQ from
+ * ops.select_cpu() triggers a graceful scheduler exit.
+ *
+ * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2023 David Vernet <[email protected]>
+ * Copyright (c) 2023 Tejun Heo <[email protected]>
+ */
+
+#include <scx/common.bpf.h>
+
+char _license[] SEC("license") = "GPL";
+
+UEI_DEFINE(uei);
+
+s32 BPF_STRUCT_OPS(select_cpu_dispatch_bad_dsq_select_cpu, struct task_struct *p,
+ s32 prev_cpu, u64 wake_flags)
+{
+ /* Dispatching to a random DSQ should fail. */
+ scx_bpf_dispatch(p, 0xcafef00d, SCX_SLICE_DFL, 0);
+
+ return prev_cpu;
+}
+
+void BPF_STRUCT_OPS(select_cpu_dispatch_bad_dsq_exit, struct scx_exit_info *ei)
+{
+ UEI_RECORD(uei, ei);
+}
+
+SEC(".struct_ops.link")
+struct sched_ext_ops select_cpu_dispatch_bad_dsq_ops = {
+ .select_cpu = select_cpu_dispatch_bad_dsq_select_cpu,
+ .exit = select_cpu_dispatch_bad_dsq_exit,
+ .name = "select_cpu_dispatch_bad_dsq",
+ .timeout_ms = 1000U,
+};
diff --git a/tools/testing/selftests/sched_ext/select_cpu_dispatch_bad_dsq.c b/tools/testing/selftests/sched_ext/select_cpu_dispatch_bad_dsq.c
new file mode 100644
index 000000000000..47eb6ed7627d
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/select_cpu_dispatch_bad_dsq.c
@@ -0,0 +1,56 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2023 David Vernet <[email protected]>
+ * Copyright (c) 2023 Tejun Heo <[email protected]>
+ */
+#include <bpf/bpf.h>
+#include <scx/common.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include "select_cpu_dispatch_bad_dsq.bpf.skel.h"
+#include "scx_test.h"
+
+static enum scx_test_status setup(void **ctx)
+{
+ struct select_cpu_dispatch_bad_dsq *skel;
+
+ skel = select_cpu_dispatch_bad_dsq__open_and_load();
+ SCX_FAIL_IF(!skel, "Failed to open and load skel");
+ *ctx = skel;
+
+ return SCX_TEST_PASS;
+}
+
+static enum scx_test_status run(void *ctx)
+{
+ struct select_cpu_dispatch_bad_dsq *skel = ctx;
+ struct bpf_link *link;
+
+ link = bpf_map__attach_struct_ops(skel->maps.select_cpu_dispatch_bad_dsq_ops);
+ SCX_FAIL_IF(!link, "Failed to attach scheduler");
+
+ sleep(1);
+
+ SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_ERROR));
+ bpf_link__destroy(link);
+
+ return SCX_TEST_PASS;
+}
+
+static void cleanup(void *ctx)
+{
+ struct select_cpu_dispatch_bad_dsq *skel = ctx;
+
+ select_cpu_dispatch_bad_dsq__destroy(skel);
+}
+
+struct scx_test select_cpu_dispatch_bad_dsq = {
+ .name = "select_cpu_dispatch_bad_dsq",
+ .description = "Verify graceful failure if we direct-dispatch to a "
+ "bogus DSQ in ops.select_cpu()",
+ .setup = setup,
+ .run = run,
+ .cleanup = cleanup,
+};
+REGISTER_SCX_TEST(&select_cpu_dispatch_bad_dsq)
diff --git a/tools/testing/selftests/sched_ext/select_cpu_dispatch_dbl_dsp.bpf.c b/tools/testing/selftests/sched_ext/select_cpu_dispatch_dbl_dsp.bpf.c
new file mode 100644
index 000000000000..653e3dc0b4dc
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/select_cpu_dispatch_dbl_dsp.bpf.c
@@ -0,0 +1,38 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * A scheduler that validates that dispatching twice in a row from
+ * ops.select_cpu() triggers a graceful scheduler exit.
+ *
+ * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2023 David Vernet <[email protected]>
+ * Copyright (c) 2023 Tejun Heo <[email protected]>
+ */
+
+#include <scx/common.bpf.h>
+
+char _license[] SEC("license") = "GPL";
+
+UEI_DEFINE(uei);
+
+s32 BPF_STRUCT_OPS(select_cpu_dispatch_dbl_dsp_select_cpu, struct task_struct *p,
+ s32 prev_cpu, u64 wake_flags)
+{
+ /* Dispatching twice in a row is disallowed. */
+ scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, 0);
+ scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, 0);
+
+ return prev_cpu;
+}
+
+void BPF_STRUCT_OPS(select_cpu_dispatch_dbl_dsp_exit, struct scx_exit_info *ei)
+{
+ UEI_RECORD(uei, ei);
+}
+
+SEC(".struct_ops.link")
+struct sched_ext_ops select_cpu_dispatch_dbl_dsp_ops = {
+ .select_cpu = select_cpu_dispatch_dbl_dsp_select_cpu,
+ .exit = select_cpu_dispatch_dbl_dsp_exit,
+ .name = "select_cpu_dispatch_dbl_dsp",
+ .timeout_ms = 1000U,
+};
diff --git a/tools/testing/selftests/sched_ext/select_cpu_dispatch_dbl_dsp.c b/tools/testing/selftests/sched_ext/select_cpu_dispatch_dbl_dsp.c
new file mode 100644
index 000000000000..48ff028a3c46
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/select_cpu_dispatch_dbl_dsp.c
@@ -0,0 +1,56 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2023 David Vernet <[email protected]>
+ * Copyright (c) 2023 Tejun Heo <[email protected]>
+ */
+#include <bpf/bpf.h>
+#include <scx/common.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include "select_cpu_dispatch_dbl_dsp.bpf.skel.h"
+#include "scx_test.h"
+
+static enum scx_test_status setup(void **ctx)
+{
+ struct select_cpu_dispatch_dbl_dsp *skel;
+
+ skel = select_cpu_dispatch_dbl_dsp__open_and_load();
+ SCX_FAIL_IF(!skel, "Failed to open and load skel");
+ *ctx = skel;
+
+ return SCX_TEST_PASS;
+}
+
+static enum scx_test_status run(void *ctx)
+{
+ struct select_cpu_dispatch_dbl_dsp *skel = ctx;
+ struct bpf_link *link;
+
+ link = bpf_map__attach_struct_ops(skel->maps.select_cpu_dispatch_dbl_dsp_ops);
+ SCX_FAIL_IF(!link, "Failed to attach scheduler");
+
+ sleep(1);
+
+ SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_ERROR));
+ bpf_link__destroy(link);
+
+ return SCX_TEST_PASS;
+}
+
+static void cleanup(void *ctx)
+{
+ struct select_cpu_dispatch_dbl_dsp *skel = ctx;
+
+ select_cpu_dispatch_dbl_dsp__destroy(skel);
+}
+
+struct scx_test select_cpu_dispatch_dbl_dsp = {
+ .name = "select_cpu_dispatch_dbl_dsp",
+ .description = "Verify graceful failure if we dispatch twice to a "
+ "DSQ in ops.select_cpu()",
+ .setup = setup,
+ .run = run,
+ .cleanup = cleanup,
+};
+REGISTER_SCX_TEST(&select_cpu_dispatch_dbl_dsp)
diff --git a/tools/testing/selftests/sched_ext/select_cpu_vtime.bpf.c b/tools/testing/selftests/sched_ext/select_cpu_vtime.bpf.c
new file mode 100644
index 000000000000..7f3ebf4fc2ea
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/select_cpu_vtime.bpf.c
@@ -0,0 +1,92 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * A scheduler that validates that enqueue flags are properly stored and
+ * applied at dispatch time when a task is directly dispatched from
+ * ops.select_cpu(). We validate this by using scx_bpf_dispatch_vtime(), and
+ * making the test a very basic vtime scheduler.
+ *
+ * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2024 David Vernet <[email protected]>
+ * Copyright (c) 2024 Tejun Heo <[email protected]>
+ */
+
+#include <scx/common.bpf.h>
+
+char _license[] SEC("license") = "GPL";
+
+volatile bool consumed;
+
+static u64 vtime_now;
+
+#define VTIME_DSQ 0
+
+static inline bool vtime_before(u64 a, u64 b)
+{
+ return (s64)(a - b) < 0;
+}
+
+static inline u64 task_vtime(const struct task_struct *p)
+{
+ u64 vtime = p->scx.dsq_vtime;
+
+ if (vtime_before(vtime, vtime_now - SCX_SLICE_DFL))
+ return vtime_now - SCX_SLICE_DFL;
+ else
+ return vtime;
+}
+
+s32 BPF_STRUCT_OPS(select_cpu_vtime_select_cpu, struct task_struct *p,
+ s32 prev_cpu, u64 wake_flags)
+{
+ s32 cpu;
+
+ cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
+ if (cpu >= 0)
+ goto ddsp;
+
+ cpu = prev_cpu;
+ scx_bpf_test_and_clear_cpu_idle(cpu);
+ddsp:
+ scx_bpf_dispatch_vtime(p, VTIME_DSQ, SCX_SLICE_DFL, task_vtime(p), 0);
+ return cpu;
+}
+
+void BPF_STRUCT_OPS(select_cpu_vtime_dispatch, s32 cpu, struct task_struct *p)
+{
+ if (scx_bpf_consume(VTIME_DSQ))
+ consumed = true;
+}
+
+void BPF_STRUCT_OPS(select_cpu_vtime_running, struct task_struct *p)
+{
+ if (vtime_before(vtime_now, p->scx.dsq_vtime))
+ vtime_now = p->scx.dsq_vtime;
+}
+
+void BPF_STRUCT_OPS(select_cpu_vtime_stopping, struct task_struct *p,
+ bool runnable)
+{
+ p->scx.dsq_vtime += (SCX_SLICE_DFL - p->scx.slice) * 100 / p->scx.weight;
+}
+
+void BPF_STRUCT_OPS(select_cpu_vtime_enable, struct task_struct *p)
+{
+ p->scx.dsq_vtime = vtime_now;
+}
+
+s32 BPF_STRUCT_OPS_SLEEPABLE(select_cpu_vtime_init)
+{
+ return scx_bpf_create_dsq(VTIME_DSQ, -1);
+}
+
+SEC(".struct_ops.link")
+struct sched_ext_ops select_cpu_vtime_ops = {
+ .select_cpu = select_cpu_vtime_select_cpu,
+ .dispatch = select_cpu_vtime_dispatch,
+ .running = select_cpu_vtime_running,
+ .stopping = select_cpu_vtime_stopping,
+ .enable = select_cpu_vtime_enable,
+ .init = select_cpu_vtime_init,
+ .name = "select_cpu_vtime",
+ .timeout_ms = 1000U,
+};
diff --git a/tools/testing/selftests/sched_ext/select_cpu_vtime.c b/tools/testing/selftests/sched_ext/select_cpu_vtime.c
new file mode 100644
index 000000000000..b4629c2364f5
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/select_cpu_vtime.c
@@ -0,0 +1,59 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2024 David Vernet <[email protected]>
+ * Copyright (c) 2024 Tejun Heo <[email protected]>
+ */
+#include <bpf/bpf.h>
+#include <scx/common.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include "select_cpu_vtime.bpf.skel.h"
+#include "scx_test.h"
+
+static enum scx_test_status setup(void **ctx)
+{
+ struct select_cpu_vtime *skel;
+
+ skel = select_cpu_vtime__open_and_load();
+ SCX_FAIL_IF(!skel, "Failed to open and load skel");
+ *ctx = skel;
+
+ return SCX_TEST_PASS;
+}
+
+static enum scx_test_status run(void *ctx)
+{
+ struct select_cpu_vtime *skel = ctx;
+ struct bpf_link *link;
+
+ SCX_ASSERT(!skel->bss->consumed);
+
+ link = bpf_map__attach_struct_ops(skel->maps.select_cpu_vtime_ops);
+ SCX_FAIL_IF(!link, "Failed to attach scheduler");
+
+ sleep(1);
+
+ SCX_ASSERT(skel->bss->consumed);
+
+ bpf_link__destroy(link);
+
+ return SCX_TEST_PASS;
+}
+
+static void cleanup(void *ctx)
+{
+ struct select_cpu_vtime *skel = ctx;
+
+ select_cpu_vtime__destroy(skel);
+}
+
+struct scx_test select_cpu_vtime = {
+ .name = "select_cpu_vtime",
+ .description = "Test doing direct vtime-dispatching from "
+ "ops.select_cpu(), to a non-built-in DSQ",
+ .setup = setup,
+ .run = run,
+ .cleanup = cleanup,
+};
+REGISTER_SCX_TEST(&select_cpu_vtime)
diff --git a/tools/testing/selftests/sched_ext/test_example.c b/tools/testing/selftests/sched_ext/test_example.c
new file mode 100644
index 000000000000..ce36cdf03cdc
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/test_example.c
@@ -0,0 +1,49 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2024 Tejun Heo <[email protected]>
+ * Copyright (c) 2024 David Vernet <[email protected]>
+ */
+#include <bpf/bpf.h>
+#include <scx/common.h>
+#include "scx_test.h"
+
+static bool setup_called = false;
+static bool run_called = false;
+static bool cleanup_called = false;
+
+static int context = 10;
+
+static enum scx_test_status setup(void **ctx)
+{
+ setup_called = true;
+ *ctx = &context;
+
+ return SCX_TEST_PASS;
+}
+
+static enum scx_test_status run(void *ctx)
+{
+ int *arg = ctx;
+
+ SCX_ASSERT(setup_called);
+ SCX_ASSERT(!run_called && !cleanup_called);
+ SCX_EQ(*arg, context);
+
+ run_called = true;
+ return SCX_TEST_PASS;
+}
+
+static void cleanup(void *ctx)
+{
+ SCX_BUG_ON(!run_called || cleanup_called, "Wrong callbacks invoked");
+}
+
+struct scx_test example = {
+ .name = "example",
+ .description = "Validate the basic function of the test suite itself",
+ .setup = setup,
+ .run = run,
+ .cleanup = cleanup,
+};
+REGISTER_SCX_TEST(&example)
diff --git a/tools/testing/selftests/sched_ext/util.c b/tools/testing/selftests/sched_ext/util.c
new file mode 100644
index 000000000000..e47769c91918
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/util.c
@@ -0,0 +1,71 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2024 David Vernet <[email protected]>
+ */
+#include <errno.h>
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+/* Returns read len on success, or -errno on failure. */
+static ssize_t read_text(const char *path, char *buf, size_t max_len)
+{
+ ssize_t len;
+ int fd;
+
+ fd = open(path, O_RDONLY);
+ if (fd < 0)
+ return -errno;
+
+ len = read(fd, buf, max_len - 1);
+
+ if (len >= 0)
+ buf[len] = 0;
+
+ close(fd);
+ return len < 0 ? -errno : len;
+}
+
+/* Returns written len on success, or -errno on failure. */
+static ssize_t write_text(const char *path, char *buf, ssize_t len)
+{
+ int fd;
+ ssize_t written;
+
+ fd = open(path, O_WRONLY | O_APPEND);
+ if (fd < 0)
+ return -errno;
+
+ written = write(fd, buf, len);
+ close(fd);
+ return written < 0 ? -errno : written;
+}
+
+long file_read_long(const char *path)
+{
+ char buf[128];
+
+
+ if (read_text(path, buf, sizeof(buf)) <= 0)
+ return -1;
+
+ return atol(buf);
+}
+
+int file_write_long(const char *path, long val)
+{
+ char buf[64];
+ int ret;
+
+ ret = sprintf(buf, "%ld", val);
+ if (ret < 0)
+ return ret;
+
+ if (write_text(path, buf, ret) <= 0)
+ return -1;
+
+ return 0;
+}
diff --git a/tools/testing/selftests/sched_ext/util.h b/tools/testing/selftests/sched_ext/util.h
new file mode 100644
index 000000000000..bc13dfec1267
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/util.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2024 David Vernet <[email protected]>
+ */
+
+#ifndef __SCX_TEST_UTIL_H__
+#define __SCX_TEST_UTIL_H__
+
+long file_read_long(const char *path);
+int file_write_long(const char *path, long val);
+
+#endif // __SCX_TEST_UTIL_H__
--
2.44.0


2024-05-01 15:32:25

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 36/39] sched_ext: Implement DSQ iterator

DSQs are very opaque in the consumption path. The BPF scheduler has no way
of knowing which tasks are being considered and which one is picked. This
patch adds a BPF DSQ iterator.

- Allows iterating over the tasks queued on a DSQ, in dispatch order or in
reverse, from anywhere, using bpf_for_each(scx_dsq) or by calling the
iterator kfuncs directly.

- Allows consuming arbitrary tasks on the DSQ in any order while iterating
in the dispatch path using the new scx_bpf_consume_task().

- Guarantees that only tasks which were already queued when the iteration
started are visible and consumable during the iteration.

Note that scx_bpf_consume_task() does a bit of a dance to pass the pointer to
the iterator to __scx_bpf_consume_task(). This works around a current
limitation in the BPF verifier, which doesn't allow the memory area used for
an iterator to be passed into kfuncs. We should be able to remove this
workaround in the future.

scx_qmap is updated to implement periodic dumping of the shared DSQ and a
rather silly prioritization mechanism to demonstrate the use of DSQ
iteration and selective consumption.
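
As a rough illustration (not part of the patch), ops.dispatch() in a BPF
scheduler could combine the iterator with selective consumption along the
following lines. MY_DSQ and the "worker" comm prefix are made-up placeholders
for this untested sketch:

  /*
   * Illustrative fragment only; assumes <scx/common.bpf.h> and a user DSQ
   * MY_DSQ created via scx_bpf_create_dsq() in ops.init(). The "worker"
   * comm prefix is just an example.
   */
  void BPF_STRUCT_OPS(example_dispatch, s32 cpu, struct task_struct *prev)
  {
          struct task_struct *p;
          bool found = false;

          /* walk MY_DSQ in dispatch order and expedite matching tasks */
          bpf_for_each(scx_dsq, p, MY_DSQ, 0) {
                  if (!bpf_strncmp(p->comm, 6, "worker") &&
                      scx_bpf_consume_task(BPF_FOR_EACH_ITER, p))
                          found = true;
          }

          /* otherwise fall back to plain FIFO consumption */
          if (!found)
                  scx_bpf_consume(MY_DSQ);
  }

scx_qmap's consume_shared_dsq() below does essentially this, wrapped in the
__COMPAT_*() helpers so that it still loads on older kernels.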

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
---
include/linux/sched/ext.h | 4 +
kernel/sched/ext.c | 271 ++++++++++++++++++++++-
tools/sched_ext/include/scx/common.bpf.h | 19 ++
tools/sched_ext/include/scx/compat.bpf.h | 15 ++
tools/sched_ext/include/scx/compat.h | 6 +
tools/sched_ext/scx_qmap.bpf.c | 98 +++++++-
tools/sched_ext/scx_qmap.c | 22 +-
7 files changed, 422 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index 8c6299915800..32cc5f439983 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -59,6 +59,7 @@ struct scx_dispatch_q {
struct list_head list; /* tasks in dispatch order */
struct rb_root priq; /* used to order by p->scx.dsq_vtime */
u32 nr;
+ u64 seq; /* used by BPF iter */
u64 id;
struct rhash_head hash_node;
struct llist_node free_node;
@@ -92,6 +93,8 @@ enum scx_task_state {
/* scx_entity.dsq_flags */
enum scx_ent_dsq_flags {
SCX_TASK_DSQ_ON_PRIQ = 1 << 0, /* task is queued on the priority queue of a dsq */
+
+ SCX_TASK_DSQ_CURSOR = 1 << 31, /* iteration cursor, not a task */
};

/*
@@ -132,6 +135,7 @@ struct scx_dsq_node {
struct sched_ext_entity {
struct scx_dispatch_q *dsq;
struct scx_dsq_node dsq_node; /* protected by dsq lock */
+ u64 dsq_seq;
u32 flags; /* protected by rq lock */
u32 weight;
s32 sticky_cpu;
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 13ba4d3d39bd..fb4849fb7afd 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1088,6 +1088,78 @@ static __always_inline bool scx_kf_allowed_on_arg_tasks(u32 mask,
return true;
}

+/**
+ * nldsq_next_task - Iterate to the next task in a non-local DSQ
+ * @dsq: user DSQ being iterated
+ * @cur: current position, %NULL to start iteration
+ * @rev: walk backwards
+ *
+ * Returns %NULL when iteration is finished.
+ */
+static struct task_struct *nldsq_next_task(struct scx_dispatch_q *dsq,
+ struct task_struct *cur, bool rev)
+{
+ struct list_head *list_node;
+ struct scx_dsq_node *dsq_node;
+
+ lockdep_assert_held(&dsq->lock);
+
+ if (cur)
+ list_node = &cur->scx.dsq_node.list;
+ else
+ list_node = &dsq->list;
+
+ /* find the next task, need to skip BPF iteration cursors */
+ do {
+ if (rev)
+ list_node = list_node->prev;
+ else
+ list_node = list_node->next;
+
+ if (list_node == &dsq->list)
+ return NULL;
+
+ dsq_node = container_of(list_node, struct scx_dsq_node, list);
+ } while (dsq_node->flags & SCX_TASK_DSQ_CURSOR);
+
+ return container_of(dsq_node, struct task_struct, scx.dsq_node);
+}
+
+#define nldsq_for_each_task(p, dsq) \
+ for ((p) = nldsq_next_task((dsq), NULL, false); (p); \
+ (p) = nldsq_next_task((dsq), (p), false))
+
+
+/*
+ * BPF DSQ iterator. Tasks in a non-local DSQ can be iterated in [reverse]
+ * dispatch order. The BPF-visible iterator is opaque and oversized to allow
+ * future changes without breaking backward compatibility. Can be used with
+ * bpf_for_each(). See bpf_iter_scx_dsq_*().
+ */
+enum scx_dsq_iter_flags {
+ /* iterate in the reverse dispatch order */
+ SCX_DSQ_ITER_REV = 1LLU << 0,
+
+ __SCX_DSQ_ITER_ALL_FLAGS = SCX_DSQ_ITER_REV,
+};
+
+struct bpf_iter_scx_dsq_kern {
+ /*
+ * Must be the first field. Used to work around BPF restriction and pass
+ * in the iterator pointer to scx_bpf_consume_task().
+ */
+ struct bpf_iter_scx_dsq_kern *self;
+
+ struct scx_dsq_node cursor;
+ struct scx_dispatch_q *dsq;
+ u64 dsq_seq;
+ u64 flags;
+} __attribute__((aligned(8)));
+
+struct bpf_iter_scx_dsq {
+ u64 __opaque[12];
+} __attribute__((aligned(8)));
+

/*
* SCX task iterator.
@@ -1429,7 +1501,7 @@ static void dispatch_enqueue(struct scx_dispatch_q *dsq, struct task_struct *p,
* tested easily when adding the first task.
*/
if (unlikely(RB_EMPTY_ROOT(&dsq->priq) &&
- !list_empty(&dsq->list)))
+ nldsq_next_task(dsq, NULL, false)))
scx_ops_error("DSQ ID 0x%016llx already had FIFO-enqueued tasks",
dsq->id);

@@ -1461,8 +1533,12 @@ static void dispatch_enqueue(struct scx_dispatch_q *dsq, struct task_struct *p,
list_add_tail(&p->scx.dsq_node.list, &dsq->list);
}

+ /* seq records the order tasks are queued, used by BPF DSQ iterator */
+ dsq->seq++;
+ p->scx.dsq_seq = dsq->seq;
+
dsq_mod_nr(dsq, 1);
- p->scx.dsq = dsq;
+ WRITE_ONCE(p->scx.dsq, dsq);

/*
* scx.ddsp_dsq_id and scx.ddsp_enq_flags are only relevant on the
@@ -1555,7 +1631,7 @@ static void dispatch_dequeue(struct scx_rq *scx_rq, struct task_struct *p)
WARN_ON_ONCE(task_linked_on_dsq(p));
p->scx.holding_cpu = -1;
}
- p->scx.dsq = NULL;
+ WRITE_ONCE(p->scx.dsq, NULL);

if (!is_local)
raw_spin_unlock(&dsq->lock);
@@ -2059,7 +2135,7 @@ static void consume_local_task(struct rq *rq, struct scx_dispatch_q *dsq,
list_add_tail(&p->scx.dsq_node.list, &scx_rq->local_dsq.list);
dsq_mod_nr(dsq, -1);
dsq_mod_nr(&scx_rq->local_dsq, 1);
- p->scx.dsq = &scx_rq->local_dsq;
+ WRITE_ONCE(p->scx.dsq, &scx_rq->local_dsq);
raw_spin_unlock(&dsq->lock);
}

@@ -2131,7 +2207,7 @@ static bool consume_dispatch_q(struct rq *rq, struct rq_flags *rf,

raw_spin_lock(&dsq->lock);

- list_for_each_entry(p, &dsq->list, scx.dsq_node.list) {
+ nldsq_for_each_task(p, dsq) {
struct rq *task_rq = task_rq(p);

if (rq == task_rq) {
@@ -5693,12 +5769,88 @@ __bpf_kfunc bool scx_bpf_consume(u64 dsq_id)
}
}

+/**
+ * __scx_bpf_consume_task - Transfer a task from DSQ iteration to the local DSQ
+ * @it: DSQ iterator in progress
+ * @p: task to consume
+ *
+ * Transfer @p which is on the DSQ currently iterated by @it to the current
+ * CPU's local DSQ. For the transfer to be successful, @p must still be on the
+ * DSQ and have been queued before the DSQ iteration started. This function
+ * doesn't care whether @p was obtained from the DSQ iteration. @p just has to
+ * be on the DSQ and have been queued before the iteration started.
+ *
+ * Returns %true if @p has been consumed, %false if @p had already been consumed
+ * or dequeued.
+ */
+__bpf_kfunc bool __scx_bpf_consume_task(unsigned long it, struct task_struct *p)
+{
+ struct bpf_iter_scx_dsq_kern *kit = (void *)it;
+ struct scx_dispatch_q *dsq, *kit_dsq;
+ struct scx_dsp_ctx *dspc = this_cpu_ptr(&scx_dsp_ctx);
+ struct rq *task_rq;
+ u64 kit_dsq_seq;
+
+ /* can't trust @kit, carefully fetch the values we need */
+ if (get_kernel_nofault(kit_dsq, &kit->dsq) ||
+ get_kernel_nofault(kit_dsq_seq, &kit->dsq_seq)) {
+ scx_ops_error("invalid @it 0x%lx", it);
+ return false;
+ }
+
+ /*
+ * @kit can't be trusted and we can only get the DSQ from @p. As we
+ * don't know whether @p's rq is locked, use READ_ONCE() to access the field.
+ * Derefing is safe as DSQs are RCU protected.
+ */
+ dsq = READ_ONCE(p->scx.dsq);
+
+ if (unlikely(!dsq || dsq != kit_dsq))
+ return false;
+
+ if (unlikely(dsq->id == SCX_DSQ_LOCAL)) {
+ scx_ops_error("local DSQ not allowed");
+ return false;
+ }
+
+ if (!scx_kf_allowed(SCX_KF_DISPATCH))
+ return false;
+
+ flush_dispatch_buf(dspc->rq, dspc->rf);
+
+ raw_spin_lock(&dsq->lock);
+
+ /*
+ * Did someone else get to it? @p could have already left the DSQ, been
+ * re-enqueued, or be in the process of being consumed by someone else.
+ */
+ if (unlikely(p->scx.dsq != dsq ||
+ time_after64(p->scx.dsq_seq, kit_dsq_seq) ||
+ p->scx.holding_cpu >= 0))
+ goto out_unlock;
+
+ task_rq = task_rq(p);
+
+ if (dspc->rq == task_rq) {
+ consume_local_task(dspc->rq, dsq, p);
+ return true;
+ }
+
+ if (task_can_run_on_remote_rq(p, dspc->rq))
+ return consume_remote_task(dspc->rq, dspc->rf, dsq, p, task_rq);
+
+out_unlock:
+ raw_spin_unlock(&dsq->lock);
+ return false;
+}
+
__bpf_kfunc_end_defs();

BTF_KFUNCS_START(scx_kfunc_ids_dispatch)
BTF_ID_FLAGS(func, scx_bpf_dispatch_nr_slots)
BTF_ID_FLAGS(func, scx_bpf_dispatch_cancel)
BTF_ID_FLAGS(func, scx_bpf_consume)
+BTF_ID_FLAGS(func, __scx_bpf_consume_task)
BTF_KFUNCS_END(scx_kfunc_ids_dispatch)

static const struct btf_kfunc_id_set scx_kfunc_set_dispatch = {
@@ -5877,6 +6029,112 @@ __bpf_kfunc void scx_bpf_destroy_dsq(u64 dsq_id)
destroy_dsq(dsq_id);
}

+/**
+ * bpf_iter_scx_dsq_new - Create a DSQ iterator
+ * @it: iterator to initialize
+ * @dsq_id: DSQ to iterate
+ * @flags: %SCX_DSQ_ITER_*
+ *
+ * Initialize BPF iterator @it which can be used with bpf_for_each() to walk
+ * tasks in the DSQ specified by @dsq_id. Iteration using @it only includes
+ * tasks which are already queued when this function is invoked.
+ */
+__bpf_kfunc int bpf_iter_scx_dsq_new(struct bpf_iter_scx_dsq *it, u64 dsq_id,
+ u64 flags)
+{
+ struct bpf_iter_scx_dsq_kern *kit = (void *)it;
+
+ BUILD_BUG_ON(sizeof(struct bpf_iter_scx_dsq_kern) >
+ sizeof(struct bpf_iter_scx_dsq));
+ BUILD_BUG_ON(__alignof__(struct bpf_iter_scx_dsq_kern) !=
+ __alignof__(struct bpf_iter_scx_dsq));
+
+ if (flags & ~__SCX_DSQ_ITER_ALL_FLAGS)
+ return -EINVAL;
+
+ kit->dsq = find_non_local_dsq(dsq_id);
+ if (!kit->dsq)
+ return -ENOENT;
+
+ INIT_LIST_HEAD(&kit->cursor.list);
+ RB_CLEAR_NODE(&kit->cursor.priq);
+ kit->cursor.flags = SCX_TASK_DSQ_CURSOR;
+ kit->self = kit;
+ kit->dsq_seq = READ_ONCE(kit->dsq->seq);
+ kit->flags = flags;
+
+ return 0;
+}
+
+/**
+ * bpf_iter_scx_dsq_next - Progress a DSQ iterator
+ * @it: iterator to progress
+ *
+ * Return the next task. See bpf_iter_scx_dsq_new().
+ */
+__bpf_kfunc struct task_struct *bpf_iter_scx_dsq_next(struct bpf_iter_scx_dsq *it)
+{
+ struct bpf_iter_scx_dsq_kern *kit = (void *)it;
+ bool rev = kit->flags & SCX_DSQ_ITER_REV;
+ struct task_struct *p;
+ unsigned long flags;
+
+ if (!kit->dsq)
+ return NULL;
+
+ raw_spin_lock_irqsave(&kit->dsq->lock, flags);
+
+ if (list_empty(&kit->cursor.list))
+ p = NULL;
+ else
+ p = container_of(&kit->cursor, struct task_struct, scx.dsq_node);
+
+ /*
+ * Only tasks which were queued before the iteration started are
+ * visible. This bounds BPF iterations and guarantees that vtime never
+ * jumps in the other direction while iterating.
+ */
+ do {
+ p = nldsq_next_task(kit->dsq, p, rev);
+ } while (p && unlikely(time_after64(p->scx.dsq_seq, kit->dsq_seq)));
+
+ if (p) {
+ if (rev)
+ list_move_tail(&kit->cursor.list, &p->scx.dsq_node.list);
+ else
+ list_move(&kit->cursor.list, &p->scx.dsq_node.list);
+ } else {
+ list_del_init(&kit->cursor.list);
+ }
+
+ raw_spin_unlock_irqrestore(&kit->dsq->lock, flags);
+
+ return p;
+}
+
+/**
+ * bpf_iter_scx_dsq_destroy - Destroy a DSQ iterator
+ * @it: iterator to destroy
+ *
+ * Undo bpf_iter_scx_dsq_new().
+ */
+__bpf_kfunc void bpf_iter_scx_dsq_destroy(struct bpf_iter_scx_dsq *it)
+{
+ struct bpf_iter_scx_dsq_kern *kit = (void *)it;
+
+ if (!kit->dsq)
+ return;
+
+ if (!list_empty(&kit->cursor.list)) {
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&kit->dsq->lock, flags);
+ list_del_init(&kit->cursor.list);
+ raw_spin_unlock_irqrestore(&kit->dsq->lock, flags);
+ }
+ kit->dsq = NULL;
+}
+
__bpf_kfunc_end_defs();

struct scx_bpf_error_bstr_bufs {
@@ -6211,6 +6469,9 @@ BTF_KFUNCS_START(scx_kfunc_ids_any)
BTF_ID_FLAGS(func, scx_bpf_kick_cpu)
BTF_ID_FLAGS(func, scx_bpf_dsq_nr_queued)
BTF_ID_FLAGS(func, scx_bpf_destroy_dsq)
+BTF_ID_FLAGS(func, bpf_iter_scx_dsq_new, KF_ITER_NEW | KF_RCU_PROTECTED)
+BTF_ID_FLAGS(func, bpf_iter_scx_dsq_next, KF_ITER_NEXT | KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_iter_scx_dsq_destroy, KF_ITER_DESTROY)
BTF_ID_FLAGS(func, scx_bpf_exit_bstr, KF_TRUSTED_ARGS)
BTF_ID_FLAGS(func, scx_bpf_error_bstr, KF_TRUSTED_ARGS)
BTF_ID_FLAGS(func, scx_bpf_nr_cpu_ids)
diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
index 0b26046339ee..17c76919e450 100644
--- a/tools/sched_ext/include/scx/common.bpf.h
+++ b/tools/sched_ext/include/scx/common.bpf.h
@@ -35,10 +35,14 @@ void scx_bpf_dispatch_vtime(struct task_struct *p, u64 dsq_id, u64 slice, u64 vt
u32 scx_bpf_dispatch_nr_slots(void) __ksym;
void scx_bpf_dispatch_cancel(void) __ksym;
bool scx_bpf_consume(u64 dsq_id) __ksym;
+bool __scx_bpf_consume_task(unsigned long it, struct task_struct *p) __ksym __weak;
u32 scx_bpf_reenqueue_local(void) __ksym;
void scx_bpf_kick_cpu(s32 cpu, u64 flags) __ksym;
s32 scx_bpf_dsq_nr_queued(u64 dsq_id) __ksym;
void scx_bpf_destroy_dsq(u64 dsq_id) __ksym;
+int bpf_iter_scx_dsq_new(struct bpf_iter_scx_dsq *it, u64 dsq_id, bool rev) __ksym __weak;
+struct task_struct *bpf_iter_scx_dsq_next(struct bpf_iter_scx_dsq *it) __ksym __weak;
+void bpf_iter_scx_dsq_destroy(struct bpf_iter_scx_dsq *it) __ksym __weak;
void scx_bpf_exit_bstr(s64 exit_code, char *fmt, unsigned long long *data, u32 data__sz) __ksym __weak;
void scx_bpf_error_bstr(char *fmt, unsigned long long *data, u32 data_len) __ksym;
u32 scx_bpf_nr_cpu_ids(void) __ksym __weak;
@@ -55,6 +59,21 @@ bool scx_bpf_task_running(const struct task_struct *p) __ksym;
s32 scx_bpf_task_cpu(const struct task_struct *p) __ksym;
struct cgroup *scx_bpf_task_cgroup(struct task_struct *p) __ksym;

+/*
+ * Use the following as @it when calling scx_bpf_consume_task() from within
+ * bpf_for_each() loops.
+ */
+#define BPF_FOR_EACH_ITER (&___it)
+
+/* hopefully temporary wrapper to work around BPF restriction */
+static inline bool scx_bpf_consume_task(struct bpf_iter_scx_dsq *it,
+ struct task_struct *p)
+{
+ unsigned long ptr;
+ bpf_probe_read_kernel(&ptr, sizeof(ptr), it);
+ return __scx_bpf_consume_task(ptr, p);
+}
+
static inline __attribute__((format(printf, 1, 2)))
void ___scx_bpf_exit_format_checker(const char *fmt, ...) {}

diff --git a/tools/sched_ext/include/scx/compat.bpf.h b/tools/sched_ext/include/scx/compat.bpf.h
index 0729aa9bb03e..c17ef3757b31 100644
--- a/tools/sched_ext/include/scx/compat.bpf.h
+++ b/tools/sched_ext/include/scx/compat.bpf.h
@@ -56,6 +56,21 @@ static inline void __COMPAT_scx_bpf_switch_all(void)
#define __COMPAT_HAS_CPUMASKS \
bpf_ksym_exists(scx_bpf_nr_cpu_ids)

+/*
+ * Iteration and scx_bpf_consume_task() are new. The following become no-ops
+ * on older kernels. Users can switch to bpf_for_each(scx_dsq) and call
+ * scx_bpf_consume_task() directly in the future.
+ */
+#define __COMPAT_DSQ_FOR_EACH(p, dsq_id, flags) \
+ if (bpf_ksym_exists(bpf_iter_scx_dsq_new)) \
+ bpf_for_each(scx_dsq, (p), (dsq_id), (flags))
+
+static inline bool __COMPAT_scx_bpf_consume_task(struct bpf_iter_scx_dsq *it,
+ struct task_struct *p)
+{
+ return false;
+}
+
/*
* Define sched_ext_ops. This may be expanded to define multiple variants for
* backward compatibility. See compat.h::SCX_OPS_LOAD/ATTACH().
diff --git a/tools/sched_ext/include/scx/compat.h b/tools/sched_ext/include/scx/compat.h
index 7155b69150ff..7783c82a8a18 100644
--- a/tools/sched_ext/include/scx/compat.h
+++ b/tools/sched_ext/include/scx/compat.h
@@ -123,6 +123,12 @@ static inline bool __COMPAT_struct_has_field(const char *type, const char *field
#define __COMPAT_HAS_CPUMASKS \
__COMPAT_has_ksym("scx_bpf_nr_cpu_ids")

+/*
+ * The DSQ iterator is new. Users will be able to assume its existence in the future.
+ */
+#define __COMPAT_HAS_DSQ_ITER \
+ __COMPAT_has_ksym("bpf_iter_scx_dsq_new")
+
static inline long scx_hotplug_seq(void)
{
int fd;
diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
index a442031309c0..924e7e2b8c4c 100644
--- a/tools/sched_ext/scx_qmap.bpf.c
+++ b/tools/sched_ext/scx_qmap.bpf.c
@@ -23,6 +23,7 @@
* Copyright (c) 2022 David Vernet <[email protected]>
*/
#include <scx/common.bpf.h>
+#include <string.h>

enum consts {
ONE_SEC_IN_NS = 1000000000,
@@ -36,6 +37,8 @@ const volatile u32 stall_user_nth;
const volatile u32 stall_kernel_nth;
const volatile u32 dsp_inf_loop_after;
const volatile u32 dsp_batch;
+const volatile bool print_shared_dsq;
+const volatile char exp_prefix[17];
const volatile s32 disallow_tgid;
const volatile bool switch_partial;

@@ -106,7 +109,7 @@ struct {

/* Statistics */
u64 nr_enqueued, nr_dispatched, nr_reenqueued, nr_dequeued;
-u64 nr_core_sched_execed;
+u64 nr_core_sched_execed, nr_expedited;

s32 BPF_STRUCT_OPS(qmap_select_cpu, struct task_struct *p,
s32 prev_cpu, u64 wake_flags)
@@ -243,6 +246,37 @@ static void update_core_sched_head_seq(struct task_struct *p)
scx_bpf_error("task_ctx lookup failed");
}

+static bool consume_shared_dsq(void)
+{
+ struct task_struct *p;
+ bool consumed;
+
+ if (exp_prefix[0] == '\0')
+ return scx_bpf_consume(SHARED_DSQ);
+
+ /*
+ * To demonstrate the use of scx_bpf_consume_task(), implement a silly
+ * selective priority boosting mechanism by scanning SHARED_DSQ looking
+ * for matching comms and consuming them first. This makes a difference
+ * only when dsp_batch is larger than 1.
+ */
+ consumed = false;
+ __COMPAT_DSQ_FOR_EACH(p, SHARED_DSQ, 0) {
+ char comm[sizeof(exp_prefix)];
+
+ memcpy(comm, p->comm, sizeof(exp_prefix) - 1);
+
+ if (!bpf_strncmp(comm, sizeof(exp_prefix),
+ (const char *)exp_prefix) &&
+ __COMPAT_scx_bpf_consume_task(BPF_FOR_EACH_ITER, p)) {
+ consumed = true;
+ __sync_fetch_and_add(&nr_expedited, 1);
+ }
+ }
+
+ return consumed || scx_bpf_consume(SHARED_DSQ);
+}
+
void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
{
struct task_struct *p;
@@ -251,7 +285,7 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
void *fifo;
s32 i, pid;

- if (scx_bpf_consume(SHARED_DSQ))
+ if (consume_shared_dsq())
return;

if (dsp_inf_loop_after && nr_dispatched > dsp_inf_loop_after) {
@@ -302,7 +336,7 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
batch--;
cpuc->dsp_cnt--;
if (!batch || !scx_bpf_dispatch_nr_slots()) {
- scx_bpf_consume(SHARED_DSQ);
+ consume_shared_dsq();
return;
}
if (!cpuc->dsp_cnt)
@@ -445,14 +479,70 @@ void BPF_STRUCT_OPS(qmap_cpu_offline, s32 cpu)
print_cpus();
}

+struct monitor_timer {
+ struct bpf_timer timer;
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __uint(max_entries, 1);
+ __type(key, u32);
+ __type(value, struct monitor_timer);
+} central_timer SEC(".maps");
+
+/*
+ * Dump the currently queued tasks in the shared DSQ to demonstrate the usage of
+ * scx_bpf_dsq_nr_queued() and DSQ iterator. Raise the dispatch batch count to
+ * see meaningful dumps in the trace pipe.
+ */
+static void dump_shared_dsq(void)
+{
+ struct task_struct *p;
+ s32 nr;
+
+ if (!(nr = scx_bpf_dsq_nr_queued(SHARED_DSQ)))
+ return;
+
+ bpf_printk("Dumping %d tasks in SHARED_DSQ in reverse order", nr);
+
+ bpf_rcu_read_lock();
+ __COMPAT_DSQ_FOR_EACH(p, SHARED_DSQ, SCX_DSQ_ITER_REV)
+ bpf_printk("%s[%d]", p->comm, p->pid);
+ bpf_rcu_read_unlock();
+}
+
+static int monitor_timerfn(void *map, int *key, struct bpf_timer *timer)
+{
+ if (print_shared_dsq)
+ dump_shared_dsq();
+
+ bpf_timer_start(timer, ONE_SEC_IN_NS, 0);
+ return 0;
+}
+
s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init)
{
+ u32 key = 0;
+ struct bpf_timer *timer;
+ s32 ret;
+
if (!switch_partial)
__COMPAT_scx_bpf_switch_all();

print_cpus();

- return scx_bpf_create_dsq(SHARED_DSQ, -1);
+ ret = scx_bpf_create_dsq(SHARED_DSQ, -1);
+ if (ret)
+ return ret;
+
+ timer = bpf_map_lookup_elem(&central_timer, &key);
+ if (!timer)
+ return -ESRCH;
+
+ bpf_timer_init(timer, &central_timer, CLOCK_MONOTONIC);
+ bpf_timer_set_callback(timer, monitor_timerfn);
+
+ return bpf_timer_start(timer, ONE_SEC_IN_NS, 0);
}

void BPF_STRUCT_OPS(qmap_exit, struct scx_exit_info *ei)
diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c
index a106ba099e5e..1b8cd2993ee2 100644
--- a/tools/sched_ext/scx_qmap.c
+++ b/tools/sched_ext/scx_qmap.c
@@ -20,7 +20,7 @@ const char help_fmt[] =
"See the top-level comment in .bpf.c for more details.\n"
"\n"
"Usage: %s [-s SLICE_US] [-e COUNT] [-t COUNT] [-T COUNT] [-l COUNT] [-b COUNT]\n"
-" [-d PID] [-D LEN] [-p] [-v]\n"
+" [-P] [-E PREFIX] [-d PID] [-D LEN] [-p] [-v]\n"
"\n"
" -s SLICE_US Override slice duration\n"
" -e COUNT Trigger scx_bpf_error() after COUNT enqueues\n"
@@ -28,6 +28,9 @@ const char help_fmt[] =
" -T COUNT Stall every COUNT'th kernel thread\n"
" -l COUNT Trigger dispatch infinite looping after COUNT dispatches\n"
" -b COUNT Dispatch upto COUNT tasks together\n"
+" -P Print out DSQ content to trace_pipe every second, use with -b\n"
+" -E PREFIX Expedite consumption of threads w/ matching comm, use with -b\n"
+" (e.g. match shell on a loaded system)\n"
" -d PID Disallow a process from switching into SCHED_EXT (-1 for self)\n"
" -D LEN Set scx_exit_info.dump buffer length\n"
" -p Switch only tasks on SCHED_EXT policy intead of all\n"
@@ -61,7 +64,7 @@ int main(int argc, char **argv)

skel = SCX_OPS_OPEN(qmap_ops, scx_qmap);

- while ((opt = getopt(argc, argv, "s:e:t:T:l:b:d:D:pvh")) != -1) {
+ while ((opt = getopt(argc, argv, "s:e:t:T:l:b:PE:d:D:pvh")) != -1) {
switch (opt) {
case 's':
skel->rodata->slice_ns = strtoull(optarg, NULL, 0) * 1000;
@@ -81,6 +84,13 @@ int main(int argc, char **argv)
case 'b':
skel->rodata->dsp_batch = strtoul(optarg, NULL, 0);
break;
+ case 'P':
+ skel->rodata->print_shared_dsq = true;
+ break;
+ case 'E':
+ strncpy(skel->rodata->exp_prefix, optarg,
+ sizeof(skel->rodata->exp_prefix) - 1);
+ break;
case 'd':
skel->rodata->disallow_tgid = strtol(optarg, NULL, 0);
if (skel->rodata->disallow_tgid < 0)
@@ -102,6 +112,10 @@ int main(int argc, char **argv)
}
}

+ if (!__COMPAT_HAS_DSQ_ITER &&
+ (skel->rodata->print_shared_dsq || strlen(skel->rodata->exp_prefix)))
+ fprintf(stderr, "kernel doesn't support DSQ iteration\n");
+
SCX_OPS_LOAD(skel, qmap_ops, scx_qmap, uei);
link = SCX_OPS_ATTACH(skel, qmap_ops);

@@ -109,10 +123,10 @@ int main(int argc, char **argv)
long nr_enqueued = skel->bss->nr_enqueued;
long nr_dispatched = skel->bss->nr_dispatched;

- printf("stats : enq=%lu dsp=%lu delta=%ld reenq=%"PRIu64" deq=%"PRIu64" core=%"PRIu64"\n",
+ printf("stats : enq=%lu dsp=%lu delta=%ld reenq=%"PRIu64" deq=%"PRIu64" core=%"PRIu64" exp=%"PRIu64"\n",
nr_enqueued, nr_dispatched, nr_enqueued - nr_dispatched,
skel->bss->nr_reenqueued, skel->bss->nr_dequeued,
- skel->bss->nr_core_sched_execed);
+ skel->bss->nr_core_sched_execed, skel->bss->nr_expedited);
fflush(stdout);
sleep(1);
}
--
2.44.0


2024-05-02 02:25:04

by Bagas Sanjaya

[permalink] [raw]
Subject: Re: [PATCH 38/39] sched_ext: Documentation: scheduler: Document extensible scheduler class

On Wed, May 01, 2024 at 05:10:13AM -1000, Tejun Heo wrote:
> Add Documentation/scheduler/sched-ext.rst which gives a high-level overview
> and pointers to the examples.
>

The doc LGTM, thanks!

Reviewed-by: Bagas Sanjaya <[email protected]>

--
An old man doll... just what I always wanted! - Clara



2024-05-02 09:48:53

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class



Can you please put your efforts and the touted Google collaboration in
fixing the existing cgroup mess?

2024-05-02 19:20:29

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class

Hello, Peter.

On Thu, May 02, 2024 at 10:48:00AM +0200, Peter Zijlstra wrote:
> Can you please put your efforts and the touted Google collaboration in
> fixing the existing cgroup mess?

I suppose you're referring to Rik's flattened hierarchy patchset.

https://lore.kernel.org/all/[email protected]

Rik spent a lot of time and energy on it and IIRC one of the reasons why it
didn't get pushed further was the lack of any enthusiasm or support from the
upstream community.

We can resurrect the discussion on that patchset but how is that connected
to sched_ext? One of the example schedulers, scx_flatcg, does employ the
same approach with a twist (instead of flattening completely, it builds a
two-level hierarchy so that a leaf cgroup can be treated as a single entity)
to demonstrate the idea but the two projects don't really have much in
common otherwise. Are you saying that Meta and Google working on the
flattened hierarchy is a prerequisite for landing sched_ext?

Thanks.

--
tejun

2024-05-03 08:53:24

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class

On Thu, May 02, 2024 at 09:20:15AM -1000, Tejun Heo wrote:
> Hello, Peter.
>
> On Thu, May 02, 2024 at 10:48:00AM +0200, Peter Zijlstra wrote:
> > Can you please put your efforts and the touted Google collaboration in
> > fixing the existing cgroup mess?
>
> I suppose you're referring to Rik's flattened hierarchy patchset.
>
> https://lore.kernel.org/all/[email protected]
>
> Rik spent a lot of time and energy on it and IIRC one of the reasons why it
> didn't get pushed further was the lack of any enthusiasm or support from the
> upstream community.
>
> We can resurrect the discussion on that patchset but how is that connected
> to sched_ext?

I'm absolutely not taking any of this until at the very least the cgroup
situation that's been created is solved. And even then, I fundamentally
believe the approach to be detrimental to the scheduler eco-system.
Witness the metric ton of toy schedulers written for it, that's all
effort not put into improving the existing code.

You guys Google/Facebook got us the cgroup thing, Google did a lot of
the work for cpu-cgroup, and now you Facebook say you can't live with it
because it's too expensive. Yes Rik did put a lot of effort into it, but
Google shot it down. What am I to do?

You Google/Facebook are touting collaboration, collaborate on fixing it.
Instead of re-posting this over and over. After all, your main
motivation for starting this was the cpu-cgroup overhead.

From where I'm sitting, you created a problem (cpu-cgroup) and now
you're creating an even bigger problem as a work-around. Very much not
appreciated.



2024-05-05 23:31:47

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class

Hello,

On Fri, May 03, 2024 at 10:52:32AM +0200, Peter Zijlstra wrote:
> On Thu, May 02, 2024 at 09:20:15AM -1000, Tejun Heo wrote:
> > We can resurrect the discussion on that patchset but how is that connected
> > to sched_ext?
>
> I'm absolutely not taking any of this until at the very least the cgroup
> situation that's been created is solved. And even then, I fundamentally
> believe the approach to be detrimental to the scheduler eco-system.

Please see below for more on the flattened hierarchy patchset. However, no
matter how that discussion works out, what you seem to be suggesting -
suspending discussion or further push for upstream on sched_ext until a
mostly unrelated work is done - doesn't seem reasonable, especially when
most of the input that you have provided is not constructive.

Even if we agree that, for some reason, the two projects are linked and Meta
and Google owe to push the flattened hierarchy patchset to land sched_ext
upstream, it should be obvious how your proposition puts us in an impossible
spot - A is a prerequisite for B but B isn't going to happen. That's not a
motivating situation for anyone.

If working on the flattened hierarchy patchset is something you want us to
commit to as a gesture of good will, we can surely consider that, but that
shouldn't block further discussions on sched_ext or its upstreaming.

(I reordered your comment about the number of sched_ext schedulers and
developer attention towards the end of the reply to avoid jumping back and
forth between subjects.)

> You guys Google/Facebook got us the cgroup thing, Google did a lot of

We can't divorce ourselves completely from the organizations that we work
for but the above is still a pretty broad stroke. Neither David nor I was
involved in the CPU controller design or implementation and I don't think
it's the same group of people on the Google side either. We sure can discuss
how to proceed on the flattened hierarchy patchset but I don't think the
picture you're painting is a fair depiction of the overall situation.

> the work for cpu-cgroup, and now you Facebook say you can't live with it
> because it's too expensive. Yes Rik did put a lot of effort into it, but
> Google shot it down. What am I to do?

You could have encouraged and guided the project if it felt important
enough. You didn't have to but that was an option.

> You Google/Facebook are touting collaboration, collaborate on fixing it.
> Instead of re-posting this over and over. After all, your main
> motivation for starting this was the cpu-cgroup overhead.

The hierarchical scheduling overhead isn't the main motivation for us. We
can't use the CPU controller for all workloads and while it'd be nice to
improve that, it's pretty easy to work around especially with constantly
increasing number of CPUs. Currently, most sched_ext experiments are without
cgroups and even when cgroups are considered, they're just used as grouping
hints.

In fact, we want to try implementing hierarchical scheduling by dynamically
soft-affinitizing cgroups to CPUs, which would be a bridge too far for the
in-kernel scheduler, at least for now, as it wouldn't be able to handle
custom affinities properly, but it is an idea worth exploring. Enabling
experiments like that is definitely one of our main motivations.

> From where I'm sitting, you created a problem (cpu-cgroup) and now
> you're creating an even bigger problem as a work-around. Very much not
> appreciated.

I have a hard time agreeing. These projects don't overlap all that much.
Their scopes are wildly different. That said, if this is somehow the
blocker, we can talk and try to find a solution but such a solution would
have to be reasonable from our end too. How else would it work?

> Witness the metric ton of toy schedulers written for it, that's all
> effort not put into improving the existing code.

This view works only if you assume that the entire world contains only a
handful of developers who can work on schedulers. The only way that would be
the case is if the barrier of entry is raised unreasonably high. Sometimes a
high barrier of entry can't be avoided or is beneficial. However, if it's
pushed up high enough to leave only a handful of people to work on an area
as large as scheduling, something probably is wrong.

You know better than anyone that there's no such thing as the perfect
scheduler for all, or even most, workloads. There are too many interacting
factors and second-order effects for a single implementation, no matter how
advanced, to be perfect or even great for the multitudes of situations that
scheduling encounters. With hardware and workloads becoming more complex,
the situation isn't getting any better. This partially explains why we can
easily achieve significantly better behaviors for specific workloads even
with a toy scheduler which is just there to demonstrate an idea.

The built-in scheduler has to be good enough for everyone, and, thanks to
the effort of you and the other sched maintainers, it serves that role
admirably. However, that requirement also comes with stringent constraints.
Radical ideas are difficult to play with. Each change has to make some sense
for every use case. Nothing drastic can be introduced unless the future path
can reasonably be forecast. So, development efforts must be highly
orchestrated and stay consistent, which justifies a higher barrier of entry
and strict control.

Yet, the many different ways in which even simple schedulers can demonstrate
sometimes significant behavior and performance benefits for specific
workloads suggest that there is a lot of low-hanging fruit in the area.
Fruit that we can't easily reach from our current local optimum. A single
implementation which has to satisfy all users all the time is unlikely to be
an effective vehicle for mapping out such a landscape.

I believe we agree that we want more people contributing to the scheduling
area. We need that. However, I have a hard time seeing how that would be
achieved in the current structure. Most people can't afford to sink six
months, a year, two years into a project only to eventually be nacked
without any way to deploy and prove their ideas and efforts. Unfortunately,
that is where we end up today in many cases.

There are many smart people with bright ideas just outside the fence who are
eager to develop, tune and even just play with schedulers. I believe they
will flourish when they can work in an environment where scheduling
experimentation is accessible and encouraged. In fact, we are already seeing
that. Out of the four non-trivial sched_ext schedulers, three are either
primarily driven by or have significant contributions from people who had
not and wouldn't have worked on the in-kernel schedulers at all.

So, here's my proposition. Let's please open it up. sched_ext hooks into
sched infra but the contact surface is limited and we'll try our best to
stay out of your way. I can't promise that it won't ever get in your way,
but, if it ever does, just ping me and David. Resolving such situations
would be our highest priority. Let us and others try out crazy ideas and
find out what works.

Thanks.

--
tejun

2024-05-06 18:58:27

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class

On Fri, 2024-05-03 at 10:52 +0200, Peter Zijlstra wrote:
> On Thu, May 02, 2024 at 09:20:15AM -1000, Tejun Heo wrote:
> > Hello, Peter.
> >
> > On Thu, May 02, 2024 at 10:48:00AM +0200, Peter Zijlstra wrote:
> > > Can you please put your efforts and the touted Google
> > > collaboration in
> > > fixing the existing cgroup mess?
> >
> > I suppose you're referring to Rik's flattened hierarchy patchset.
> >
> >  
> > https://lore.kernel.org/all/[email protected]
> >
>
> You guys Google/Facebook got us the cgroup thing, Google did a lot of
> the work for cpu-cgroup, and now you Facebook say you can't live with
> it
> because it's too expensive. Yes Rik did put a lot of effort into it,
> but
> Google shot it down. What am I to do?

I believe the issues that Paul pointed out with my
flattened cgroup code are fixable. I ended up not
getting back to this code because it took me a few
months to think of ways to fix the issues Paul found,
and by then I had moved on to other projects.

For reference, Paul found these two (very real) issues
with my implementation.

1) Thundering herd problem. If many tasks in a low
priority cgroup wake up at the same time, they can
end up swamping a CPU.

I believe this can be solved with the same idea
I had for reimplementing CONFIG_CFS_BANDWIDTH.
Specifically, the code that determines the time
slice length for a task already has a way to
determine whether a CPU is "overloaded", and
time slices need to be shortened. Once we reach
that situation, we can place woken up tasks on
a secondary heap of per-cgroup runqueues, from
which we do not directly run tasks, but pick
the lowest vruntime task from the lowest vruntime
cgroup and put that on the main runqueue, if
the previously running task has a vruntime that
is higher than that of a task in the secondary
group. If a task is woken up in a cgroup that
already has tasks on that secondary queue, we
wake up the task onto that secondary queue.

This means on overloaded CPUs, we move back to
a task selection mechanism closer to what we
currently have, while in the non-overloaded
situation we use a flat runqueue.

This same scheme could be used to implement
CFS bandwidth control. A task belonging to a
throttled group would be placed on the group's
queue, not the CPU's flat runqueue.

2) The vruntime for a task can be advanced by way
too much at once. If we have tasks A & B running,
and task B has a priority that is 1/100th of that
of task A, its vruntime would be advanced 100x
as much as task A's, when running a time slice
of the same length.

This creates a big issue if we get a wakeup of
task C, at the same priority as task B, and then
task A goes to sleep. Due to the very far advanced
vruntime of task B, task C could get to monopolize
the CPU for a considerable amount of time, and
task B could get starved.

A potential fix for this is to never account more
than the maximum time slice length at a time, while
any excess delta_exec time for the task gets remembered.

At pick_next_entity time, the scheduler can see that
task B has a lot of delta_exec time left, and account
up to the maximum slice length to the task's vruntime,
and place it back in the queue if the next task now has
a lower vruntime.

For a steady state of a high priority task A and a low
priority task B, this makes pick_next_task more expensive,
but when task A disappears and task C appears, CPU time
will continue to be fair between them.

Limiting the total weight of tasks on the flat runqueue,
using the mechanism for thundering herd and CFS bandwidth
outlined above, should keep this overhead bounded to
something reasonable.
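
A toy model of the capping idea from point 2, to make
it concrete (illustrative only, not kernel code):

#include <stdint.h>
#include <stdio.h>

#define MAX_SLICE_NS 4000000ULL		/* assumed 4ms maximum slice */

struct toy_task {
	uint64_t vruntime;	/* weighted virtual runtime */
	uint64_t pending;	/* weighted delta_exec not yet charged */
	uint64_t weight;	/* load weight; 1024 is roughly nice 0 */
};

/* Record execution time; a low weight inflates the weighted amount owed. */
static void account_exec(struct toy_task *t, uint64_t delta_ns)
{
	t->pending += delta_ns * 1024 / t->weight;
}

/*
 * At pick time, advance vruntime by at most one maximum slice and keep the
 * remainder pending, so a single slice can't push the task out forever.
 */
static void charge_capped(struct toy_task *t)
{
	uint64_t charge = t->pending > MAX_SLICE_NS ? MAX_SLICE_NS : t->pending;

	t->vruntime += charge;
	t->pending -= charge;
}

int main(void)
{
	/* A task with ~1/100th the weight of a nice-0 task runs a 4ms slice. */
	struct toy_task b = { .vruntime = 0, .pending = 0, .weight = 10 };

	account_exec(&b, MAX_SLICE_NS);	/* owes ~410ms of weighted time */
	charge_capped(&b);		/* only 4ms is charged for now  */
	printf("vruntime %llu, still pending %llu\n",
	       (unsigned long long)b.vruntime,
	       (unsigned long long)b.pending);
	return 0;
}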

Does the above sound like it would work?

Does it sound like code that you would be ok with merging?

Is it a large enough improvement over the current hierarchical
runqueue that it would be worth doing?

This would be a fairly large project, so we should probably discuss
some of the details before investing too much time in it.

--
All Rights Reversed.

2024-05-07 19:33:21

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class

Hello, Rik.

On Mon, May 06, 2024 at 02:47:47PM -0400, Rik van Riel wrote:
> I believe the issues that Paul pointed out with my flattened cgroup code
> are fixable. I ended up not getting back to this code because it took me a
> few months to think of ways to fix the issues Paul found, and by then I
> had moved on to other projects.
>
> For reference, Paul found these two (very real) issues with my
> implementation.
>
> 1) Thundering herd problem. If many tasks in a low priority cgroup wake
> up at the same time, they can end up swamping a CPU.

The way that scx_flatcg (which seems broken right now, will fix it) works
around this problem is by always doing two-level scheduling. ie. The top
level rbtree hosts the tasks in the root cgroup and all the active cgroups,
where each cgroup is scheduled according to its current flattened
hierarchical share. This seems to work pretty well, as what becomes really
expensive is the repeated nesting, which can easily go 6+ levels deep.
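
To illustrate what "flattened hierarchical share" means here, a toy model
(illustrative only, not the actual scx_flatcg code):

#include <stdio.h>

struct toy_cgroup {
	struct toy_cgroup *parent;
	double weight;			/* this cgroup's cpu.weight */
	double children_weight_sum;	/* total weight of its active children */
};

/*
 * A cgroup's flattened share is the product of its fraction of the parent's
 * active weight at each level up to the root. All active cgroups then compete
 * on one top-level rbtree using this single number.
 */
static double flattened_share(const struct toy_cgroup *cg)
{
	double share = 1.0;

	for (; cg->parent; cg = cg->parent)
		share *= cg->weight / cg->parent->children_weight_sum;
	return share;
}

int main(void)
{
	struct toy_cgroup root = { .parent = NULL, .weight = 0,
				   .children_weight_sum = 300 };
	struct toy_cgroup mid  = { .parent = &root, .weight = 100,
				   .children_weight_sum = 150 };
	struct toy_cgroup leaf = { .parent = &mid, .weight = 50,
				   .children_weight_sum = 0 };

	/* leaf gets (50/150) * (100/300) = 1/9 of the CPU when all are busy */
	printf("leaf share: %.4f\n", flattened_share(&leaf));
	return 0;
}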

This doesn't solve the thundering herd problem completely but shifts it one
level. ie. Thundering herds of threads can be handled easily. However,
thundering herds of cgroups can still cause unfairness. I can imagine cases
where this can lead to scheduling issues but they all seem pretty convoluted
and artificial. Sure, somebody who's intentionally adversarial can cause
temporary issues but perfect isolation of adversarial actors isn't what
cgroups can or should practically target.

Even in the case this is an actual issue, we can solve it by limiting
nesting. cgroups already have delegation boundaries. Maybe it needs to be
made more explicit but one solution could be adding a nesting layer only on
delegation boundaries so that misbehaviors are better contained within each
delegation domain.

> I believe this can be solved with the same idea I had for
> reimplementing CONFIG_CFS_BANDWIDTH. Specifically, the code that
> determines the time slice length for a task already has a way to
> determine whether a CPU is "overloaded", and time slices need to be
> shortened. Once we reach that situation, we can place woken up tasks on
> a secondary heap of per-cgroup runqueues, from which we do not directly
> run tasks, but pick the lowest vruntime task from the lowest vruntime
> cgroup and put that on the main runqueue, if the previously running

When overloaded, are the cgroups being put on a single rbtree? If so, they'd
be using flattened shares, right? I wonder whether what you're suggesting for
the overloaded case is pretty similar to what flatcg is doing, plus avoiding
one level of indirection while not overloaded.

Thanks.

--
tejun

2024-05-07 19:50:19

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class

On Tue, 2024-05-07 at 09:33 -1000, Tejun Heo wrote:
> On Mon, May 06, 2024 at 02:47:47PM -0400, Rik van Riel wrote:
>
>
> >    I believe this can be solved with the same idea I had for
> >    reimplementing CONFIG_CFS_BANDWIDTH. Specifically, the code that
> >    determines the time slice length for a task already has a way to
> >    determine whether a CPU is "overloaded", and time slices need to
> > be
> >    shortened. Once we reach that situation, we can place woken up
> > tasks on
> >    a secondary heap of per-cgroup runqueues, from which we do not
> > directly
> >    run tasks, but pick the lowest vruntime task from the lowest
> > vruntime
> >    cgroup and put that on the main runqueue, if the previously
> > running
>
> When overloaded, are the cgroups being put on a single rbtree? If so,
> they'd
> be using flattened shares, right? I wonder what you're suggesting for
> the
> overloaded case is pretty simliar to what flatcg is doing plus
> avoiding one
> level of indirection while not overloaded.

It does indeed sound like flatcg is doing almost the same thing.

I'm not entirely sure what to make of the fact that we both
came up with the same solution to the problem. I suppose it's
a nice improvement over a fully hierarchical solution, especially
when it comes to overhead, but there is still a fair amount of
complexity left.

I don't know if there is a simpler solution to this problem.

There might not be.

--
All Rights Reversed.

2024-05-09 07:39:48

by Changwoo Min

[permalink] [raw]
Subject: Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class

Hello,

I'd like to reaffirm Valve and Igalia's backing for the sched_ext
proposal.

Let's delve into the context first. Valve, in collaboration with
Igalia and other firms, has been dedicated to enhancing the
gaming experience on Linux. Our endeavor involves utilizing
a standard Linux distribution (SteamOS) to execute unaltered
Windows games on the Linux kernel with the aid of Wine and
various other software components. The overarching objective is
to refine the Linux desktop environment for gaming and
interactive purposes. As part of our commitment, we adhere to an
"upstream everything" policy, contributing to the Linux kernel
and numerous open-source projects. For those interested, you can
explore the details of our contributions through the following
link:


https://osseu2023.sched.com/event/1Qv8y/how-steamos-is-contributing-to-the-linux-ecosystem-alberto-garcia-igalia


From our perspective, sched_ext holds significant promise and
utility, particularly in facilitating rapid experimentation with
new ideas. Our experimental ideas may or may not align with the
existing scheduler designs, be it CFS or EEVDF.

Specifically, our research into the characteristics of gaming
workloads for schedulers has unveiled intriguing insights that
could inform better scheduling decisions. For instance, tasks
within the gaming software stack, such as game engines, Wine, and
graphics drivers, often exhibit very short duration when
scheduled, necessitating frequent scheduling activities.
Moreover, multiple tasks across software layers collaborate to
complete a single application-level task, forming task chains.
Inadequate scheduling decisions within these chains can lead to
high tail latency, commonly known as "stuttering" in the gaming
community.


> Witness the metric ton of toy schedulers written for it, that's all
> effort not put into improving the existing code.

While these properties offer valuable insights for improving
scheduling decisions for gaming workloads, their applicability to
general-purpose schedulers like EEVDF remains uncertain. The most
effective means to evaluate their broader utility is through
practical experimentation. In this regard, sched_ext provides an
excellent platform for rapid testing of new ideas.

One may ask: why not just experiment out of tree? In reality,
we can’t just trivially patch the _general-purpose_ EEVDF (and
CFS too) to be better for _all_ use cases, especially when
upstream has resisted taking on tons of niche complexity in the
scheduler. It is a very hard problem, and we believe having
sched_ext upstream for more users/distros will encourage more
progress. Our
case of Linux gaming demonstrates that working on the existing
code is neither always possible nor effective. Further details of
our findings can be found through the following link:


https://ossna2024.sched.com/event/1aBOT/optimizing-scheduler-for-linux-gaming-changwoo-min-igalia


> situation that's been created is solved. And even then, I fundamentally
> believe the approach to be detrimental to the scheduler eco-system.

Contrary to the notion that sched_ext might prove detrimental to
the scheduler ecosystem, we hold a different view. The successful
implementation of sched_ext enriches the scheduler community with
fresh insights, ideas, and code. For instance, our adoption of
a virtual deadline-based approach in designing LAVD
(Latency-criticality Aware Virtual Deadline), our sched_ext-based
scheduler for gaming, represents a deliberate design choice.
Aligning our heuristics and findings with EEVDF through a similar
virtual deadline-based approach enables us to contribute our
discoveries to EEVDF in the future once proven to be more
universally applicable. Notably, the concept of "latency
criticality" in LAVD holds promise beyond gaming workloads,
potentially benefiting various interactive workloads. If you are
interested, you can find the source code of LAVD at the
following link:

https://github.com/sched-ext/scx/tree/main/scheds/rust/scx_lavd
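
To give a flavor of the approach, here is a toy
illustration of a latency-criticality weighted virtual
deadline. The names and the formula below are made up
for illustration and are not the actual LAVD heuristics:

#include <stdint.h>
#include <stdio.h>

struct toy_task {
	double avg_runtime_ms;	/* average runtime per schedule */
	double wake_freq;	/* how often it wakes others, per sec */
	double woken_freq;	/* how often it gets woken up, per sec */
	uint64_t vtime;		/* task's virtual time, in ns */
};

/* Short runtimes plus high wake/woken frequencies => latency-critical. */
static double latency_criticality(const struct toy_task *t)
{
	return (t->wake_freq + t->woken_freq) / (1.0 + t->avg_runtime_ms);
}

/* More latency-critical tasks receive an earlier virtual deadline. */
static uint64_t virtual_deadline(const struct toy_task *t, uint64_t slice_ns)
{
	return t->vtime + (uint64_t)(slice_ns / (1.0 + latency_criticality(t)));
}

int main(void)
{
	struct toy_task game  = { .avg_runtime_ms = 0.2, .wake_freq = 500.0,
				  .woken_freq = 500.0, .vtime = 0 };
	struct toy_task batch = { .avg_runtime_ms = 10.0, .wake_freq = 1.0,
				  .woken_freq = 1.0, .vtime = 0 };

	/* The short-running, chatty task gets the much earlier deadline. */
	printf("game deadline:  %llu\n",
	       (unsigned long long)virtual_deadline(&game, 4000000));
	printf("batch deadline: %llu\n",
	       (unsigned long long)virtual_deadline(&batch, 4000000));
	return 0;
}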


In essence, I envision sched_ext and its community as an
incubator for new ideas, invigorating the scheduler ecosystem.
Some of the "toy" schedulers may evolve into specialized
solutions tailored for specific problem domains, such as HPC,
AI/ML, or gaming. Lessons learned from these experimental
schedulers will invariably contribute, directly or indirectly, to
the evolution of the EEVDF scheduler.

Sincerely,
Changwoo Min

2024-05-10 18:26:18

by Peter Jung

[permalink] [raw]
Subject: Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class


Hi everyone,

We (CachyOS [0]) want to support the proposal of sched_ext.
CachyOS is an Arch Linux based distribution which improves performance,
throughput and interactivity on the desktop as well as on servers. We have
a big community which likes testing new features in the kernel, GPU drivers,
frequency drivers, new locking technologies and schedulers.
[0]: https://cachyos.org/

For CachyOS, schedulers have always been a key matter. We have numerous
kernel variants with different schedulers so that everyone can be satisfied.
So far, schedulers in the kernel have faced one problem, namely: 1 kernel =
1 scheduler. It was impossible to change the CPU scheduler without restarting
the computer and booting another kernel. scx-scheds changes this approach
simply and very successfully, and allows you to change schedulers at runtime.
The approach with a single default scheduler in the kernel also leads to
a problematic situation, because in our opinion it is impossible to develop
a single scheduler that is the optimal solution for all possible tasks. Here
too, scx-scheds makes this problem obsolete.

The EEVDF scheduler gives a good overall experience, but it currently lacks
good interactivity under load, in specific scenarios like gaming, and in
latency. Rusty currently gives CachyOS a very good replacement for desktop
usage, and we are planning to provide the Rusty scheduler as the default in
the future. The interactivity of Rusty is dramatically better than EEVDF,
especially when a heavy workload is running in the background, e.g. during
compilation. Using the desktop under such a workload can be very challenging
with the default scheduler, while with Rusty it is not even noticeable that
a compilation is running. In gaming scenarios Rusty also seems to provide
excellent performance.

The LAVD scheduler is integrated as the default in our handheld variant and
shows very impressive results in frame times as well as 1% lows. This gives
our users on the Handheld Edition a better experience.

It is also worth mentioning that our community takes an active part in
sched-ext testing and regularly reports bugs and suggestions for changes.
Cooperation with the developers has also been excellent - they immediately
make corrections if there are any regressions.

Regards,

Peter Jung
CachyOS

2024-05-13 09:09:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class

On Sun, May 05, 2024 at 01:31:26PM -1000, Tejun Heo wrote:

> > You Google/Facebook are touting collaboration, collaborate on fixing it.
> > Instead of re-posting this over and over. After all, your main
> > motivation for starting this was the cpu-cgroup overhead.
>
> The hierarchical scheduling overhead isn't the main motivation for us. We
> can't use the CPU controller for all workloads and while it'd be nice to
> improve that,

Hurmph, I had the impression from the earlier threads that this ~5%
cgroup overhead was most definitely a problem and a motivator for all
this.

The overhead was prohibitive, it was claimed, and you needed a solution.
Did not previous versions use this very argument in order to push for
all this?

By improving the cgroup mess -- and I very much agree that the cgroup thing
is not very nice -- this whole argument goes away and we all get a better
cgroup implementation.

> This view works only if you assume that the entire world contains only a
> handful of developers who can work on schedulers. The only way that would be
> the case is if the barrier of entry is raised unreasonably high. Sometimes a
> high barrier of entry can't be avoided or is beneficial. However, if it's
> pushed up high enough to leave only a handful of people to work on an area
> as large as scheduling, something probably is wrong.

I've never really felt there were too few sched patches to stare at on
any one day (quite the opposite on many days in fact).

There have also always been plenty out of tree scheduler patches --
although I rarely if ever have time to look at them.

Writing a custom scheduler isn't that hard, simply ripping out
fair_sched_class and replacing it with something simple really isn't
*that* hard.

The only really hard requirement is respecting affinities, you'll crash
and burn real hard if you get that wrong (think of all the per-cpu
kthreads that hard rely on the per-cpu-ness of them).

But you can easily ignore cgroups, uclamp and a ton of other stuff and
still boot and play around.
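
To illustrate how little the hard part is: even a dumb global FIFO needs to
do no more than skip tasks that aren't allowed on the local CPU (toy code,
names made up, obviously not the actual kernel interface):

#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 4

struct toy_task {
	const char *name;
	bool allowed[NR_CPUS];	/* stand-in for the task's cpumask */
};

/* Pick the first queued task that is allowed to run on @cpu. */
static struct toy_task *pick_next(struct toy_task **fifo, int nr, int cpu)
{
	int i;

	for (i = 0; i < nr; i++)
		if (fifo[i] && fifo[i]->allowed[cpu])
			return fifo[i];
	return NULL;	/* nothing runnable here; the CPU goes idle */
}

int main(void)
{
	struct toy_task ksoftirqd1 = { "ksoftirqd/1", { false, true, false, false } };
	struct toy_task worker     = { "worker",      { true,  true, true,  true  } };
	struct toy_task *fifo[] = { &ksoftirqd1, &worker };

	/* CPU 0 must skip the per-CPU kthread pinned to CPU 1. */
	printf("cpu0 runs: %s\n", pick_next(fifo, 2, 0)->name);
	printf("cpu1 runs: %s\n", pick_next(fifo, 2, 1)->name);
	return 0;
}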

> I believe we agree that we want more people contributing to the scheduling
> area.

I think therein lies the rub -- contribution. If we were to do this
thing, random loadable BPF schedulers, then how do we ensure people will
contribute back?

That is, from where I am sitting I see $vendor mandate their $enterprise
product needs their $BPF scheduler. At which point $vendor will have no
incentive to ever contribute back.

And customers of $vendor that want to run additional workloads on
their machine are then stuck with that scheduler, irrespective of it
being suitable for them or not. This is not a good experience.

So I don't at all mind people playing around with schedulers -- they can
do so today, there are a ton of out of tree patches to start or learn
from, or like I said, it really isn't all that hard to just rip out fair
and write something new.

Open source, you get to do your own thing. Have at.

But part of what made Linux work so well, is in my opinion the GPL. GPL
forces people to contribute back -- to work on the shared project. And I
see the whole BPF thing as a run-around on that.

Even the large cloud vendors and service providers (Amazon, Google,
Facebook etc.) contribute back because of rebase pain -- as you well
know. The rebase pain offsets the 'TIVO hole'.

But with the BPF muck; where is the motivation to help improve things?

Keeping a rando github repo with BPF schedulers is not contributing.
That's just a repo with multiple out of tree schedulers to be ignored.
Who will put in the effort of upstreaming things if they can hack up a
BPF and throw it over the wall?

So yeah, I'm very much NOT supportive of this effort. From where I'm
sitting there is simply not a single benefit. You're not making my life
better, so why would I care?

How does this BPF muck translate into better quality patches for me?

2024-05-13 18:27:02

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class

On Mon, 13 May 2024 10:03:59 +0200
Peter Zijlstra <[email protected]> wrote:

> > I believe we agree that we want more people contributing to the scheduling
> > area.
>
> I think therein lies the rub -- contribution. If we were to do this
> thing, random loadable BPF schedulers, then how do we ensure people will
> contribute back?

Hi Peter,

I'm somewhat agnostic to sched_ext itself, but I have been an advocate
for a plugable scheduler infrastructure. And we are seriously looking
at adding it to ChromeOS.

>
> That is, from where I am sitting I see $vendor mandate their $enterprise
> product needs their $BPF scheduler. At which point $vendor will have no
> incentive to ever contribute back.

Believe me, they already have their own scheduler, and because it's so
different, it's very hard to contribute back.

>
> And customers of $vendor that want to run additional workloads on
> their machine are then stuck with that scheduler, irrespective of it
> being suitable for them or not. This is not a good experience.

And $vendor usually has a unique workload, such that their changes will
likely cause regressions in other workloads, making it even harder to
contribute back.

>
> So I don't at all mind people playing around with schedulers -- they can
> do so today, there are a ton of out of tree patches to start or learn
> from, or like I said, it really isn't all that hard to just rip out fair
> and write something new.

For cloud servers, I bet a lot of schedulers are not public. Although,
my company tries to publish the schedulers they use.

>
> Open source, you get to do your own thing. Have at.
>
> But part of what made Linux work so well, is in my opinion the GPL. GPL
> forces people to contribute back -- to work on the shared project. And I
> see the whole BPF thing as a run-around on that.
>
> Even the large cloud vendors and service providers (Amazon, Google,
> Facebook etc.) contribute back because of rebase pain -- as you well
> know. The rebase pain offsets the 'TIVO hole'.

From what I understand (I don't work on production, but Chromebooks), a
lot of changes cannot be contributed back because their updates are far
from what is upstream. Having a plugable scheduler would actually allow
them to contribute *more*.

>
> But with the BPF muck; where is the motivation to help improve things?

For the same reasons you mention about GPL and why it works.
Collaboration. Sharing ideas helps everyone. If there's some secret
sauce scheduler then they would likely just replace the scheduler, as
it's more performant. I don't believe it would be worthwhile to use BPF
for that purpose.

>
> Keeping a rando github repo with BPF schedulers is not contributing.

Agreed, and I would guess having them in the Linux kernel tree would be
more beneficial.

> That's just a repo with multiple out of tree schedulers to be ignored.
> Who will put in the effort of upsteaming things if they can hack up a
> BPF and throw it over the wall?

If there's a place in the Linux kernel tree, I'm sure there would be
motivation to place it there. Having it in the kernel proper does give
more visibility of code, and therefore enhancements to that code. This
was the same rationale for putting perf into the kernel proper.

>
> So yeah, I'm very much NOT supportive of this effort. From where I'm
> sitting there is simply not a single benefit. You're not making my life
> better, so why would I care?
>
> How does this BPF muck translate into better quality patches for me?

Here's how we will be using it (we will likely be porting sched_ext to
ChromeOS regardless of its acceptance).

Doing testing of scheduler changes in the field is extremely time
consuming and complex. We tested EEVDF vs CFS by backporting EEVDF to
5.15 (as that is the kernel version we are using on the chromebooks we
were testing on), and then we needed to add a user space "switch" to
change the scheduler. Note, this also risks introducing a bug while adding
these changes. Then we push the kernel out and start our experiment,
which enables our feature for a small percentage of users and slowly
increases that number until we have enough for a statistically
meaningful result.

What sched_ext would give us is an easy way to try different scheduling
algorithms and get feedback much more quickly. Once we determine a solution
that improves things, we would then spend the time to implement it in
the scheduler, and yes, send it upstream.

To me, sched_ext should never be the final solution, but it can be
extremely useful in testing various changes quickly in the field, which
to me would encourage more contributions.

-- Steve

2024-05-13 20:36:27

by Andrea Righi

[permalink] [raw]
Subject: Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class

On Wed, May 01, 2024 at 05:09:35AM -1000, Tejun Heo wrote:
..
> - Ubuntu is considering to include sched_ext in the upcoming 24.10 release.
> Andrea Righi of Canonical has been actively working on a userspace
> scheduling framework since the end of the last year.
>
> https://github.com/sched-ext/scx/tree/main/scheds/rust/scx_rustland
> https://discourse.ubuntu.com/t/introducing-kernel-6-8-for-the-24-04-noble-numbat-release/41958

Regarding this topic, I can confirm that Ubuntu intends to provide
official support for a sched_ext kernel in the 24.10 release.

We still have to finalize all the specific details of the plan, but
essentially, there will be a separate "derivative" kernel featuring
sched_ext, alongside a user-space scx package(s) for the schedulers and
tools (which, ideally, we would also like to upstream into Debian).

-Andrea

2024-05-14 00:07:31

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class

On 05/13/24 14:26, Steven Rostedt wrote:
> On Mon, 13 May 2024 10:03:59 +0200
> Peter Zijlstra <[email protected]> wrote:
>
> > > I believe we agree that we want more people contributing to the scheduling
> > > area.
> >
> > I think therein lies the rub -- contribution. If we were to do this
> > thing, random loadable BPF schedulers, then how do we ensure people will
> > contribute back?
>
> Hi Peter,
>
> I'm somewhat agnostic to sched_ext itself, but I have been an advocate
> for a plugable scheduler infrastructure. And we are seriously looking
> at adding it to ChromeOS.
>
> >
> > That is, from where I am sitting I see $vendor mandate their $enterprise
> > product needs their $BPF scheduler. At which point $vendor will have no
> > incentive to ever contribute back.
>
> Believe me they already have their own scheduler, and because its so
> different, it's very hard to contribute back.
>
> >
> > And customers of $vendor that want to run additional workloads on
> > their machine are then stuck with that scheduler, irrespective of it
> > being suitable for them or not. This is not a good experience.
>
> And $vendor usually has a unique workload that their changes will
> likely cause regressions in other workloads, making it even harder to
> contribute back.
>
> >
> > So I don't at all mind people playing around with schedulers -- they can
> > do so today, there are a ton of out of tree patches to start or learn
> > from, or like I said, it really isn't all that hard to just rip out fair
> > and write something new.
>
> For cloud servers, I bet a lot of schedulers are not public. Although,
> my company tries to publish the schedulers they use.
>
> >
> > Open source, you get to do your own thing. Have at.
> >
> > But part of what made Linux work so well, is in my opinion the GPL. GPL
> > forces people to contribute back -- to work on the shared project. And I
> > see the whole BPF thing as a run-around on that.
> >
> > Even the large cloud vendors and service providers (Amazon, Google,
> > Facebook etc.) contribute back because of rebase pain -- as you well
> > know. The rebase pain offsets the 'TIVO hole'.
>
> From what I understand (I don't work on production, but Chromebooks), a
> lot of changes cannot be contributed back because their updates are far
> from what is upstream. Having a plugable scheduler would actually allow
> them to contribute *more*.
>
> >
> > But with the BPF muck; where is the motivation to help improve things?
>
> For the same reasons you mention about GPL and why it works.
> Collaboration. Sharing ideas helps everyone. If there's some secret
> sauce scheduler then they would likely just replace the scheduler, as
> its more performant. I don't believe it would be worth while to use BPF
> for that purpose.
>
> >
> > Keeping a rando github repo with BPF schedulers is not contributing.
>
> Agreed, and I would guess having them in the Linux kernel tree would be
> more beneficial.
>
> > That's just a repo with multiple out of tree schedulers to be ignored.
> > Who will put in the effort of upsteaming things if they can hack up a
> > BPF and throw it over the wall?
>
> If there's a place in the Linux kernel tree, I'm sure there would be
> motivation to place it there. Having it in the kernel proper does give
> more visibility of code, and therefore enhancements to that code. This
> was the same rationale for putting perf into the kernel proper.
>
> >
> > So yeah, I'm very much NOT supportive of this effort. From where I'm
> > sitting there is simply not a single benefit. You're not making my life
> > better, so why would I care?
> >
> > How does this BPF muck translate into better quality patches for me?
>
> Here's how we will be using it (we will likely be porting sched_ext to
> ChromeOS regardless of its acceptance).
>
> Doing testing of scheduler changes in the field is extremely time
> consuming and complex. We tested EEVDF vs CFS by backporting EEVDF to
> 5.15 (as that is the kernel version we are using on the chromebooks we
> were testing on), and then we need to add a user space "switch" to
> change the scheduler. Note, this also risks causing a bug in adding
> these changes. Then we push the kernel out, and then start our
> experiment that enables our feature to a small percentage, and slowly
> increases the number of users until we have a enough for a statistical
> result.
>
> What sched_ext would give us is a easy way to try different scheduling
> algorithms and get feedback much quicker. Once we determine a solution
> that improves things, we would then spend the time to implement it in
> the scheduler, and yes, send it upstream.
>
> To me, sched_ext should never be the final solution, but it can be
> extremely useful in testing various changes quickly in the field. Which
> to me would encourage more contributions.

I really don't think the problems we have are because of EEVDF vs CFS vs
anything else. Other major OSes have one scheduler, but where they excel is in
providing better QoS interfaces and mechanisms to handle specific scenarios
that Linux lacks.

The confusion I see again and again over the years is the fragmentation of
the Linux ecosystem: app writers don't know how to do things properly on Linux
vs other OSes. Note that our CONFIG system is part of this fragmentation.

The addition of more flavours - which will inevitably lead to custom QoS
specific to a given scheduler, and to libraries built on top of it that
require that particular extension to be available - is a recipe for more
confusion and fragmentation. Not to mention that big players are likely to
take over, and I wouldn't be surprised if new business models start to spring
up on top of that. Add to the lot the potential security issues, with the
ease of luring people into downloading a sneaky sched extension that makes
great promises but is full of malware (more dangerous given the greater power
of BPF/sudo misused).

I really don't buy the rapid development aspect either. The scheduler was
heavily influenced by early contributors who came from the server market,
which had (a few) very specific workloads they needed to optimize for, and
throughput carried a heavier weight than latency. Fast forward to now and
things are different. Even in the server market, latency/responsiveness has
become more important. Power and thermal are important on a larger class of
systems now too - I'd dare say even in the server market. How do you know
when it's okay for an app/task to consume a lot of power and when it is not?
Hint hint, you can't, unless someone in userspace tells you. Similarly for
latency vs throughput. What is the correct way to write an application to
provide this info? Then we can ask what is missing in the scheduler to enable
this.

Note the original min/wakeup_granularity_ns, latency_ns etc were tuned by
default for throughput by the way (server market bias). You can manipulate
those and get better latencies.

And this brings me to the major point: we really need to stop thinking that
we must improve everything at the system level. Workloads need to evolve to
take the best out of systems, and we need new libraries for performance and
power management. And this means they need new APIs and libraries to be able
to do a better job and scale well.

I agree with Peter that it is not hard to write something to make a specific
workload better. But what we really need is to enable workloads to be written
better and be more portable, to make the best of the hardware they run on,
AND coexist with other workloads. For example, how do you write a good
multi-threaded application that can scale well across systems (including
big.LITTLE) and not trip over other workloads stealing resources sometimes?
You need something like this

https://developer.apple.com/documentation/DISPATCH

which has a linux port

https://github.com/apple/swift-corelibs-libdispatch

not a new scheduler.

How do you write an app that can manage bad thermal situations?

https://developer.android.com/games/optimize/adpf/thermal

POSIX is dormant, and every OS has to wing new interfaces to deal with the new
realities. And I don't see a lot of these discussions. Linux is lagging behind
in general in this aspect. The trend I see is "how do I make existing stuff
better", and believe me, I've seen strcmp(task->comm, ...) used to hand-pick
things. I am sure we'll end up going down this path if we let things loose.

So I am against any custom extension. I think it all has to be part of the
kernel tree and adhere to all of its supported interfaces, which is what
I think we really ought to focus on evolving and improving. This is the
biggest friction point IMO, not the scheduler algorithm. If the latter needs
to change, it needs to be as a result of this friction - which is what EEVDF
came out of, to my understanding: to make implementing a latency interface
easier. But Vincent had a working implementation with CFS too, which I think
would have worked fine by the way.

I do hope we can reconsider some of our default behaviors though (that bias to
perf and throughput specifically).

FWIW, IMO the biggest issues I see in the scheduler are that testability and
debuggability are hard. I think BPF can be a good fit for that. For the latter
I started this project, yet I am still trying to figure out how to add tracers
for the difficult paths to help people more easily report when a bad decision
has happened and to provide more info about the internal state of the
scheduler, in the hope of accelerating the process of finding solutions.
I think people are getting stuck explaining why things are failing, which
makes finding a common solution hard if not impossible. We need a better way
to understand the problems people are seeing:

https://github.com/qais-yousef/sched-analyzer

Similar methodology can be used to create a BPF based sched test framework.
I don't have cycles to start this, but hope to if no one beats me to it.
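
For flavor, this is the kind of minimal BPF-side instrumentation I have in
mind - a rough sketch that records wakeup-to-run latency per task. It is
purely illustrative, not part of sched-analyzer, and assumes a typical
libbpf + vmlinux.h (CO-RE) build:

// SPDX-License-Identifier: GPL-2.0
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 10240);
	__type(key, u32);	/* pid */
	__type(value, u64);	/* wakeup timestamp, ns */
} wake_ts SEC(".maps");

SEC("tp_btf/sched_wakeup")
int BPF_PROG(on_wakeup, struct task_struct *p)
{
	u32 pid = p->pid;
	u64 now = bpf_ktime_get_ns();

	bpf_map_update_elem(&wake_ts, &pid, &now, BPF_ANY);
	return 0;
}

SEC("tp_btf/sched_switch")
int BPF_PROG(on_switch, bool preempt, struct task_struct *prev,
	     struct task_struct *next)
{
	u32 pid = next->pid;
	u64 *ts = bpf_map_lookup_elem(&wake_ts, &pid);

	if (ts) {
		u64 delay = bpf_ktime_get_ns() - *ts;

		/* Flag unusually long runqueue delays for later inspection. */
		if (delay > 20 * 1000 * 1000)
			bpf_printk("pid %d waited %llu ns to run", pid, delay);
		bpf_map_delete_elem(&wake_ts, &pid);
	}
	return 0;
}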

I think it would be great to have a clear list of the current limitations
people see in the scheduler. It could be a failure on my end, but I haven't
seen specifics of the problems and of what was tried and failed, to the point
that it is impossible to move forward. From what I see, I am hitting bugs
here and there all the time, but they are hard to debug well enough to truly
understand where things went wrong. Like this one for example, where
PTHREAD_PRIO_INHERIT is a NOP for fair tasks. Many thought using this flag
simply doesn't help (rather than it being buggy)...

https://lore.kernel.org/lkml/[email protected]/

2024-05-14 20:25:02

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class

On 5/13/24 4:03 AM, Peter Zijlstra wrote:
> On Sun, May 05, 2024 at 01:31:26PM -1000, Tejun Heo wrote:
>
>>> You Google/Facebook are touting collaboration, collaborate on fixing it.
>>> Instead of re-posting this over and over. After all, your main
>>> motivation for starting this was the cpu-cgroup overhead.
>>
>> The hierarchical scheduling overhead isn't the main motivation for us. We
>> can't use the CPU controller for all workloads and while it'd be nice to
>> improve that,
>
> Hurmph, I had the impression from the earlier threads that this ~5%
> cgroup overhead was most definitely a problem and a motivator for all
> this.
>
> The overhead was prohibitive, it was claimed, and you needed a solution.
> Did not previous versions use this very argument in order to push for
> all this?
>
> By improving the cgroup mess -- I very much agree that the cgroup thing
> is not very nice. This whole argument goes away and we all get a better
> cgroup implementation.
>
>> This view works only if you assume that the entire world contains only a
>> handful of developers who can work on schedulers. The only way that would be
>> the case is if the barrier of entry is raised unreasonably high. Sometimes a
>> high barrier of entry can't be avoided or is beneficial. However, if it's
>> pushed up high enough to leave only a handful of people to work on an area
>> as large as scheduling, something probably is wrong.
>
> I've never really felt there were too few sched patches to stare at on
> any one day (quite the opposite on many days in fact).
>
> There have also always been plenty out of tree scheduler patches --
> although I rarely if ever have time to look at them.
>
> Writing a custom scheduler isn't that hard, simply ripping out
> fair_sched_class and replacing it with something simple really isn't
> *that* hard.
>
> The only really hard requirement is respecting affinities, you'll crash
> and burn real hard if you get that wrong (think of all the per-cpu
> kthreads that hard rely on the per-cpu-ness of them).
>
> But you can easily ignore cgroups, uclamp and a ton of other stuff and
> still boot and play around.
>
>> I believe we agree that we want more people contributing to the scheduling
>> area.
>
> I think therein lies the rub -- contribution. If we were to do this
> thing, random loadable BPF schedulers, then how do we ensure people will
> contribute back?
>
> That is, from where I am sitting I see $vendor mandate their $enterprise
> product needs their $BPF scheduler. At which point $vendor will have no
> incentive to ever contribute back.

Especially in the scheduler space, the incentive to contribute back
today is somewhat inverted. As you mention above, it's relatively easy
to make custom things, and it's also very difficult to get features and
patches included. The cost of maintaining patches out of tree is
relatively low in comparison with the cost of working through inclusion,
and the scheduler stands out in terms of how hard it is to land changes.

I think the scheduler balances the needs of a wide variety of workloads
exceptionally well, but based on the volume of out of tree scheduler
infrastructure, it feels like the community is struggling to meet their
collaboration needs in the upstream tree.

Just like I can’t imagine one filesystem working for everything, I think
we need to open up the field a little on schedulers. As we develop for
new variations in workloads, power management, and hardware types, I
think sched_ext gives us a way to do more collaboration in the upstream
tree, and while I’m not pretending it’s perfect, it’s definitely ready
for expansion and broader use.

I do think that sched_ext developers will keep participating upstream,
and I agree with a lot of the points that Steve makes in his reply.
People are going to keep sending patches in because the kernel community
is just the best place to build and maintain this functionality.

-chris

2024-05-14 21:34:22

by David Vernet

[permalink] [raw]
Subject: Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class

On Tue, May 14, 2024 at 01:07:15AM +0100, Qais Yousef wrote:

[...]

> > >
> > > How does this BPF muck translate into better quality patches for me?
> >
> > Here's how we will be using it (we will likely be porting sched_ext to
> > ChromeOS regardless of its acceptance).
> >
> > Doing testing of scheduler changes in the field is extremely time
> > consuming and complex. We tested EEVDF vs CFS by backporting EEVDF to
> > 5.15 (as that is the kernel version we are using on the chromebooks we
> > were testing on), and then we need to add a user space "switch" to
> > change the scheduler. Note, this also risks causing a bug in adding
> > these changes. Then we push the kernel out, and then start our
> > experiment that enables our feature to a small percentage, and slowly
> > increases the number of users until we have a enough for a statistical
> > result.
> >
> > What sched_ext would give us is a easy way to try different scheduling
> > algorithms and get feedback much quicker. Once we determine a solution
> > that improves things, we would then spend the time to implement it in
> > the scheduler, and yes, send it upstream.
> >
> > To me, sched_ext should never be the final solution, but it can be
> > extremely useful in testing various changes quickly in the field. Which
> > to me would encourage more contributions.

Hello Qais,

[...]

> I really don't buy the rapid development aspect too. The scheduler was heavily

There are already several examples from users who have shown that the rapid
development and experimentation is extremely useful. Imagine if you're
iterating on the scheduler to improve p99 frame rates on the Steam Deck, as
Changwoo described. It's much more efficient to be able to just tweak and load
a BPF scheduler (that is safe and can't crash the machine) to try some random
idea out than it is to:

1. Tweak and recompile the kernel
2. Reinstall the kernel on the Steam Deck
3. Reboot the Steam Deck
4. Reload a game and let caches rewarm
5. Measure FPS

You're talking about a 5 second compile job + 1 second to reload a safe BPF
scheduler vs. having to do all of the above steps _and_ potentially making a
mistake that brings the machine down. These benefits are also extremely useful
for testing workloads on production servers, etc. Let’s also not forget that
unlike many other kernel features, you probably can’t get reliable scheduling
results from running in a VM. The experimentation overhead is very real.

[...]

> influenced by the early contributors which come from server market that had
> (few) very specific workloads they needed to optimize for and throughput had
> a heavier weight vs latency. Fast forward to now, things are different. Even on
> server market latency/responsiveness has become more important. Power and
> thermal are important on a larger class of systems now too. I'd dare say even
> on server market. How do you know when it's okay for an app/task to consume too
> much power and when it is not? Hint hint, you can't unless someone in userspace
> tells you. Similarly for latency vs throughput. What is the correct way to
> write an application to provide this info? Then we can ask what is missing in
> the scheduler to enable this.

Hmm, you seem to be arguing that the way forward here is to have our one
general purpose scheduler be entirely driven by user space hinting. Assuming
I’m not misunderstanding you, I strongly disagree with this sentiment. User
space hinting can be powerful, but I think we need to have a general purpose
scheduler that's completely agnostic to whatever is running in user space.
We’ve also been able to get strong results from sched_ext schedulers that don’t
use any user space hinting.

Also, even if this ended up being the way forward, I don’t see it being
practical to implement. Wouldn’t it require us to update all of user space
globally just to update how it interfaces with the scheduler?

[...]

> Note the original min/wakeup_granularity_ns, latency_ns etc were tuned by
> default for throughput by the way (server market bias). You can manipulate
> those and get better latencies.

Those knobs aren't available anymore in EEVDF.

[...]

> point IMO, not the scheduler algorithm. If the latter need to change, it needs
> to be as the result of this friction - which what EEVDF came about from to my
> understanding. To enable implementing a latency interface easier. But Vincent
> had a working implementation with CFS too which I think would have worked fine
> by the way.

This friction is nothing new. It's why we already find ourselves in the
unfortunate position of having a large corpus of out of tree scheduler patches.
If there is a lot of performance being left on the table, vendors are going to
find a way to get that performance. Corporations don't need our consent to ship
kernels with custom schedulers on their devices. They've already been doing it
for years, and it's ultimately the users who suffer.

I genuinely believe that the fair.c scheduler will benefit from being able to
apply ideas conceived in a sched_ext scheduler which end up working well for
general use cases. For example, in scx_rusty, we’re able to get very good
interactivity [0] by determining a task’s deadline as a function of its average
runtime (along with some other great ideas that Changwoo first added to
scx_lavd) rather than from its eligibility + slice as EEVDF does.
Over the course of a day or two, I tried way more ideas that didn’t work than
would have been possible in that time frame with a recompile-reboot cycle,
and ended up finding one that seems to work very well. It would be awesome if
these ideas were added to EEVDF so that everyone can benefit.

[0]: https://drive.google.com/file/d/1fyHt9BYGha6apl7HAkibwpy52UTi8-AQ/view?usp=drive_link
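
To give a rough flavor of the idea (purely illustrative toy code, not the
actual scx_rusty or scx_lavd implementation):

#include <stdint.h>
#include <stdio.h>

struct toy_task {
	uint64_t avg_runtime_ns;	/* EWMA of runtime per schedule */
	uint64_t vtime_ns;		/* task's virtual time */
};

/* Update the running average after the task used @used_ns of CPU. */
static void update_avg_runtime(struct toy_task *t, uint64_t used_ns)
{
	/* new_avg = 7/8 * old_avg + 1/8 * sample */
	t->avg_runtime_ns = (t->avg_runtime_ns * 7 + used_ns) / 8;
}

/*
 * Tasks that historically run briefly (interactive) get earlier deadlines
 * than tasks that historically consume long stretches of CPU.
 */
static uint64_t task_deadline(const struct toy_task *t)
{
	return t->vtime_ns + t->avg_runtime_ns;
}

int main(void)
{
	struct toy_task ui     = { .avg_runtime_ns = 0, .vtime_ns = 1000000 };
	struct toy_task crunch = { .avg_runtime_ns = 0, .vtime_ns = 1000000 };
	int i;

	for (i = 0; i < 10; i++) {
		update_avg_runtime(&ui, 200000);	/* 0.2ms bursts   */
		update_avg_runtime(&crunch, 10000000);	/* 10ms stretches */
	}
	printf("ui deadline:     %llu\n", (unsigned long long)task_deadline(&ui));
	printf("crunch deadline: %llu\n", (unsigned long long)task_deadline(&crunch));
	return 0;
}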

Thanks,
David



2024-05-14 22:06:40

by Josh Don

[permalink] [raw]
Subject: Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class

On Mon, May 13, 2024 at 1:04 AM Peter Zijlstra <[email protected]> wrote:
>
> On Sun, May 05, 2024 at 01:31:26PM -1000, Tejun Heo wrote:
>
> > > You Google/Facebook are touting collaboration, collaborate on fixing it.
> > > Instead of re-posting this over and over. After all, your main
> > > motivation for starting this was the cpu-cgroup overhead.
> >
> > The hierarchical scheduling overhead isn't the main motivation for us. We
> > can't use the CPU controller for all workloads and while it'd be nice to
> > improve that,
>
> Hurmph, I had the impression from the earlier threads that this ~5%
> cgroup overhead was most definitely a problem and a motivator for all
> this.
>
> The overhead was prohibitive, it was claimed, and you needed a solution.
> Did not previous versions use this very argument in order to push for
> all this?
>
> By improving the cgroup mess -- I very much agree that the cgroup thing
> is not very nice. This whole argument goes away and we all get a better
> cgroup implementation.

I talked with pjt to get some historical context on these patches; it
sounds like these patches advocated some performance improvements but had
fairness issues that Paul pointed out. We're happy to help take a look
at this again, but this is all independent of the motivation for
sched_ext. Sounds like we're on the same page about this now though :)

Thus, cgroups are not a primary motivator for sched_ext. However, one
aspect of cgroups is made quite a bit nicer by the pluggable
scheduling. This is the fact that cgroups are a second class citizen
in CFS, because they are still a compile time option, so everything
must be built to support a thread-only model. That makes it really
hard to write group schedulers; fundamentally, task placement, load
balancing, etc. is operating on lists of tasks, not lists of cgroups.
As you can see, improving hierarchical performance of cgroups is nice
to have but not related to this goal.

> Writing a custom scheduler isn't that hard, simply ripping out
> fair_sched_class and replacing it with something simple really isn't
> *that* hard.

Getting that custom scheduler back into upstream is pretty hard
though. I like Chris' analogy of filesystems because it gives a really
good sense of what a bigger ecosystem might look like with schedulers.
It simply is not feasible to implement certain types of behavior in
CFS, because the behavior is too specialized for certain classes of
workloads, and the model/heuristics could not be made to work in a
general purpose scheduling environment, which CFS strives to achieve.
Taking these ideas and putting them into a different scheduling class
is also a bit of a non-starter. The scheduler is optimized to have
most tasks running in a single scheduling class (CFS), and adding
additional classes both adds additional static overhead, as well as
complexity from dealing with the need to ensure non-starvation of
lower priority sched classes (due to the strict priority ordering of
sched classes).

As an example, a few years ago Xi posted a simple new scheduling class
optimized for high frequency context switching
(https://lkml.org/lkml/2019/9/6/177), which was nack'd offline.
There's an argument to be made that some of the functionality there
could have been rolled into RT, but I think it serves as a good
example of the friction of adding even simple new sched classes.

Pluggable scheduling helps to fill the gap here for policies that have
elements making them not a good fit for CFS, given that we don't want
a plethora of new sched classes in the tree. That's not to say that
just because a policy cannot be fully integrated into CFS it has no
benefit for upstream contribution; there are pieces that might make
sense to adapt to CFS. For example, we've been experimenting with a
policy that schedules based on cgroups rather than simply tasks, and
uses that to provide better CCX locality for the cgroup, with better
control of how the group spills over to remote CCX. The group based
scheduling nature cannot be easily integrated into CFS, as described
above, but the CCX scheduling portion could find its way into CFS by
means of a more nuanced evaluation of migration_cost, or new placement
heuristics.
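
As a rough, hypothetical illustration of what I mean by cgroup/CCX-aware
placement (toy code, not our actual policy):

#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS		8
#define CPUS_PER_CCX	4

static int ccx_of(int cpu) { return cpu / CPUS_PER_CCX; }

/*
 * Pick a CPU for a task whose cgroup is "homed" on @home_ccx: prefer an idle
 * CPU inside the home CCX to keep the group's cache footprint together, and
 * only spill to a remote CCX when the home CCX is fully busy.
 */
static int select_cpu(const bool idle[NR_CPUS], int home_ccx)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (idle[cpu] && ccx_of(cpu) == home_ccx)
			return cpu;
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (idle[cpu])
			return cpu;
	return -1;	/* nothing idle; fall back to normal placement */
}

int main(void)
{
	bool idle[NR_CPUS] = { false, false, false, false,
			       true,  false, true,  false };

	/* Home CCX 0 is fully busy, so the task spills to CPU 4 on CCX 1. */
	printf("selected cpu: %d\n", select_cpu(idle, 0));
	return 0;
}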

> But you can easily ignore cgroups, uclamp and a ton of other stuff and
> still boot and play around.

Please don't underestimate the value that swapping policies around at
runtime has. Sure, playing around in a VM isn't bad, but getting
performance data from real hardware can take quite a while between
boots, not including the time to actually restart the workload and
have it warmed up. We're talking several orders of magnitude more
latency in the iterative policy development and analysis portion here.
I imagine it would have been nice, for example, to swap in and out
successive versions of EEVDF while hackbench was actively running, and
observe how each swap out changes running averages of latency, etc.
Even in a world where we committed to having a single scheduler, I
think it would be nice if we had in-tree CFS BPF programs that
implemented several of the hooks, just for the purpose of improving
development velocity. What BPF has done for networking could be made
to improve heuristic heavy areas like select_task_rq, for example.

My point here is that allowing new sched policies is a big benefit,
but also placing those policies in BPF is another (separate) big
benefit that sched_ext provides natively.

Best,
Josh

2024-05-15 20:41:31

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class

Hello, Peter.

On Mon, May 13, 2024 at 10:03:59AM +0200, Peter Zijlstra wrote:
> On Sun, May 05, 2024 at 01:31:26PM -1000, Tejun Heo wrote:
> > The hierarchical scheduling overhead isn't the main motivation for us. We
> > can't use the CPU controller for all workloads and while it'd be nice to
> > improve that,
>
> Hurmph, I had the impression from the earlier threads that this ~5%
> cgroup overhead was most definitely a problem and a motivator for all
> this.
>
> The overhead was prohibitive, it was claimed, and you needed a solution.
> Did not previous versions use this very argument in order to push for
> all this?

Being able to experiment with potential solutions for problems like
hierarchical scheduling overhead is important and something we wanted to
demonstrate for sched_ext. It's true that the current hierarchical
scheduling is too expensive to deploy on certain workloads but as I wrote
before it's also not that difficult to work around and isn't a high priority
problem for us.

> By improving the cgroup mess -- I very much agree that the cgroup thing
> is not very nice. This whole argument goes away and we all get a better
> cgroup implementation.

Improving the cgroup CPU controller performance would be great. However, I
don't see how that'd be an argument against sched_ext. Sure, with sched_ext,
we can easily test out potential ideas which can lower the hierarchical
scheduling overhead but, if anything, that should make us want it more. Why
wouldn't we want to have such ability for other problems too?

> > This view works only if you assume that the entire world contains only a
> > handful of developers who can work on schedulers. The only way that would be
> > the case is if the barrier of entry is raised unreasonably high. Sometimes a
> > high barrier of entry can't be avoided or is beneficial. However, if it's
> > pushed up high enough to leave only a handful of people to work on an area
> > as large as scheduling, something probably is wrong.
>
> I've never really felt there were too few sched patches to stare at on
> any one day (quite the opposite on many days in fact).
>
> There have also always been plenty out of tree scheduler patches --
> although I rarely if ever have time to look at them.
..
> > I believe we agree that we want more people contributing to the scheduling
> > area.
>
> I think therein lies the rub -- contribution. If we were to do this
> thing, random loadable BPF schedulers, then how do we ensure people will
> contribute back?

Everything has cost and benefits. Forcing potential contributors into a
single narrow funnel has the benefit of concentrating the effort as you're
pointing out. However, the cost is that it's a single funnel. In addition to
the inherent downsides of having only one of anything, it can handle only so
much, and pushes people away from even considering contributing.

There are multiple types of contributions. Getting concrete patches into the
main scheduler is one. Trying out wildly different ideas and exploring the
problem space is another. Providing a viable competing implementation can be
an important contribution too by keeping everyone on their toes. If we
concentrate just on direct code contributions, we can lose sight of the
bigger picture, costing us in other areas.

During the short period of time that we've been experimenting with
sched_ext, we've already found multiple fairly generic approaches that show
significant gains. That's not because people who have been playing with
sched_ext have special abilities, but rather because there are plenty of
sometimes obvious things which have been difficult to try with the in-kernel
scheduler. Sure, anyone can modify the kernel, but, without a practical way
to publish, deploy and maintain such modifications, it’s really difficult to
justify such effort when the chance of landing upstream is really low. If
our experience up to this point is any indication, capable engineers who are
interested in the area don't seem to be in particularly short supply. What
is in short supply is an environment in which they can participate, develop
and refine their ideas.

Opportunity cost is often more difficult to appreciate but it is as real as
any cost. While there may be more than enough patches for you to review, we
are leaving a lot of opportunities unpursued and potential contributors
outside the fence because the funnel is too narrow and the barrier of entry
too high. Yes, there are benefits to the current setup where we tell
everyone to contribute to a single code base, but at this point I believe
it's costing us more than it's benefiting us.

> That is, from where I am sitting I see $vendor mandate their $enterprise
> product needs their $BPF scheduler. At which point $vendor will have no
> incentive to ever contribute back.
>
> And customers of $vendor that want to run additional workloads on
> their machine are then stuck with that scheduler, irrespective of it
> being suitable for them or not. This is not a good experience.

The above scenario sounds contrived to me. The situation is already like
this with vendor patched kernels. Just like for patched kernels, the vendor
has to share the code for sched_ext schedulers due to GPL. After all, the
BPF verifier will flat out reject loading any non-GPL programs. In addition,
sched_ext has benefits in terms of user experience. Because sched_ext is
designed to be supplemental to the default scheduler, its users have an easy
out - falling back to CFS/EEVDF by simply unloading the sched_ext scheduler.
With patched kernels, they'd have to reboot and a stock kernel might not
even be available.

> So I don't at all mind people playing around with schedulers -- they can
> do so today, there are a ton of out of tree patches to start or learn
> from, or like I said, it really isn't all that hard to just rip out fair
> and write something new.
>
> Open source, you get to do your own thing. Have at.
>
> But part of what made Linux work so well, is in my opinion the GPL. GPL
> forces people to contribute back -- to work on the shared project. And I
> see the whole BPF thing as a run-around on that.
>
> Even the large cloud vendors and service providers (Amazon, Google,
> Facebook etc.) contribute back because of rebase pain -- as you well
> know. The rebase pain offsets the 'TIVO hole'.

Two things are being conflated here. What GPL gives us is that ideas and
code don't get locked up behind a paywall. If someone based their work on a
GPL project, others get to take a look at what they did to learn and copy
from them. The upstream pressure is a separate mechanism which nudges people
towards upstream because the overhead of rebase is painful regardless of the
license requirements.

The upstream pressure works well but as I wrote above it also can be pushed
too far to the point where it costs rather than benefits long term
development. Controlling too tightly runs the risk of driving changes and
proposals worth considering underground and pushing potential contributors
away. It may be difficult to judge and agree on where the current situation
exactly is but it is not difficult to see signs of stress. Even just for us,
scheduling is one of the common pain points for both server workloads and
Oculus. Talking to other organizations, we hear similar concerns.

You said two conflicting things - that people can have at it as it's open
source but at the same time that even large organizations are forced to the
funnel due to the rebase pain. It's true that even for large organizations,
deviating from upstream is expensive. However, big orgs can still do it
because the benefit usually scales with the number of machines allowing them
to cross the break-even point and thus pay for it.

But the same pain applies to smaller organizations, researchers and
individuals. Imagine how big a deterrence the current situation would be for
them. It's extremely challenging for them to build a user base and community
as it's very awkward to deploy and painful to maintain custom kernels. Some
still persevere but most would be discouraged even from starting if the
prospect of their work being useful is so slim. This limits potential
contributions from a lot of organizations.

CFS / EEVDF is an excellent general purpose scheduler. It obviously is the
most used and most important scheduler in the whole world. It's difficult to
believe that the only way to get enough people to contribute to it is by
suppressing alternatives. The current approach of funneling potential
contributors to a single code base with a very high bar creates a lot of
pain for those potential contributors, and probably feels unnecessarily
punitive to anyone new to the space. If we have to really worry about losing
contributors to the main Linux scheduler just because sched_ext creates an
additional space that interested engineers can work in, something has gone
really wrong and I don't believe that matches the reality.

> But with the BPF muck; where is the motivation to help improve things?
>
> Keeping a rando github repo with BPF schedulers is not contributing.
> That's just a repo with multiple out of tree schedulers to be ignored.
> Who will put in the effort of upstreaming things if they can hack up a
> BPF and throw it over the wall?

I wouldn't be so dismissive about development happening outside the kernel
tree. We already see strong community and collaborations in the SCX repo
which is serving as an umbrella project for the sched_ext schedulers.
Different schedulers are chasing different directions but they actively
learn and borrow from each other. It can definitely serve as an incubator to
prove and refine new ideas which can be adopted widely and to grow
scheduling engineers.

For example, it's still early, but Changwoo's work on interactivity in
scx_lavd seems generally useful and has already been adopted by scx_rusty.
It's something which can easily be applied to EEVDF too. Changwoo may or may
not work on EEVDF directly (he says he wants to) but the code change
necessary is neither big nor difficult. Figuring out what actually works was
the hard part, not the implementation. Not all ideas would be like this but
this serves as a good example of how contribution is not just directly
writing patches and how work outside the tree can benefit the kernel.

> So yeah, I'm very much NOT supportive of this effort. From where I'm
> sitting there is simply not a single benefit. You're not making my life
> better, so why would I care?
>
> How does this BPF muck translate into better quality patches for me?

I'm not sure whether it would make your life better but I firmly believe
that it will benefit overall Linux scheduling in the long term. You don't
necessarily have to care. We'll do our best to ensure that it bothers you as
little as possible.

Maybe I'm mistaken and we won't find much that'd be useful enough for EEVDF
but also maybe there are enough things that we haven't tried that will make
things better for everyone. I believe in the latter and the indications till
now seem to agree. You don't have to share my optimism but wouldn’t it at
least be worthwhile to find out?

Thanks.

--
tejun

2024-05-17 09:59:08

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class

On Tue, May 14, 2024 at 01:07:15AM +0100, Qais Yousef wrote:
> On 05/13/24 14:26, Steven Rostedt wrote:

> > > That is, from where I am sitting I see $vendor mandate their $enterprise
> > > product needs their $BPF scheduler. At which point $vendor will have no
> > > incentive to ever contribute back.
> >
> > Believe me they already have their own scheduler, and because its so
> > different, it's very hard to contribute back.

'They' are free to have their own scheduler, but since 'nobody' is using
it and 'they' want their product to work on RHEL / SLES / etc., they are
therefore bound to respect the common interfaces, no?

> > > So I don't at all mind people playing around with schedulers -- they can
> > > do so today, there are a ton of out of tree patches to start or learn
> > > from, or like I said, it really isn't all that hard to just rip out fair
> > > and write something new.
> >
> > For cloud servers, I bet a lot of schedulers are not public. Although,
> > my company tries to publish the schedulers they use.

Yeah, it's the TIVO thing. Keeping all that private creates the rebase
pain. Outside of that there's nothing we can do.

Anyway, instead of doing magic mushroom schedulers, what does the cloud
crud actually want? I know the KVM people were somewhat looking forward
to the EEVDF sched_attr::sched_runtime extension because virt likes the
longer slices. Less preemption more better for them.

In fact, some of the facebook workloads also wanted longer slices (and
no wakeup preemption).

> > From what I understand (I don't work on production, but Chromebooks), a
> > lot of changes cannot be contributed back because their updates are far
> > from what is upstream. Having a plugable scheduler would actually allow
> > them to contribute *more*.

So can we please start by telling what kind of magic hacks ChromeOS has
and whatfor?

The term contributing seems to mean different things to us. Building an
external scheduler isn't contributing, it's fragmenting.

> > > Keeping a rando github repo with BPF schedulers is not contributing.
> >
> > Agreed, and I would guess having them in the Linux kernel tree would be
> > more beneficial.

Yeah, no. Same thing. It's just a pile of junk until someone puts the
time in to figure out how to properly integrate it. Very much like Qais
argues below.

> > > That's just a repo with multiple out of tree schedulers to be ignored.
> > > Who will put in the effort of upstreaming things if they can hack up a
> > > BPF and throw it over the wall?
> >
> > If there's a place in the Linux kernel tree, I'm sure there would be
> > motivation to place it there. Having it in the kernel proper does give
> > more visibility of code, and therefore enhancements to that code. This
> > was the same rationale for putting perf into the kernel proper.

These things are very much not the same. A pile of random hacks vs a
single unified interface to PMUs. They're like the polar opposite of
one another.

> > > So yeah, I'm very much NOT supportive of this effort. From where I'm
> > > sitting there is simply not a single benefit. You're not making my life
> > > better, so why would I care?
> > >
> > > How does this BPF muck translate into better quality patches for me?
> >
> > Here's how we will be using it (we will likely be porting sched_ext to
> > ChromeOS regardless of its acceptance).
> >
> > Doing testing of scheduler changes in the field is extremely time
> > consuming and complex. We tested EEVDF vs CFS by backporting EEVDF to
> > 5.15 (as that is the kernel version we are using on the chromebooks we

/me mumbles something about necro-kernels...

> > were testing on), and then we need to add a user space "switch" to
> > change the scheduler. Note, this also risks causing a bug in adding
> > these changes. Then we push the kernel out, and then start our
> > experiment that enables our feature to a small percentage, and slowly
> > increases the number of users until we have enough for a statistical
> > result.
> >
> > What sched_ext would give us is an easy way to try different scheduling
> > algorithms and get feedback much quicker. Once we determine a solution
> > that improves things, we would then spend the time to implement it in
> > the scheduler, and yes, send it upstream.

This sounds a little backwards... ok, a lot. How do you do actual
problem analysis in this case? Having random statistics is not really
useful - beyond determining there might be a problem.

The next step is isolating that problem locally and reproducing it. Then
analysing *what* the actual problem is and how it happens, and then try
and think of a solution.

(preferably one that then doesn't break another thing :-)

> > To me, sched_ext should never be the final solution, but it can be
> > extremely useful in testing various changes quickly in the field. Which
> > to me would encourage more contributions.

Well, the thing is, the moment sched_ext itself lands upstream, it will
become the final solution for a fair number of people and leave us, the
wider Linux scheduler community, up a creek without a paddle.

There is absolutely no inherent incentive to further contribute. Your
immediate problem is solved, you get assigned the next problem. That is
reality.

Worse, they can share the BPF hack and get warm fuzzy feeling of
'contribution' while in fact it's useless. At best we know 'random hack
changed something for them'. No problem description, no reproducer, no
nothing.

Anyway, if you feel you need BPF hackery to do this, by all means, do
so. But realize that it is a debug tool and in general we don't merge
debug tools.

Also, I would argue that perhaps a scheduler livepatch would be more
convenient to actually debug / A-B test things.

> I really don't think the problems we have are because of EEVDF vs CFS vs
> anything else. Other major OSes have one scheduler, but what they exceed on is
> providing better QoS interfaces and mechanism to handle specific scenarios that
> Linux lacks.

Quite possibly. The immediate problem being that adding interfaces is
terrifying. Linus has a rather strong opinion about breaking stuff, and
getting this wrong will very quickly result in a paint-into-corner type
problem.

We can/could add fields to sched_attr under the understanding that
they're purely optional and try things, *however* too many such fields
and we're up a creek again.

> The confusion I see again and again over the years is the fragmentation of
> Linux eco system and app writers don't know how to do things properly on Linux
> vs other OSes. Note our CONFIG system is part of this fragmentation.
>
> The addition of more flavours which inevitably will lead to custom QoS specific
> to that scheduler and libraries built on top of it that require that particular
> extension available is a recipe for more confusion and fragmentation.

Yes, this!

> I really don't buy the rapid development aspect too. The scheduler was heavily
> influenced by the early contributors which come from server market that had
> (few) very specific workloads they needed to optimize for and throughput had
> a heavier weight vs latency. Fast forward to now, things are different. Even on
> server market latency/responsiveness has become more important. Power and
> thermal are important on a larger class of systems now too. I'd dare say even
> on server market.

Absolutely, AFAIU racks are both power and thermal limited. There are
some crazy ACPI protocols to manage some of this.

> How do you know when it's okay for an app/task to consume too
> much power and when it is not? Hint hint, you can't unless someone in userspace
> tells you.

Yes, cluster/cloud infrastructure needs to manage that. There is nothing
smart the kernel can do here on its own, except respect the ACPI lunacy
and hard throttle itself when the panic signal comes.

> Similarly for latency vs throughput. What is the correct way to
> write an application to provide this info? Then we can ask what is missing in
> the scheduler to enable this.

Right, so the EEVDF thing is a start here. By providing a per task
request size, applications can indicate if they want frequent and short
activations or more infrequent longer activations.

An application can know its (average) activation time; the kernel has
no clue when work starts and is completed. Applications can fairly
trivially measure this using CLOCK_THREAD_CPUTIME_ID reads before and
after and communicate this (very much like SCHED_DEADLINE).
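
As a rough userspace sketch of that flow: measuring with
CLOCK_THREAD_CPUTIME_ID is plain POSIX, while feeding the result back through
sched_attr::sched_runtime for a SCHED_OTHER task assumes the proposed EEVDF
slice extension mentioned above and may be rejected by kernels that don't
carry it; the work function is just a stand-in.

/* Measure one work item's thread CPU time and advertise it as the task's
 * request size. Note: sched_runtime on SCHED_OTHER relies on the proposed
 * EEVDF extension and is not guaranteed to be accepted. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/types.h>

struct sketch_sched_attr {	/* local definition; layout as the kernel expects */
	__u32 size;
	__u32 sched_policy;
	__u64 sched_flags;
	__s32 sched_nice;
	__u32 sched_priority;
	__u64 sched_runtime;
	__u64 sched_deadline;
	__u64 sched_period;
};

static uint64_t thread_cputime_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

static void one_work_item(void)		/* stand-in for the app's real work */
{
	for (volatile long i = 0; i < 10000000; i++)
		;
}

int main(void)
{
	uint64_t t0 = thread_cputime_ns();
	one_work_item();
	uint64_t t1 = thread_cputime_ns();

	struct sketch_sched_attr attr;
	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_policy = SCHED_OTHER;	/* stay in the fair class */
	attr.sched_runtime = t1 - t0;		/* measured activation time, ns */

	if (syscall(SYS_sched_setattr, 0, &attr, 0))
		perror("sched_setattr");
	return 0;
}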

Anyway, yes, userspace needs to change and provide more information. The
trick of course is figuring out which bit of information is critical /
useful etc.

There is a definite limit on the amount of constraints you want to solve
at runtime.

Everybody going off and hacking their own thing does not help, we need
collaboration to figure out what it is that is needed.

> Note the original min/wakeup_granularity_ns, latency_ns etc were tuned by
> default for throughput by the way (server market bias). You can manipulate
> those and get better latencies.

The immediate problem with those knobs is that they are system wide. But
yes, everybody was randomly poking them knobs, sometimes in obviously
insane ways.

> FWIW IMO the biggest issues I see in the scheduler is that its testability and
> debuggability is hard. I think BPF can be a good fit for that. For the latter
> I started this project, yet I am still trying to figure out how to add tracer
> for the difficult paths to help people more easily report when a bad decision
> has happened to provide more info about the internal state of the scheduler, in
> hope to accelerate the process of finding solutions.

So the pitfalls here are that exposing that information for debug
purposes can/will lead to people consuming this information for
non-debug purposes and then when we want to change things we're stuck
because suddenly someone relies on something we believed was an
implementation detail :/

I've been bitten by this before and this is why I'm so very hesitant to
put tracepoints in the scheduler.

> I think it would be great to have a clear list of the current limitations
> people see in the scheduler. It could be a failure on my end, but I haven't
> seen specifics of problems and what was tried and failed to the point it is
> impossible to move forward.

Right, list, but also ideally reproducers (yeah, I know, really hard).

The moment we merge sched_ext all motivation to do any of this work goes
out the window.

> From what I see, I am hitting bugs here and there
> all the time. But they are hard to debug to truly understand where things went
> wrong. Like this one for example where PTHREAD_PRIO_PI is a NOP for fair tasks.
> Many thought using this flag doesn't help (rather than buggy)..

Yay for the terminal backlog :/ I'll try and have a look.

2024-05-21 00:20:07

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class

Hello, Peter.

A gentle ping. I don't think this subthread reached its conclusion. Would
you mind sharing your thoughts?

Thanks.

--
tejun

2024-05-27 20:29:35

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class

On 05/17/24 11:58, Peter Zijlstra wrote:

> > I really don't think the problems we have are because of EEVDF vs CFS vs
> > anything else. Other major OSes have one scheduler, but what they exceed on is
> > providing better QoS interfaces and mechanism to handle specific scenarios that
> > Linux lacks.
>
> Quite possibly. The immediate problem being that adding interfaces is
> terrifying. Linus has a rather strong opinion about breaking stuff, and
> getting this wrong will very quickly result in a paint-into-corner type
> problem.

We need to move forward though. Let us find an approach and agree with Linus on
what will constitute regressions when things have to disappear.

My general worry is more about default behavior not just interfaces. As pointed
out below, the default behavior favoured throughput for a long time because
folks who care about throughput were more vocal, but it seems we have a silent
majority problem of people who need latency by default but find no path
forward.

And of course it'll never be possible to make them both happy by default.
Regression reports in this area need to consider the wider impact on other
users when deciding whether it needs to be fixed or not.

We do seem to be held hostage sometimes by older systems/workloads which, IMHO,
if they can't move to the new facilities provided, have no good reason to
complain about regressions, as we need to look forward to what new workloads
and systems need by default. The world is moving on too fast - but we can't catch up due to
these regression reports. We need a better balance IMHO.

>
> We can/could add fields to sched_attr under the understanding that
> they're purely optional and try things, *however* too many such fields
> and we're up a creek again.

My personal vision for this is here:

https://lore.kernel.org/lkml/20230916213316.p36nhgnibsidoggt@airbuntu/

We don't need to continue adding new fields, as this is actually a problem when
it comes to integrating with libc (which has yet to provide proper pthread
wrappers for the things we added). A u32 should give a virtually infinite
number of hints. We should be able to deprecate at ease by making a specific
hint type return an error when it is no longer supported (-ENOSYS). We can even
create a uclamp alias for this (called performance_hint, given how widely
people interpret uclamp as a bandwidth hint) and make it the source of QoS
truth.
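
Purely as an illustration of that deprecation-by-ENOSYS idea - every name
below (the hint values and sched_qos_apply()) is hypothetical and does not
exist in the kernel - it could look something like:

/* Hypothetical sketch: one u32 hint space where retired hints simply start
 * returning -ENOSYS, so userspace has to fall back gracefully. */
#include <errno.h>
#include <stdio.h>

enum sched_qos_hint {				/* hypothetical hint namespace */
	SCHED_QOS_HINT_NONE			= 0,
	SCHED_QOS_HINT_LATENCY_SENSITIVE	= 1,
	SCHED_QOS_HINT_POWER_EFFICIENT		= 2,
	SCHED_QOS_HINT_LEGACY_FOO		= 3,	/* imagine this was retired */
};

static int sched_qos_apply(unsigned int hint)	/* hypothetical syscall stand-in */
{
	switch (hint) {
	case SCHED_QOS_HINT_LATENCY_SENSITIVE:
	case SCHED_QOS_HINT_POWER_EFFICIENT:
		return 0;			/* still supported */
	case SCHED_QOS_HINT_LEGACY_FOO:
	default:
		return -ENOSYS;			/* retired or unknown */
	}
}

int main(void)
{
	if (sched_qos_apply(SCHED_QOS_HINT_LEGACY_FOO) == -ENOSYS)
		printf("hint retired, falling back to default behavior\n");
	return 0;
}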

> > Similarly for latency vs throughput. What is the correct way to
> > write an application to provide this info? Then we can ask what is missing in
> > the scheduler to enable this.
>
> Right, so the EEVDF thing is a start here. By providing a per task
> request size, applications can indicate if they want frequent and short
> activations or more infrequent longer activations.
>
> An application can know its (average) activation time; the kernel has
> no clue when work starts and is completed. Applications can fairly
> trivially measure this using CLOCK_THREAD_CPUTIME_ID reads before and
> after and communicate this (very much like SCHED_DEADLINE).

I fear the concept of time in userspace will be hard to get right without some
further help from us, due to DVFS/HMP having a Black Hole effect and causing an
extreme Time Dilation problem. It is actually a problem for schedutil that I am
trying to find a reasonable fix for as part of my magic margins series. On one
system I ran a test on, it took 30ms to take off from util 0. And the system
stayed running at the lowest frequency for 42ms! Our utilization invariance is
very good for estimating compute demand, but terrible for bursty tasks - which
are very common on interactive systems. I think this is a cause of many
'latency' woes in general. Things can end up running slower for longer. But
this is a different problem for a different series/day.

FWIW, even userspace attempts to create dynamic uclamp control are struggling
because they can't measure time reliably. A task can seem happy, but only
because something else on another CPU with a shared policy had a 'heavy' task
running. As soon as that goes to sleep, things look really different from the
task's runtime perspective. It was running super fast by accident.

The average time will be hard to get right in general due to the interactive
nature of some workloads, and in my experience things can have big variations
in practice. I.e., the frame-to-frame variations could be larger than expected.

I think it is a helpful interface, but it won't address all workload demands.
The tasks I care about, for example, really want to run ASAP and that's it. They
could run for a longer period of time, or a shorter period of time. next_buddy
type behavior will help these tasks, but it might need to be stronger than the
current implementation. I am trying to find out..

And oversubscribed scenarios are important. It is common to have a sudden surge
of activity that causes delays and requires the load balancer's help to better
distribute tasks, as some CPUs can get less busy sooner, but wakeup preemption
won't get those already-enqueued tasks CPU time ASAP without some additional
external trigger. In fact, I do see problems today (older CFS LTS kernel) where
a surge of short-running tasks can delay enqueued tasks considerably. I have no
clue what's going on yet. I don't have a reproducer, but it creeps up often
enough when I look at traces.

Generally I think latency is more important on the majority of systems these
days, and it might be better to default to a more responsive system and let
those who want throughput opt in, rather than the other way around.

In theory, there should be (very) few tasks in the system that actually
need the next_buddy type of behavior to skip the queue and run ASAP if the
average default latency is good (1-2ms).

I also think we need to enable HRTICK by default. We will have more timely
preemption points then. I generally think sched_feat should be a good way to
give admins the power to control certain aspects of the scheduler. We can also
make uclamp a sched_feat and ensure it can be made available on any system
- unlike today, where it's not enabled by default (on Debian at least), which
hit me. It looks like the Asahi folks, who managed to get good power
improvements from it, thankfully carry more weight than me in asking for this
to be enabled by default. Read the Energy Aware Scheduling section here

https://asahilinux.org/2024/01/fedora-asahi-new/

As a general topic for discussion, not just for the scheduler, there are core
features that must always be there from the programmer's perspective. We are
shooting ourselves in the foot here by being too flexible with our usage of
CONFIGs, IMHO :)

Too much random babbling from my side maybe, but I think there's a series of
seemingly independent issues that are actually interconnected, with one leading
to the other, while people keep trying to find the one root cause which I don't
think exists.

>
> Anyway, yes, userspace needs to change and provide more information. The
> trick ofcourse is figuring out which bit of information is critical /
> useful etc.
>
> There is a definite limit on the amount of constraints you want to solve
> at runtime.

+1

>
> Everybody going off and hacking their own thing does not help, we need
> collaboration to figure out what it is that is needed.

+2 - I've been trying to snoop on many use cases to further understand what
truly goes wrong. Some of the issues I've seen were actually due to bugs in the
kernel. Other issues could already be fixed with existing facilities, but
users didn't know how to use them. So the task is not easy to untangle.

>
> > Note the original min/wakeup_granularity_ns, latency_ns etc were tuned by
> > default for throughput by the way (server market bias). You can manipulate
> > those and get better latencies.
>
> The immediate problem with those knobs is that they are system wide. But
> yes, everybody was randomly poking them knobs, sometimes in obviously
> insane ways.

Yes. I fear though that because they were system wide, the
out-of-the-box experience for many (especially with CFS defaults) was bad
latencies. I like the new 3ms base_slice_ns, but I think for many who care
about 120Hz refresh for example this is too large. It's almost half of the
frame time.

The TICK value plays a big role too. On a 4ms TICK, this 3ms will become 4ms if
wakeup preemption decides not to preempt immediately.

Is this 3ms a constant, by the way? I see it still depends on NR_CPUS, but
I read it on different systems and got 3ms. I think having a constant value
across all systems makes more sense. With EAS (which I think someone should put
effort into enabling for SMP systems) we tend to pack. And a lot of systems have
too few CPUs, so things being packed is the common case - I think the rationale
in the past was that we distribute tasks to idle CPUs at wake up, which is good
for latency, but I don't know if this is still a good assumption to make when
deciding these values.

And it looks like we have a bug. I didn't spend a lot of time studying EEVDF's
impact on latencies, but I had this simple run with my pi_test [1]. You need
sched-analyzer/sched-analyzer-pp somewhere in your path, which you can download
from [2]. Set up Perfetto traced [3]. Running on a 6.8.8 M1 Mac Mini:

./run.sh 0 0 0

=====================================
:: 2255 | pi_test | ['./pi_test'] ::
====================================================================================================
----------------------------------------------------------------------------------------------------
───────────────────────────── Sum Time in State Exclude Sleeping (ms) ──────────────────────────────
R ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4533.25
Running ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4533.75

────────────────────────────── % Time in State Exclude Sleeping (ms) ───────────────────────────────
R ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 50.0
Running ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 50.0

─────────────────────────────────── Sum Time Running on CPU (ms) ───────────────────────────────────
CPU0.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4533.75

──────────────────────────────────── % Time Running on CPU (ms) ────────────────────────────────────
CPU0.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 100.0

Time in State (ms):
----------------------------------------------------------------------------------------------------
count mean std min 50% 75% 90% 95% 99% max
state
R 1149.0 3.95 1.12 -0.0 4.0 4.0 6.0 6.01 7.0 7.01
Running 1149.0 3.95 1.11 0.0 4.0 4.0 6.0 6.00 7.0 7.00

Time Running on CPU (ms):
----------------------------------------------------------------------------------------------------
count mean std min 50% 75% 90% 95% 99% max
cpu
0.0 1149.0 3.95 1.11 0.0 4.0 4.0 6.0 6.0 7.0 7.0



=========================================
:: 2257 | pi_test_low | ['./pi_test'] ::
====================================================================================================
----------------------------------------------------------------------------------------------------
───────────────────────────── Sum Time in State Exclude Sleeping (ms) ──────────────────────────────
R ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4533.89
Running ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4533.11

────────────────────────────── % Time in State Exclude Sleeping (ms) ───────────────────────────────
R ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 50.0
Running ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 50.0

─────────────────────────────────── Sum Time Running on CPU (ms) ───────────────────────────────────
CPU0.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4533.11

──────────────────────────────────── % Time Running on CPU (ms) ────────────────────────────────────
CPU0.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 100.0

Time in State (ms):
----------------------------------------------------------------------------------------------------
count mean std min 50% 75% 90% 95% 99% max
state
R 1154.0 3.93 1.14 0.0 4.0 4.0 6.0 6.0 7.0 7.0
Running 1154.0 3.93 1.13 -0.0 4.0 4.0 6.0 6.0 7.0 7.0

Time Running on CPU (ms):
----------------------------------------------------------------------------------------------------
count mean std min 50% 75% 90% 95% 99% max
cpu
0.0 1154.0 3.93 1.13 -0.0 4.0 4.0 6.0 6.0 7.0 7.0

Note that the average RUNNING (R is for RUNNABLE) time is ~4ms instead of 3ms.
Our P90 and max values are almost double the 3ms slice. I am running with a 1ms
TICK, so I think there has to be a bug somewhere preventing timely preemption..

If I enable HRTICK, it looks much better:

=====================================
:: 2517 | pi_test | ['./pi_test'] ::
====================================================================================================
----------------------------------------------------------------------------------------------------
───────────────────────────── Sum Time in State Exclude Sleeping (ms) ──────────────────────────────
R ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4911.05
Running ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4911.96

────────────────────────────── % Time in State Exclude Sleeping (ms) ───────────────────────────────
R ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 50.0
Running ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 50.0

─────────────────────────────────── Sum Time Running on CPU (ms) ───────────────────────────────────
CPU0.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4911.96

──────────────────────────────────── % Time Running on CPU (ms) ────────────────────────────────────
CPU0.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 100.0

Time in State (ms):
----------------------------------------------------------------------------------------------------
count mean std min 50% 75% 90% 95% 99% max
state
R 1645.0 2.99 0.22 0.0 3.0 3.0 3.0 3.0 3.01 3.97
Running 1646.0 2.98 0.17 -0.0 3.0 3.0 3.0 3.0 3.00 3.01

Time Running on CPU (ms):
----------------------------------------------------------------------------------------------------
count mean std min 50% 75% 90% 95% 99% max
cpu
0.0 1646.0 2.98 0.17 -0.0 3.0 3.0 3.0 3.0 3.0 3.01



=========================================
:: 2519 | pi_test_low | ['./pi_test'] ::
====================================================================================================
----------------------------------------------------------------------------------------------------
───────────────────────────── Sum Time in State Exclude Sleeping (ms) ──────────────────────────────
R ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4912.11
Running ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4910.89

────────────────────────────── % Time in State Exclude Sleeping (ms) ───────────────────────────────
R ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 50.01
Running ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 49.99

─────────────────────────────────── Sum Time Running on CPU (ms) ───────────────────────────────────
CPU0.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4910.89

──────────────────────────────────── % Time Running on CPU (ms) ────────────────────────────────────
CPU0.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 100.0

Time in State (ms):
----------------------------------------------------------------------------------------------------
count mean std min 50% 75% 90% 95% 99% max
state
R 1644.0 2.99 0.19 -0.00 3.0 3.0 3.0 3.0 3.01 3.95
Running 1643.0 2.99 0.16 0.51 3.0 3.0 3.0 3.0 3.00 3.01

Time Running on CPU (ms):
----------------------------------------------------------------------------------------------------
count mean std min 50% 75% 90% 95% 99% max
cpu
0.0 1643.0 2.99 0.16 0.51 3.0 3.0 3.0 3.0 3.0 3.01

[1] https://github.com/qais-yousef/pi_test
[2] https://github.com/qais-yousef/sched-analyzer/releases
[3] https://github.com/qais-yousef/sched-analyzer?tab=readme-ov-file#perfetto-mode

>
> > FWIW IMO the biggest issues I see in the scheduler is that its testability and
> > debuggability is hard. I think BPF can be a good fit for that. For the latter
> > I started this project, yet I am still trying to figure out how to add tracer
> > for the difficult paths to help people more easily report when a bad decision
> > has happened to provide more info about the internal state of the scheduler, in
> > hope to accelerate the process of finding solutions.
>
> So the pitfalls here are that exposing that information for debug
> purposes can/will lead to people consuming this information for
> non-debug purposes and then when we want to change things we're stuck
> because suddenly someone relies something we believed was an
> implementation detail :/
>
> I've been bitten by this before and this is why I'm so very hesitant to
> put tracepoints in the scheduler.

I was hoping the 'bare' tracepoint approach I added is okay? I don't need more
than that. Function signature and structure internals can never be ABIs.
I already had to deal with util_est changes across kernel versions.
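
(For context, these 'bare' tracepoints are the DECLARE_TRACE() ones such as
pelt_cfs_tp - no trace event format, no ABI; a module just registers a probe
against the raw arguments, roughly like the sketch below. Module boilerplate
is trimmed to the essentials and the cfs_rq layout is deliberately treated as
opaque, since it is not exported and can change between kernel versions.)

/* Sketch of a module consuming a bare scheduler tracepoint. There is no
 * trace event format here and nothing resembling an ABI - only the raw
 * arguments of the tracepoint. */
#include <linux/module.h>
#include <trace/events/sched.h>

static void probe_pelt_cfs(void *data, struct cfs_rq *cfs_rq)
{
	/* A real consumer would use BTF/CO-RE or kernel-internal headers to
	 * look inside cfs_rq, accepting breakage across versions. */
	trace_printk("pelt_cfs_tp fired\n");
}

static int __init sketch_init(void)
{
	return register_trace_pelt_cfs_tp(probe_pelt_cfs, NULL);
}

static void __exit sketch_exit(void)
{
	unregister_trace_pelt_cfs_tp(probe_pelt_cfs, NULL);
	tracepoint_synchronize_unregister();
}

module_init(sketch_init);
module_exit(sketch_exit);
MODULE_LICENSE("GPL");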

If our emperor penguin is reading, it'd be great if he has new thoughts on
debug features and userspace dependency. I think we really need to help people
to better debug and understand why things aren't behaving as they anticipate.
Or at least make it easier to provide info on the list to help us understand
what could have gone wrong.

Your concerns are real, but they should not prevent the code from moving on
without worrying about breakages. If anyone latched onto those, I hope we can
tell them sorry, but this one is expected breakage.. I think by design the bare
tracepoints can never be ABI though.

> > From what I see, I am hitting bugs here and there
> > all the time. But they are hard to debug to truly understand where things went
> > wrong. Like this one for example where PTHREAD_PRIO_PI is a NOP for fair tasks.
> > Many thought using this flag doesn't help (rather than buggy)..
>
> Yay for the terminal backlog :/ I'll try and have a look.

It seems hard to fix without Proxy Execution :( If you have ideas for
a temporary solution, that'd be great. But it looks like we just need to get PE
merged and available for users - I think John's series doesn't tie this to
futex_pi yet.

2024-05-27 21:25:55

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class

On 05/14/24 16:34, David Vernet wrote:
> On Tue, May 14, 2024 at 01:07:15AM +0100, Qais Yousef wrote:
>
> [...]
>
> > > >
> > > > How does this BPF muck translate into better quality patches for me?
> > >
> > > Here's how we will be using it (we will likely be porting sched_ext to
> > > ChromeOS regardless of its acceptance).
> > >
> > > Doing testing of scheduler changes in the field is extremely time
> > > consuming and complex. We tested EEVDF vs CFS by backporting EEVDF to
> > > 5.15 (as that is the kernel version we are using on the chromebooks we
> > > were testing on), and then we need to add a user space "switch" to
> > > change the scheduler. Note, this also risks causing a bug in adding
> > > these changes. Then we push the kernel out, and then start our
> > > experiment that enables our feature to a small percentage, and slowly
> > > increases the number of users until we have enough for a statistical
> > > result.
> > >
> > > What sched_ext would give us is an easy way to try different scheduling
> > > algorithms and get feedback much quicker. Once we determine a solution
> > > that improves things, we would then spend the time to implement it in
> > > the scheduler, and yes, send it upstream.
> > >
> > > To me, sched_ext should never be the final solution, but it can be
> > > extremely useful in testing various changes quickly in the field. Which
> > > to me would encourage more contributions.
>
> Hello Qais,
>
> [...]
>
> > I really don't buy the rapid development aspect too. The scheduler was heavily
>
> There are already several examples from users who have shown that the rapid
> development and experimentation is extremely useful. Imagine if you're
> iterating on the scheduler to improve p99 frame rates on the Steam Deck, as
> Changwoo described. It's much more efficient to be able to just tweak and load
> a BPF scheduler (that is safe and can't crash the machine) to try some random
> idea out than it is to:
>
> 1. Tweak and recompile the kernel
> 2. Reinstall the kernel on the Steam Deck
> 3. Reboot the Steam Deck
> 4. Reload a game and let caches rewarm
> 5. Measure FPS
>
> You're talking about a 5 second compile job + 1 second to reload a safe BPF
> scheduler vs. having to do all of the above steps _and_ potentially making a
> mistake that brings the machine down. These benefits are also extremely useful
> for testing workloads on production servers, etc. Let’s also not forget that
> unlike many other kernel features, you probably can’t get reliable scheduling
> results from running in a VM. The experimentation overhead is very real.

What I read here is that I can hack my system quickly. Is the intention to
extend the kernel? If yes, I can't see how this experimentation is actually
valid if it is not implemented in the kernel first, taking into account the real
constraints that you have to deal with sooner or later.

>
> [...]
>
> > influenced by the early contributors which come from server market that had
> > (few) very specific workloads they needed to optimize for and throughput had
> > a heavier weight vs latency. Fast forward to now, things are different. Even on
> > server market latency/responsiveness has become more important. Power and
> > thermal are important on a larger class of systems now too. I'd dare say even
> > on server market. How do you know when it's okay for an app/task to consume too
> > much power and when it is not? Hint hint, you can't unless someone in userspace
> > tells you. Similarly for latency vs throughput. What is the correct way to
> > write an application to provide this info? Then we can ask what is missing in
> > the scheduler to enable this.
>
> Hmm, you seem to be arguing that the way forward here is to have our one
> general purpose scheduler be entirely driven by user space hinting. Assuming
> I’m not misunderstanding you, I strongly disagree with this sentiment. User
> space hinting can be powerful, but I think we need to have a general purpose
> scheduler that's completely agnostic to whatever is running in user space.
> We’ve also been able to get strong results from sched_ext schedulers that don’t
> use any user space hinting.

I'm curious. If you believe in a general purpose system, what work was done to
improve the current one? What debugging and analysis was done to improve the
current situation? It seems you reached a conclusion that we need something
different - but gave no reasons why that is.

Is the problem with the default behavior of the system? Or are your problems
focused on corner cases where things seem to fail?

>
> Also, even if this ended up being the way forward, I don’t see it being
> practical to implement. Wouldn’t it require us to update all of user space

People swear by Apple's GCD, by the way. It'd be really great if someone could
create something similar that works properly on Linux. I have never tried the
libdispatch port to see how well it does.

And have you seen these?

https://developer.android.com/stories/games/mediatek-adpf

> globally just to update how it interfaces with the scheduler?

I think you're confusing default scheduler behavior and dealing with corner
cases that are impossible for the scheduler to resolve. These corner cases are
when help is needed. Note that the thermal API is actually info from the system
to the app. If the app decides to listen, then they can help reduce the thermal
impact without causing throttling. If they decide not to listen, then the best
the system can do is throttle everything hard to protect from damage. And under
bad thermal pressure, the scheduler can know which tasks to prioritize for
performance if it has explicit knowledge/hints.

If the default behavior is not working for you, could you provide more details
on what goes wrong? It's unlikely that a new algorithm is the solution, but
likely a bug somewhere or some configuration problem.

And if someone wants to optimize for best perf, power and thermal, they need to
do the work. There's only so much you can do on their behalf that is actually
scalable.

System designers want apps (all type of apps) to take best advantage of the
hardware they built.

App writers want to write portable software that gives the desired experience
on all type of systems without special optimization.

>
> [...]
>
> > Note the original min/wakeup_granularity_ns, latency_ns etc were tuned by
> > default for throughput by the way (server market bias). You can manipulate
> > those and get better latencies.
>
> Those knobs aren't available anymore in EEVDF.

I generalized my statement as I didn't expect many to have moved to 6.6 LTS, which
is the only one that has EEVDF.

EEVDF has base_slice_ns. What value do you read on your system? What is your
TICK value and how many CPUs do you have?

>
> [...]
>
> > point IMO, not the scheduler algorithm. If the latter need to change, it needs
> > to be as the result of this friction - which what EEVDF came about from to my
> > understanding. To enable implementing a latency interface easier. But Vincent
> > had a working implementation with CFS too which I think would have worked fine
> > by the way.
>
> This friction is nothing new. It's why we already find ourselves in the
> unfortunate position of having a large corpus of out of tree scheduler patches.
> If there is a lot of performance being left on the table, vendors are going to
> find a way to get that performance. Corporations don't need our consent to ship
> kernels with custom schedulers on their devices. They've already been doing it
> for years, and it's ultimately the users who suffer.

I think everyone agrees on the need to improve. But..

>
> I genuinely believe that the fair.c scheduler will benefit from being able to
> apply ideas conceived in a sched_ext scheduler which end up working well for
> general use cases. For example, in scx_rusty, we’re able to get very good
> interactivity [0] by determining a task’s deadline as a function of its average

I am really failing to see why you jumped to the conclusion that we need a new
scheduler. And you'll find a lot of skepticism about the validity of your
results. We have no clue what kind of unknown constraints you've left out with
these tests, or how limited your environment is.

> runtime (along with some other great ideas that Changwoo first added to
> scx_lavd) rather than from its eligibility + slice as with what EEVDF does.

You'll soon find that the concept of runtime is hard. And generally there's
a big soup of tasks running in the system, most of which have no real deadline.
Only a few do. And corner cases are the complexity of any situation where you
have more tasks that need to run immediately than you have CPUs to distribute
them on. I don't think we can figure this out automatically based on runtime.

> Over the course of a day or two, I tried way more ideas that didn’t work than
> would have been possible in that time frame than with a recompile-reboot cycle,
> and ended up finding one that seems to work very well. It would be awesome if
> these ideas were added to EEVDF so that everyone can benefit.

Why do you think they're applicable? And how do you know you're not working
around different problems? Or that you haven't missed constraints in your
testing which, once applied, will make the whole results invalid?

Too many unknowns IMHO. I am not against a different scheduler algorithm if it
proves to be a more generic default. But you'll find first that you have to
explain what has failed in the current one and what kind of analysis made you
reach this conclusion. And then you'll find you need to actually do it in the
kernel, taking into account all the constraints that you must handle, to prove
it is still as valid as you initially thought.

And I can only share my experience, I don't think the algorithm itself is the
bottleneck here. The devil is in the corner cases. And these are hard to deal
with without explicit hints.

The biggest issue I see generally with the default behavior is that
traditionally it has been biased towards throughput because those folks are
the ones that keep reporting regressions when anything changes on the list.

Please add your voice and report problems when you notice things don't work for
you. That's the best way to ensure there's visibility of these issues. It seems
to me you're hitting problems in areas that people expect to work, but I have no
clue what problems you have. I am not sure if this was reported somewhere else, but
it seems not.


Thanks!

--
Qais Yousef

2024-05-28 23:46:29

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class

Hello,

BTW, David is off for the week and might be a bit slow to respond. I just
want to comment on one part.

On Mon, May 27, 2024 at 10:25:40PM +0100, Qais Yousef wrote:
..
> And I can only share my experience, I don't think the algorithm itself is the
> bottleneck here. The devil is in the corner cases. And these are hard to deal
> with without explicit hints.

Our perceptions of the scope of the problem space seem very different. To
me, it seems pretty unexplored. Here's just one area: Constantly increasing
number of cores and popularization of more complex cache hierarchies.

Over a hundred CPUs in a system is fairly normal now with a couple layers of
cache hierarchy. Once we have so many, things can look a bit different from
the days when we had a few. Flipping the approach so that we can dynamically
assign close-by CPUs to related groups of threads becomes attractive.

e.g. If you have a bunch of services which aren't latency critical but are
needed to maintain system integrity (updates, monitoring, security and so
on), soft-affining them to a number of CPUs while allowing some CPU headroom
can give you noticeable gain both in performance (partly from cleaner
caches) and power consumption while not adding that much to latency. This is
something the scheduler can and, I believe, should do transparently.
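
As a purely conceptual sketch (plain userspace C, every name illustrative -
this is not how the kernel or scx_layered implements it), the trivial part of
that idea, the per-wakeup placement choice, looks like the code below: prefer
an idle CPU in the group's preferred set and spill out to the rest of the
allowed mask once the preferred set has no idle CPU left, so latency doesn't
suffer. The hard parts are everything around it, as discussed next.

/* Conceptual soft-affinity placement sketch; all data is made up. */
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 8

static int pick_idle_in(const bool *idle, const bool *mask)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (mask[cpu] && idle[cpu])
			return cpu;
	return -1;
}

static int soft_affine_pick(const bool *idle, const bool *preferred,
			    const bool *allowed)
{
	int cpu = pick_idle_in(idle, preferred);	/* keep caches clean */

	if (cpu < 0)
		cpu = pick_idle_in(idle, allowed);	/* headroom: spill out */
	return cpu;	/* -1: nothing idle, leave it to normal balancing */
}

int main(void)
{
	bool idle[NR_CPUS]      = { 0, 0, 1, 1, 0, 1, 0, 0 };
	bool preferred[NR_CPUS] = { 1, 1, 0, 0, 0, 0, 0, 0 };
	bool allowed[NR_CPUS]   = { 1, 1, 1, 1, 1, 1, 1, 1 };

	printf("picked CPU %d\n", soft_affine_pick(idle, preferred, allowed));
	return 0;
}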

It's not obvious how to do it though. It doesn't quite fit the current LB
model. cgroup hierarchy seems to provide some hints on how threads can be
grouped but the boundaries might not match that well. Even if we figure out
how to define these groups, figuring out group-vs-group competition isn't
trivial (naive load-sums don't work when comparing across groups spanning
multiple CPUs).

Also, what about the threads with oddball cpumasks? Should we begin to treat
CPUs more like other resources, e.g., memory? We don't generally allow
applications to specify which specific physical pages they get because that
doesn't buy anything while adding a lot of constraints. If we have dozens
and hundreds of CPUs, are there fundamental reasons to view them differently
from other resources which are treated fungible?

The claim that the current scheduler has the fundamentals all figured out
and it's mostly about handling edge cases and educating users seems wildly
off mark to me.

Maybe we can develop all that in the current framework in a gradual fashion,
but when the problem space is so wide open, that is not a good approach to
take. The cost of constricting is likely significantly higher than the
benefits of having a single code base. Imagine having to develop all the
features of btrfs in the ext2 code base. It's probably doable, at least
theoretically, but that would have been massively stifling, maybe to the
point of most of it not happening.

To the above particular problem of soft-affinity, scx_layered has something
really simple and dumb implemented and we're testing and deploying it in the
fleet with noticeable perf gains, and there are early efforts to see whether
we can automatically figure out grouping based on the cgroup hierarchy and
possibly minimal xattr hints on them.

I don't yet know what generic form soft-affinity should take eventually,
but, with sched_ext, we have a way to try out different ideas in production
and iterate on them learning each step of the way. Given how generic both
the problem and benefits from solving it are, we'll have to reach some
generic solution at one point. Maybe it will come from sched_ext or maybe it
will come from people working on fair like yourself. Either way, sched_ext
is already showing us what can be achieved and prodding people towards
solving it.

Thanks.

--
tejun

2024-05-29 22:10:00

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class

On 05/28/24 13:46, Tejun Heo wrote:
> Hello,
>
> BTW, David is off for the week and might be a bit slow to respond. I just
> want to comment on one part.
>
> On Mon, May 27, 2024 at 10:25:40PM +0100, Qais Yousef wrote:
> ...
> > And I can only share my experience, I don't think the algorithm itself is the
> > bottleneck here. The devil is in the corner cases. And these are hard to deal
> > with without explicit hints.
>
> Our perceptions of the scope of the problem space seem very different. To
> me, it seems pretty unexplored. Here's just one area: Constantly increasing
> number of cores and popularization of more complex cache hierarchies.
>
> Over a hundred CPUs in a system is fairly normal now with a couple layers of
> cache hierarchy. Once we have so many, things can look a bit different from
> the days when we had a few. Flipping the approach so that we can dynamically
> assign close-by CPUs to related groups of threads becomes attractive.

I actually had this use case in mind for the sched-qos [1] idea I am trying to
develop. There are workloads that can benefit if 2 or 3 tasks are kept within
the closest cache. And I think we can describe that with a hint.

I was thinking of borrowing the cookie concept from core scheduling to tag a
group of tasks via the hint and trying to find reasonable higher-level behavior
that we can translate correctly onto different systems.

>
> e.g. If you have a bunch of services which aren't latency critical but are
> needed to maintain system integrity (updates, monitoring, security and so
> on), soft-affining them to a number of CPUs while allowing some CPU headroom
> can give you noticeable gain both in performance (partly from cleaner
> caches) and power consumption while not adding that much to latency. This is
> something the scheduler can and, I believe, should do transparently.

This looks similar to what I am trying to do with uclamp_max and extending the
load balancer to allow balancing workloads based on power - keeping in mind
freeing resources for tasks that need performance too. I don't think we can fix
this problem with wake-up balancing only. The system is in constant flux, and we
need the load balancer to make corrections when other things wake up, and we
need better decisions to be made.

Generally, if we had EAS-type behavior available for SMP systems, where we
don't distribute by default but try to pack based on compute demand - plus
a hint to tell us that some tasks really want to be spread, as an exception for
those that packing really hurts - I think we'd be in a much better place to
be able to distribute resources like you describe.

>
> It's not obvious how to do it though. It doesn't quite fit the current LB
> model. cgroup hierarchy seems to provide some hints on how threads can be
> grouped but the boundaries might not match that well. Even if we figure out

cgroups are too aggressive IMHO. We really need per-task hints. It's coarse- vs
fine-grained hinting. There's only so much classification you can give to
a larger group of tasks, especially if you can't control the codebase of this
group of tasks.

Some people can get invested in tuning specific apps. But this is not scalable
and it is fragile.

> how to define these groups, figuring out group-vs-group competition isn't
> trivial (naive load-sums don't work when comparing across groups spanning
> multiple CPUs).

I think the implementation is trickier than the definition. There is a lot
of demand to keep the fast path as fast as possible, and making smarter
decisions will get expensive. Personally I think we have abundant compute
power today, and the challenge is how to distribute resources smartly, which
justifies slowing things down in favour of making better choices. But, to be
honest, I don't know how much of that we can afford.

Generally, as I was telling David, the people who tend to come forward to
support or complain are those who have pure throughput in mind. Maybe I am
wrong, but my perception is that a lot of decisions have been biased this
way. We need to be more vocal about our needs to make sure that things move
in the right direction. It's hard to help a use case or fix a problem you
don't know about.

>
> Also, what about the threads with oddball cpumasks? Should we begin to treat
> CPUs more like other resources, e.g., memory? We don't generally allow
> applications to specify which specific physical pages they get because that
> doesn't buy anything while adding a lot of constraints. If we have dozens
> and hundreds of CPUs, are there fundamental reason to view them differently
> from other resources which are treated fungible?

I'd be more than happy to see affinity and cpuset disappear :) But I fear it
might be a little too late...

Couldn't an SELinux rule or a syscall filter be used to block userspace
from playing with affinity?
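
For instance, a syscall filter could do it per process tree. A minimal
sketch using libseccomp (link with -lseccomp); it obviously doesn't
affect cpusets or in-kernel users:

#include <errno.h>
#include <seccomp.h>

/* Make sched_setaffinity() fail with EPERM for the calling thread and
 * everything it subsequently spawns, leaving all other syscalls
 * alone. */
static int deny_affinity(void)
{
        scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);
        int ret;

        if (!ctx)
                return -ENOMEM;

        ret = seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM),
                               SCMP_SYS(sched_setaffinity), 0);
        if (!ret)
                ret = seccomp_load(ctx);

        seccomp_release(ctx);
        return ret;
}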

I'm assuming you're not referring to in-kernel usage of affinity, which
might be worth scrutinizing too. But we have more control over that in
general and can improve it when a problem arises.

>
> The claim that the current scheduler has the fundamentals all figured out
> and it's mostly about handling edge cases and educating users seems wildly
> off mark to me.

I don't think anyone claimed that. But EEVDF or CFS is about how tasks
enqueued on a CPU are ordered and run; it's not about selecting which CPU to
run a task on.

EAS modifies the selection algorithm (which is not what David was talking about
IIUC). It seems your problems are more with CPU selection then?

>
> Maybe we can develop all that in the current framework in a gradual fashion,
> but when the problem space is so wide open, that is not a good approach to
> take. The cost of constricting is likely significantly higher than the
> benefits of having a single code base. Imagine having to develop all the
> features of btrfs in the ext2 code base. It's probably doable, at least
> theoretically, but that would have been massively stifling, maybe to the
> point of most of it not happening.
>
> To the above particular problem of soft-affinity, scx_layered has something

What does "layered" refer to here? Is it akin to different sched classes?

> really simple and dumb implemented and we're testing and deploying it in the
> fleet with noticeable perf gains, and there are early efforts to see whether
> we can automatically figure out grouping based on the cgroup hierarchy and
> possibly minimal xattr hints on them.
>
> I don't yet know what generic form soft-affinity should take eventually,
> but, with sched_ext, we have a way to try out different ideas in production
> and iterate on them learning each step of the way. Given how generic both
> the problem and benefits from solving it are, we'll have to reach some
> generic solution at one point. Maybe it will come from sched_ext or maybe it
> will come from people working on fair like yourself. Either way, sched_ext
> is already showing us what can be achieved and prodding people towards
> solving it.

To be honest, this doesn't look any different from all the hacks out there
that do the same thing. The path I see this going down is the same one
I mentioned above, where some people manually tune for a specific usage.
I really struggle to see how this will be generally applicable later; all
I see is divergence and parallel universes - which will ultimately hurt the
user, as Linux behavior simply stops being predictable.

This Linus rant [2] is relevant to the situation. In this case, people who
write applications will simply find that Linux is not reliable, because no
two systems behave the same.

[1] https://lore.kernel.org/lkml/20230916213316.p36nhgnibsidoggt@airbuntu/
[2] https://lore.kernel.org/lkml/CAHk-=wgtb7y-bEh7tPDvDWru7ZKQ8-KMjZ53Tsk37zsPPdwXbA@mail.gmail.com/


Thanks!

--
Qais Yousef

2024-05-30 16:49:22

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class

Hello,

It has been a couple weeks, so I take it that you aren't intending to
respond. I think it'd be useful to summarize the arguments against sched_ext
and list the counter-points.

(1) Merging sched_ext will weaken the incentive to contribute.

While this may partially be true, it isn't looking at the whole picture.
This argument looks at the costs of sched_ext while ignoring the benefits,
and it ignores the costs of funneling all scheduler work through one
codebase.

If you look at the whole picture, I think you’ll see that:

- The problem space of CPU scheduling is too big for a single code base to
be effective. Hardware has changed a lot and so have the workloads. There
are many areas that we haven't mapped out. It's difficult to try anything
radical in a code base which has to satisfy everyone all the time, and
holding the bar so high that experimentation is suppressed means we will
all be worse off.

- The bar for contribution is too high, driving away potential contributors.
Many vendors and users carry internal patches as the upstreaming cost is
too high. We are already seeing multiple developers who have not
previously contributed to fair.c actively participating in and driving
sched_ext schedulers. It’s possible those developers will eventually
contribute to fair.c, but if sched_ext didn’t exist this would be less
likely.

The constraint of only one scheduler codebase makes it very difficult to
contribute. You say that this constraint is necessary to force
collaboration, but I think the opposite is happening - many people don't
bother trying to contribute because the bar is too high. If sched_ext is
merged, the scheduler code base may lose some of that forced collaboration.
However, in the longer term, I believe we will gain more talented and
motivated engineers working in the problem space, and some of them will
surely find it worthwhile to contribute to fair.c. It will remain the most
widely used scheduler in the world no matter what, and will stay attractive
for people to work on.

EEVDF worked out because you have worked on the scheduler for a long time
and have gained a ton of context on what works and doesn't. It also worked
out because you were more confident that it'd get merged. How do we build
confidence in other developers who want to explore whatever comes after
EEVDF without worrying that it is hopeless to try? sched_ext provides an
outlet for people who aren't already established to take a smaller risk
first, which is likely to lead to more people contributing.

(2) Efforts and developments out of the kernel tree are worthless.

I believe this is too narrow a view. Direct contribution is one form of
contribution but there are many others, including research. EEVDF itself is
based on a research paper. Figuring out what works and sharing those
findings seems as important as anything to me.

One reason cited for this supposed uselessness is that out-of-tree efforts
are often throw-away and don't build up to anything. There is some truth to
this, but the main reason is the difficulty of working with out-of-tree
kernel modifications. Rebasing is painful and there is no convenient way to
distribute the result to users. Some still power through, but it's near
impossible to build a user base and community for things that are
out-of-tree. sched_ext solves these problems, and the umbrella repo serves
as the central place for developers to collaborate and learn from each
other. This isn't a prediction about the future; it is something which is
already actively happening.

Given the right environment, these efforts will keep flourishing and
finding new ways to improve scheduling. Many of them won't be applicable to
the built-in scheduler, but some will. It's also likely that, in the long
term, the larger scheduler developer base will directly benefit the built-in
scheduler too.

(3) This will lead to vendor-specific fragmentation.

This is already happening with or without sched_ext, whether in the form of
out-of-tree scheduler patches or of people trying to circumvent the
scheduler with creative uses of the RT class.

sched_ext will introduce a different mode of doing it. There are scenarios
where the situation can become a bit worse, but I don't believe the
difference would be drastic. Because all sched_ext schedulers have to be
under the GPL, any vendor shipping a sched_ext scheduler to a customer will
have to publish the code. If there are useful ideas, we'll be just as free
to take them as we are now. Also, users get the benefit that it's a lot
easier to opt out of a vendor's scheduler.

On balance, yes, sched_ext may lead to more or at least different types of
fragmentation, but that seems like a minor downside compared to the overall
benefits especially given that we have to live with some level of
fragmentation no matter what.

(4) sched_ext is a debug tool and we don't merge debug tools.

I think both parts of the above claim are wrong. sched_ext can be used
purely as a debug tool but it's also performant and flexible enough to
readily enable non-trivial practical use cases. We are using it in
production today, and as stated elsewhere in this thread, there are multiple
other companies in various stages of rolling it out to production. It can be
a debug tool, a temporary bridge to field early ideas while working on
something more permanent, a proper solution to specific problems which don't
quite fit the general scheduler (an extreme example would be
standard-dictated scheduling for avionics), and so on.

Also, we merge debug tools all the time. Lockdep is a debug tool. The code
base is full of debug features and components. Why wouldn't we merge
something if it makes the lives of the developers and users better by making
it easier to understand and debug problems? We don't merge printks someone
sprinkled over the code base to debug one particular problem. We do and
should merge tools and frameworks which improve visibility and debugging.


To reiterate our proposition: let's please open this up. Scheduling doesn't
have to be this closed. Many open subsystems survive fine and often thrive
thanks to their openness. sched_ext hooks into core scheduling, but the
contact surface is limited, and, if those hooks ever get in the way, we'll
do our best to resolve the conflicts. The balance of the trade-offs seems
pretty obvious.

Thanks.

--
tejun

2024-06-11 21:35:45

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class

[ Tejun reminded me about this, and discussion hasn't really gone
anywhere for much too long, so now I just need to be the person who
makes a decision and people can hate on ]

On Wed, 1 May 2024 at 08:13, Tejun Heo <[email protected]> wrote:
>
> This is v6 of sched_ext (SCX) patchset.
>
> During the past five months, both the development and adoption of sched_ext
> have been progressing briskly. Here are some highlights around adoption:
[...]

I honestly see no reason to delay this any more. This whole patchset
was the major (private) discussion at last year's kernel maintainer
summit, and I don't find any value in having the same discussion
(whether off-list or as an actual event) at the upcoming maintainer
summit one year later, so to make any kind of sane progress, my
current plan is to merge this for 6.11.

At least that way, we're making progress, and the discussion at KS
2024 can be about my mental acuity - or lack thereof - rather than
about rehashing the same thing that clearly made no progress last
year.

I've never been a huge believer in trying to make everybody happy with
code that is out of tree - we're better off working together on it
in-tree.

And using the "in order to accept this, some other thing has to be
fixed first" argument doesn't really work well either (and _that_ has
been discussed for over a decade at various maintainer summits).

Maybe the people who have concerns about this can work on those
concerns when it's in-tree.

I'm also not a believer in the argument that has been used (multiple
times) that the BPF scheduler would keep people from participating in
scheduler development. I personally think the main thing that keeps
people from participating is too high barriers to participation.

Anyway, this is the heads-up to Tejun to please just send me a pull
request for the next merge window.

And for everybody else as a "It's happening" heads-up.

[ Please just mentally insert the "IT'S HAPPENING" meme gif here -
because if I actually were to include it here, lkml would just reject
this email. Sometimes the anti-html rules don't work in our favor ].

Linus

2024-06-13 23:39:03

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class

Hello,

On Tue, Jun 11, 2024 at 02:34:56PM -0700, Linus Torvalds wrote:
> Anyway, this is the heads-up to Tejun to please just send me a pull
> request for the next merge window.

As there have been some bug fixes and minor updates since v6, I'll refresh
and post the v7 patchset and start the sched_ext korg tree off of it. To be
on the safe side, I'll separate out parts which interact with other
subsystems (e.g. cgroup and schedutil) into their own patchsets and handle
them as normal patches targeting the sched_ext tree.

Thank you so much.

--
tejun