Overview
--------
This patch set proposes a new scheduler class called ‘ext_sched_class’, or
sched_ext, which allows scheduling policies to be implemented as BPF programs.
More details will be provided on the overall architecture of sched_ext
throughout the various patches in this set, as well as in the “How” section
below. We realize that this patch set is a significant proposal, so we will be
going into depth in the following “Motivation” section to explain why we think
it’s justified. That section is laid out as follows, touching on three main
axes where we believe that sched_ext provides significant value:
1. Ease of experimentation and exploration: Enabling rapid iteration of new
scheduling policies.
2. Customization: Building application-specific schedulers which implement
policies that are not applicable to general-purpose schedulers.
3. Rapid scheduler deployments: Non-disruptive swap outs of scheduling
policies in production environments.
After the motivation section, we’ll provide a more detailed (but still
high-level) overview of how sched_ext works.
Motivation
----------
1. Ease of experimentation and exploration
*Why is exploration important?*
Scheduling is a challenging problem space. Small changes in scheduling
behavior can have a significant impact on various components of a system, with
the corresponding effects varying widely across different platforms,
architectures, and workloads.
While complexities have always existed in scheduling, they have increased
dramatically over the past 10-15 years. In the mid-late 2000s, cores were
typically homogeneous and further apart from each other, with the criteria for
scheduling being roughly the same across the entire die.
Systems in the modern age are by comparison much more complex. Modern CPU
designs, where the total power budget of all CPU cores often far exceeds the
power budget of the socket, with dynamic frequency scaling, and with or
without chiplets, have significantly expanded the scheduling problem space.
Cache hierarchies have become less uniform, with Core Complex (CCX) designs
such as recent AMD processors having multiple shared L3 caches within a single
socket. Such topologies resemble NUMA sans persistent NUMA node stickiness.
Use-cases have become increasingly complex and diverse as well. Applications
such as mobile and VR have strict latency requirements to avoid missing
deadlines that impact user experience. Stacking workloads in servers is
constantly pushing the demands on the scheduler in terms of workload isolation
and resource distribution.
Experimentation and exploration are important for any non-trivial problem
space. However, given the recent hardware and software developments, we
believe that experimentation and exploration are not just important, but
_critical_ in the scheduling problem space.
Indeed, other approaches in industry are already being explored. AMD has
proposed an experimental patch set [0] which enables userspace to provide
hints to the scheduler via “Userspace Hinting”. The approach adds a prctl()
API which allows callers to set a numerical “hint” value on a struct
task_struct. This hint is then optionally read by the scheduler to adjust the
cost calculus for various scheduling decisions.
[0]: https://lore.kernel.org/lkml/[email protected]/
Huawei have also expressed interest [1] in enabling some form of programmable
scheduling. While we’re unaware of any patch sets which have been sent to the
upstream list for this proposal, it similarly illustrates the need for more
flexibility in the scheduler.
[1]: https://lore.kernel.org/bpf/[email protected]/
Additionally, Google has developed ghOSt [2] with the goal of enabling custom,
userspace driven scheduling policies. Prior presentations at LPC [3] have
discussed ghOSt and how BPF can be used to accelerate scheduling.
[2]: https://dl.acm.org/doi/pdf/10.1145/3477132.3483542
[3]: https://lpc.events/event/16/contributions/1365/
*Why can’t we just explore directly with CFS?*
Experimenting with CFS directly or implementing a new sched_class from scratch
is of course possible, but is often difficult and time consuming. Newcomers to
the scheduler often require years to understand the codebase and become
productive contributors. Even for seasoned kernel engineers, experimenting
with and upstreaming features can take a very long time. The iteration process
itself is also time consuming, as testing scheduler changes on real hardware
requires reinstalling the kernel and rebooting the host.
Core scheduling is an example of a feature that took a significant amount of
time and effort to integrate into the kernel. Part of the difficulty with core
scheduling was the inherent mismatch in abstraction between the desire to
perform core-wide scheduling, and the per-cpu design of the kernel scheduler.
This caused issues, for example ensuring proper fairness between the
independent runqueues of SMT siblings.
The high barrier to entry for working on the scheduler is an impediment to
academia as well. Master’s/PhD candidates who are interested in improving the
scheduler will spend years ramping up, only to complete their degrees just as
they’re finally ready to make significant changes. A lower entrance barrier
would allow researchers to more quickly ramp up, test out hypotheses, and
iterate on novel ideas. Research methodology is also severely hampered by the
high barrier of entry to make modifications; for example, the Shenango [4] and
Shinjuku scheduling policies used sched affinity to replicate the desired
policy semantics, due to the difficulty of incorporating these policies into
the kernel directly.
[4]: https://www.usenix.org/system/files/nsdi19-ousterhout.pdf
The iterative process itself also imposes a significant cost to working on the
scheduler. Testing changes requires developers to recompile and reinstall the
kernel, reboot their machines, rewarm their workloads, and then finally rerun
their benchmarks. Though some of this overhead could potentially be mitigated
by enabling schedulers to be implemented as kernel modules, a machine crash or
subtle system state corruption is always only one innocuous mistake away.
These problems are exacerbated when testing production workloads in a
datacenter environment as well, where multiple hosts may be involved in an
experiment, requiring a significantly longer ramp-up time. Warming up memcache
instances in the Meta production environment takes hours, for example.
*How does sched_ext help with exploration?*
sched_ext attempts to address all of the problems described above. In this
section, we’ll describe the benefits to experimentation and exploration that
are afforded by sched_ext, provide real-world examples of those benefits, and
discuss some of the trade-offs and considerations in our design choices.
One of our main goals was to lower the barrier to entry for experimenting with
the scheduler. sched_ext provides ergonomic callbacks and helpers to ease
common operations such as managing idle CPUs, scheduling tasks on arbitrary
CPUs, handling preemptions from other scheduling classes, and more. While
sched_ext does require some ramp-up, the complexity is self-contained, and the
learning curve gradual. Developers can ramp up by first implementing simple
policies such as global FIFO in only tens of lines of code, and then continue
to learn the APIs and building blocks available with sched_ext as they build
more featureful and complex schedulers.
Another critical advantage provided by sched_ext is the use of BPF. BPF
provides strong safety guarantees by statically analyzing programs at load
time to ensure that they cannot corrupt or crash the system. sched_ext
guarantees system integrity no matter what BPF scheduler is loaded, and
provides mechanisms to safely disable the current BPF scheduler and migrate
tasks back to a trusted scheduler. For example, in-kernel safety mechanisms
guarantee that a misbehaving scheduler cannot indefinitely starve tasks. BPF
also enables sched_ext to significantly improve
iteration speed for running experiments. Loading and unloading a BPF scheduler
is simply a matter of running and terminating a sched_ext binary.
BPF also provides programs with a rich set of APIs, such as maps, kfuncs, and
BPF helpers. In addition to providing useful building blocks to programs that
run entirely in kernel space (such as many of our example schedulers), these
APIs also allow programs to leverage user space in making scheduling
decisions. Specifically, the Atropos sample scheduler has a relatively simple
FIFO scheduling layer in BPF, paired with a load balancing component in
userspace written in Rust. As described in more detail below, we also built a
more general user-space scheduling framework called “rhone” by leveraging
various BPF features.
On the other hand, BPF does have shortcomings, as can be plainly seen from the
complexity in some of the example schedulers. scx_example_pair.bpf.c
illustrates this point well. To start, it requires a good amount of code to
emulate cgroup-local-storage. In the kernel proper, this would simply be a
matter of adding another pointer to the struct cgroup, but in BPF, it requires
a complex juggling of data amongst multiple different maps, a good amount of
boilerplate code, and some unwieldy bpf_loop() calls and atomics. The code is also
littered with explicit and often unnecessary sanity checks to appease the
verifier.
That being said, BPF is being rapidly improved. For example, Yonghong Song
recently upstreamed a patch set [5] to add a cgroup local storage map type,
allowing scx_example_pair.bpf.c to be simplified. There are plans to address
other issues as well, such as providing statically-verified locking, and
avoiding the need for unnecessary sanity checks. Addressing these shortcomings
is a high priority for BPF, and as progress continues to be made, we expect
most deficiencies to be addressed in the not-too-distant future.
[5]: https://lore.kernel.org/bpf/[email protected]/
Yet another exploration advantage of sched_ext is helping to widen the scope
of experiments. For example, sched_ext makes it easy to defer CPU assignment
until a task starts executing, allowing schedulers to share scheduling queues
at any granularity (hyper-twin, CCX and so on). Additionally, higher level
frameworks can be built on top to further widen the scope. For example, the
aforementioned “rhone” [6] library allows implementing scheduling policies in
user-space by encapsulating the complexity around communicating scheduling
decisions with the kernel. This allows taking advantage of a richer
programming environment in user-space, enabling experimenting with, for
instance, more complex mathematical models.
[6]: https://github.com/Decave/rhone
sched_ext also allows developers to leverage machine learning. At Meta, we
experimented with using machine learning to predict whether a running task
would soon yield its CPU. These predictions can be used to aid the scheduler
in deciding whether to keep a runnable task on its current CPU rather than
migrating it to an idle CPU, with the hope of avoiding unnecessary cache
misses. Using a tiny neural net model with only one hidden layer of size 16,
and a decaying count of 64 syscalls as a feature, we were able to achieve a
15% throughput improvement on an Nginx benchmark, with an 87% inference
accuracy.
2. Customization
This section discusses how sched_ext can enable users to run workloads on
application-specific schedulers.
*Why deploy custom schedulers rather than improving CFS?*
Implementing application-specific schedulers and improving CFS are not
conflicting goals. Scheduling features explored with sched_ext which yield
beneficial results, and which are sufficiently generalizable, can and should
be integrated into CFS. However, CFS is fundamentally designed to be a general
purpose scheduler, and thus is not conducive to being extended with some
highly targeted application or hardware specific changes.
Targeted, bespoke scheduling has many potential use cases. For example, VM
scheduling can make certain optimizations that are infeasible in CFS due to
the constrained problem space (scheduling a static number of long-running
VCPUs versus an arbitrary number of threads). Additionally, certain
applications might want to make targeted policy decisions based on hints
directly from the application (for example, a service that knows the different
deadlines of incoming RPCs).
Google has also experimented with some promising, novel scheduling policies.
One example is “central” scheduling, wherein a single CPU makes all scheduling
decisions for the entire system. This allows most cores on the system to be
fully dedicated to running workloads, and can have significant performance
improvements for certain use cases. For example, central scheduling with VCPUs
can avoid expensive vmexits and cache flushes, by instead delegating the
responsibility of preemption checks from the tick to a single CPU. See
scx_example_central.bpf.c for a simple example of a central scheduling policy
built in sched_ext.
Some workloads also have non-generalizable constraints which enable
optimizations in a scheduling policy which would otherwise not be feasible.
For example, VM workloads at Google typically have a low overcommit ratio
compared to the number of physical CPUs. This allows the scheduler to support
bounded tail latencies, as well as longer blocks of uninterrupted time.
Yet another interesting use case is the scx_example_cgfifo[7] scheduler,
which provides FIFO policies for individual workloads, and a flattened
hierarchical vtree for cgroups. This scheduler does not account for
thundering herd problems among cgroups, and therefore may not be suitable
for inclusion in CFS. However, in a simple benchmark using wrk[8] on apache
serving a CGI script calculating sha1sum of a small file, it outperformed
CFS by ~3% with CPU controller disabled and by ~10% with two apache
instances competing with 2:1 weight ratio nested four level deep.
Note that the scx_example_cgfifo[7] scheduler isn't included in this
patchset because it depends on the BPF rbtree which is still being
developed. The linked version is an earlier draft. We will forward-port and
include it in the series once the BPF rbtree support is in place.
[7]: https://github.com/htejun/sched_ext/commit/f2fcd3147fb6286e0a35fcbed33c3bac69546a96
[8]: https://github.com/wg/wrk
Certain industries require specific scheduling behaviors that do not apply
broadly. For example, ARINC 653 defines scheduling behavior that is widely
used by avionic software, and some out-of-tree implementations
(https://ieeexplore.ieee.org/document/7005306) have been built. While the
upstream community may decide to merge one such implementation in the future,
it would also be entirely reasonable to not do so given the narrowness of
use-case, and non-generalizable, strict requirements. Such cases can be well
served by sched_ext in all stages of the software development lifecycle --
development, testing, deployment and maintenance.
There are also classes of policy exploration, such as machine learning, or
responding in real-time to application hints, that are significantly harder
(and not necessarily appropriate) to integrate within the kernel itself.
*Won’t this increase fragmentation?*
We acknowledge that to some degree, sched_ext does run the risk of increasing
the fragmentation of scheduler implementations. However, we believe that
enabling the larger ecosystem to innovate through exploration will ultimately
accelerate the overall development and performance of Linux.
Additionally, our licensing and API stability policies should incentivize
users to upstream their schedulers.
BPF programs are required to be GPLv2, which is enforced by the verifier on
program loads. With regards to API stability, just as with other semi-internal
interfaces such as BPF kfuncs, we won’t be providing any API stability
guarantees to BPF schedulers. While we intend to make an effort to provide
compatibility when possible, we will not provide any explicit, strong
guarantees as the kernel typically does with e.g. UAPI headers. For users who
decide to keep their schedulers out-of-tree, the licensing and maintenance
overheads will be fundamentally the same as for carrying out-of-tree patches.
With regards to the schedulers included in this patch set, and any other
schedulers we implement in the future, both Meta and Google will open-source
all of the schedulers we implement which have any relevance to the broader
upstream community. We expect that some of these, such as the example
schedulers and scx_example_cgfifo scheduler, will be upstreamed as part of
the kernel tree. Distros will be able to package and release these
schedulers with the kernel, allowing users to utilize these schedulers
out-of-the-box without requiring any additional work or dependencies such as
clang or building the scheduler programs themselves. Other schedulers and
scheduling frameworks such as rhone may be open-sourced through separate
per-project repos.
3. Rapid scheduler deployments
Rolling out kernel upgrades is a slow and iterative process. At a large scale
it can take months to roll a new kernel out to a fleet of servers. While this
latency is expected and inevitable for normal kernel upgrades, it can become
highly problematic when kernel changes are required to fix bugs. Livepatch [9]
is available to quickly roll out critical security fixes to large fleets, but
the scope of changes that can be applied with livepatching is fairly limited,
and would likely not be usable for patching scheduling policies. With
sched_ext, new scheduling policies can be rapidly rolled out to production
environments.
[9]: https://www.kernel.org/doc/html/latest/livepatch/livepatch.html
As an example, one of the variants of the L1 Terminal Fault (L1TF) [10]
vulnerability allows a VCPU running a VM to read arbitrary host kernel
memory for pages in L1 data cache. The solution was to implement core
scheduling, which ensures that tasks running as hypertwins have the same
“cookie”.
[10]: https://www.intel.com/content/www/us/en/architecture-and-technology/l1tf.html
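The cookie rule described above can be pictured with a small user-space C model. This is only a sketch, not the kernel's core scheduling code: struct model_task, model_can_coschedule() and model_pick_for_sibling() are hypothetical names invented for illustration. It assumes that an idle sibling is always compatible, and that a sibling with no matching candidate is left idle.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a task; only the core scheduling cookie matters. */
struct model_task {
	unsigned long cookie;
};

/* Two hypertwins may run together if either is idle (NULL) or cookies match. */
bool model_can_coschedule(const struct model_task *a, const struct model_task *b)
{
	if (!a || !b)
		return true;
	return a->cookie == b->cookie;
}

/*
 * Pick a task for one sibling given what the other sibling is running.
 * Returns the first compatible candidate, or NULL to leave the sibling idle.
 */
const struct model_task *model_pick_for_sibling(const struct model_task *running,
						const struct model_task *cands,
						size_t n)
{
	assert(cands != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
		if (model_can_coschedule(running, &cands[i]))
			return &cands[i];
	return NULL; /* no compatible task: keep the sibling idle */
}
```

In this model, a sibling whose twin runs a task with cookie 1 skips a candidate with cookie 2 and picks the next candidate with cookie 1, so tasks with different cookies never share a core.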
While core scheduling works well, it took a long time to finalize and land
upstream. This long rollout period was painful, and required organizations to
make difficult choices amongst a bad set of options. Some companies such as
Google chose to implement and use their own custom L1TF-safe scheduler, others
chose to run without hyper-threading enabled, and yet others left
hyper-threading enabled and crossed their fingers.
Once core scheduling was upstream, organizations had to upgrade the kernels on
their entire fleets. As downtime is not an option for many, these upgrades had
to be gradually rolled out, which can take a very long time for large fleets.
An example of a sched_ext scheduler that illustrates core scheduling
semantics is scx_example_pair.bpf.c, which co-schedules pairs of tasks from
the same cgroup, and is resilient to L1TF vulnerabilities. While this example
scheduler is certainly not suitable for production in its current form, a
similar scheduler that is more performant and featureful could be written and
deployed if necessary.
Rapid scheduling deployments can similarly be useful to quickly roll out new
scheduling features without requiring kernel upgrades. At Google, for example,
it was observed that some low-priority workloads were causing degraded
performance for higher-priority workloads due to consuming a disproportionate
share of memory bandwidth. While a temporary mitigation was to use sched
affinity to limit the footprint of this low-priority workload to a small
subset of CPUs, a preferable solution would be to implement a more featureful
task-priority mechanism which automatically throttles lower-priority tasks
which are causing memory contention for the rest of the system. Implementing
this in CFS and rolling it out to the fleet could take a very long time.
sched_ext would directly address these gaps. If another hardware bug or
resource contention issue comes in that requires scheduler support to
mitigate, sched_ext can be used to experiment with and test different
policies. Once a scheduler is available, it can quickly be rolled out to as
many hosts as necessary, and function as a stop-gap solution until a
longer-term mitigation is upstreamed.
How
---
sched_ext is a new sched_class which allows scheduling policies to be
implemented in BPF programs.
sched_ext leverages BPF’s struct_ops feature to define a structure which
exports function callbacks and flags to BPF programs that wish to implement
scheduling policies. The struct_ops structure exported by sched_ext is struct
sched_ext_ops, and is conceptually similar to struct sched_class. The role of
sched_ext is to map the complex sched_class callbacks to the more simple and
ergonomic struct sched_ext_ops callbacks.
Unlike some other BPF program types which have ABI requirements due to
exporting UAPIs, struct_ops has no ABI requirements whatsoever. This provides
us with the flexibility to change the APIs provided to schedulers as
necessary. BPF struct_ops is also already being used successfully in other
subsystems, such as in support of TCP congestion control.
The only struct_ops field that is required to be specified by a scheduler is
the ‘name’ field. Otherwise, sched_ext will provide sane default behavior,
such as automatically choosing an idle CPU on the task wakeup path if
.select_cpu() is missing.
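The "only 'name' is required, everything else has a default" behavior can be sketched as a plain C ops table. This is a user-space model, not the real struct sched_ext_ops: struct model_ops, model_register() and default_select_cpu() are hypothetical names, and the default policy shown (prefer the previous CPU if idle, else the first idle CPU) is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 4 /* arbitrary CPU count for this model */

/* Hypothetical, heavily simplified stand-in for struct sched_ext_ops. */
struct model_ops {
	const char *name; /* the only required field */
	int (*select_cpu)(int prev_cpu, const int *cpu_idle); /* optional */
};

/*
 * Fallback used when .select_cpu() is missing: keep prev_cpu if it is idle,
 * otherwise take the first idle CPU, otherwise stay on prev_cpu.
 */
int default_select_cpu(int prev_cpu, const int *cpu_idle)
{
	assert(prev_cpu >= 0 && prev_cpu < NR_CPUS);

	if (cpu_idle[prev_cpu])
		return prev_cpu;
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpu_idle[cpu])
			return cpu;
	return prev_cpu;
}

/* Returns 0 on success, -1 if the required 'name' field is missing. */
int model_register(struct model_ops *ops)
{
	if (!ops->name || ops->name[0] == '\0')
		return -1;
	if (!ops->select_cpu)
		ops->select_cpu = default_select_cpu; /* fill in the default */
	return 0;
}
```

A scheduler that sets nothing but .name still ends up with a usable select_cpu callback after registration, which is the property the paragraph above describes.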
*Dispatch queues*
To bridge the workflow imbalance between the scheduler core and sched_ext_ops
callbacks, sched_ext uses simple FIFOs called dispatch queues (dsq's). By
default, there is one global dsq (SCX_DSQ_GLOBAL), and one local per-CPU dsq
(SCX_DSQ_LOCAL). SCX_DSQ_GLOBAL is provided for convenience and need not be
used by a scheduler that doesn't require it. As described in more detail
below, SCX_DSQ_LOCAL is the per-CPU FIFO that sched_ext pulls from when
putting the next task on the CPU. The BPF scheduler can manage an arbitrary
number of dsq's using scx_bpf_create_dsq() and scx_bpf_destroy_dsq().
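As a rough picture of what a dsq is, the following user-space C sketch models one as a bounded FIFO of task ids, plus a tiny table of scheduler-created queues. The names model_create_dsq(), model_destroy_dsq(), model_dsq_push() and model_dsq_pop() are hypothetical and are not the signatures of scx_bpf_create_dsq() or scx_bpf_destroy_dsq(). The fixed capacities are assumptions that keep the example short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DSQ_CAPACITY 8  /* arbitrary bound for this model */
#define MAX_USER_DSQS 4 /* arbitrary bound for this model */

/* A dsq modeled as a ring buffer of task ids, consumed in FIFO order. */
struct model_dsq {
	bool in_use;
	int tasks[DSQ_CAPACITY];
	size_t head; /* index of the oldest queued task */
	size_t len;  /* number of queued tasks */
};

struct model_dsq user_dsqs[MAX_USER_DSQS];

/* Allocate an empty dsq; returns its id, or -1 if none are free. */
int model_create_dsq(void)
{
	for (int id = 0; id < MAX_USER_DSQS; id++) {
		if (!user_dsqs[id].in_use) {
			user_dsqs[id] = (struct model_dsq){ .in_use = true };
			return id;
		}
	}
	return -1;
}

/* Release a dsq so that its id can be reused. */
void model_destroy_dsq(int id)
{
	assert(id >= 0 && id < MAX_USER_DSQS);
	user_dsqs[id].in_use = false;
}

/* Append a task at the tail; returns false if the queue is full. */
bool model_dsq_push(struct model_dsq *q, int task)
{
	if (q->len == DSQ_CAPACITY)
		return false;
	q->tasks[(q->head + q->len) % DSQ_CAPACITY] = task;
	q->len++;
	return true;
}

/* Remove the oldest task; returns false if the queue is empty. */
bool model_dsq_pop(struct model_dsq *q, int *task)
{
	if (q->len == 0)
		return false;
	*task = q->tasks[q->head];
	q->head = (q->head + 1) % DSQ_CAPACITY;
	q->len--;
	return true;
}
```

Tasks come out in the order they were pushed, which is the only ordering guarantee this model attempts to capture.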
*Scheduling cycle*
The following briefly shows a typical workflow for how a waking task is
scheduled and executed.
1. When a task is waking up, .select_cpu() is the first operation invoked.
This serves two purposes. First, it allows a scheduler to optimize task
placement by specifying a CPU where it expects the task to eventually be
scheduled. Second, the selected CPU will be woken if it’s idle.
2. Once the target CPU is selected, .enqueue() is invoked. It can make one of
the following decisions:
- Immediately dispatch the task to either the global dsq (SCX_DSQ_GLOBAL)
or the current CPU’s local dsq (SCX_DSQ_LOCAL).
- Immediately dispatch the task to a user-created dispatch queue.
- Queue the task on the BPF side, e.g. in an rbtree map for a vruntime
scheduler, with the intention of dispatching it at a later time from
.dispatch().
3. When a CPU is ready to schedule, it first looks at its local dsq. If empty,
it invokes .consume() which should make one or more scx_bpf_consume() calls
to consume tasks from dsq's. If a scx_bpf_consume() call succeeds, the CPU
has the next task to run and .consume() can return. If .consume() is not
defined, sched_ext will by-default consume from only the built-in
SCX_DSQ_GLOBAL dsq.
4. If there's still no task to run, .dispatch() is invoked which should make
one or more scx_bpf_dispatch() calls to dispatch tasks from the BPF
scheduler to one of the dsq's. If one or more tasks have been dispatched,
go back to the previous consumption step.
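Steps 3 and 4 of the cycle above can be sketched as a loop in plain C. This is a user-space model under simplifying assumptions, not sched_ext code: there is one CPU, the "BPF side" is a third FIFO, .consume() falls back to pulling from the global dsq, and .dispatch() releases every held task. All identifiers (struct model_sched, model_consume(), model_dispatch(), model_pick_next()) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define QCAP 8 /* arbitrary queue capacity for this model */

struct fifo {
	int buf[QCAP];
	int head, len;
};

bool fifo_push(struct fifo *q, int task)
{
	if (q->len == QCAP)
		return false;
	q->buf[(q->head + q->len) % QCAP] = task;
	q->len++;
	return true;
}

bool fifo_pop(struct fifo *q, int *task)
{
	if (q->len == 0)
		return false;
	*task = q->buf[q->head];
	q->head = (q->head + 1) % QCAP;
	q->len--;
	return true;
}

struct model_sched {
	struct fifo local;    /* stands in for this CPU's SCX_DSQ_LOCAL */
	struct fifo global;   /* stands in for SCX_DSQ_GLOBAL */
	struct fifo bpf_side; /* tasks the policy queued itself in .enqueue() */
};

/* Step 3: move one task from the global dsq to the local dsq. */
bool model_consume(struct model_sched *s)
{
	int task;

	if (s->local.len == QCAP || !fifo_pop(&s->global, &task))
		return false;
	return fifo_push(&s->local, task);
}

/* Step 4: release held tasks to the global dsq; returns how many moved. */
int model_dispatch(struct model_sched *s)
{
	int task, n = 0;

	while (s->global.len < QCAP && fifo_pop(&s->bpf_side, &task)) {
		fifo_push(&s->global, task);
		n++;
	}
	return n;
}

/* Returns the next task id to run, or -1 if the CPU should go idle. */
int model_pick_next(struct model_sched *s)
{
	int task;

	for (;;) {
		if (fifo_pop(&s->local, &task))
			return task;   /* local dsq is always checked first */
		if (model_consume(s))
			continue;      /* local dsq now holds a task */
		if (model_dispatch(s) == 0)
			return -1;     /* nothing left anywhere */
		/* at least one task was dispatched: retry consumption */
	}
}
```

With one task in each of the three queues, successive calls return the local task, then the global one, then the one held on the BPF side, and finally -1.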
*Verifying callback behavior*
sched_ext always verifies that any value returned from a callback is valid,
and will issue an error and unload the scheduler if it is not. This happens,
for example, if .select_cpu() returns an invalid CPU, or if an attempt is made
to invoke scx_bpf_dispatch() with invalid enqueue flags. Furthermore, if a
task remains runnable for too long without being scheduled, sched_ext will
detect it and error out the scheduler.
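The two checks described above, validating a callback's return value and detecting a stalled runnable task, can be sketched in user-space C as follows. Everything here is a hypothetical model: the names, the exit codes, and in particular the 30 second STALL_TIMEOUT_MS threshold are assumptions made for illustration and are not taken from the patch set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS 4              /* arbitrary CPU count for this model */
#define STALL_TIMEOUT_MS 30000 /* hypothetical watchdog threshold */

enum model_exit { MODEL_OK, MODEL_ERR_BAD_CPU, MODEL_ERR_STALL };

struct model_task {
	bool runnable;
	uint64_t runnable_since_ms; /* when the task last became runnable */
};

struct model_scheduler {
	bool enabled;
	enum model_exit exit_reason;
};

/* Validate a CPU number returned by a .select_cpu()-like callback. */
enum model_exit model_check_select_cpu(int cpu)
{
	return (cpu >= 0 && cpu < NR_CPUS) ? MODEL_OK : MODEL_ERR_BAD_CPU;
}

/* Flag a task that has been runnable, but not run, for too long. */
enum model_exit model_check_stall(const struct model_task *t, uint64_t now_ms)
{
	assert(now_ms >= t->runnable_since_ms);

	if (t->runnable && now_ms - t->runnable_since_ms > STALL_TIMEOUT_MS)
		return MODEL_ERR_STALL;
	return MODEL_OK;
}

/* On the first error, record the reason and disable the scheduler. */
void model_report(struct model_scheduler *s, enum model_exit r)
{
	if (r != MODEL_OK && s->enabled) {
		s->enabled = false;
		s->exit_reason = r;
	}
}
```

Only the first failure is recorded, so later errors from an already-disabled scheduler do not overwrite the original exit reason.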
Closing Thoughts
----------------
Both Meta and Google have experimented quite a lot with schedulers in the last
several years. Google has benchmarked various workloads using user space
scheduling, and has achieved performance wins by trading off generality for
application specific needs. At Meta, we have not yet deployed sched_ext on any
production workloads, though our preliminary experiments indicate that
sched_ext would provide significant performance wins when deployed at scale.
If successfully upstreamed, we expect to leverage it extensively to run
various experiments and develop customized schedulers for a number of critical
workloads.
In closing, both Meta and Google believe that sched_ext will significantly
evolve how the broader community explores the scheduling problem space,
empowering continued improvement to the in-kernel scheduler, while also
enabling targeted policies for custom applications. We’ll be able to
experiment more easily and quickly, explore uncharted areas, and deploy emergency
scheduler changes when necessary. The same applies to anyone who wants to work
on the scheduler, including academia and specialized industries. sched_ext
will push forward the state of the art when it comes to scheduling and
performance in Linux.
Written By
----------
David Vernet <[email protected]>
Josh Don <[email protected]>
Tejun Heo <[email protected]>
Barret Rhoden <[email protected]>
Supported By
------------
Paul Turner <[email protected]>
Neel Natu <[email protected]>
Patrick Bellasi <[email protected]>
Hao Luo <[email protected]>
Dimitrios Skarlatos <[email protected]>
Patchset
--------
This patchset is on top of bpf/for-next as of 2022-11-14:
de763fbb2c5b ("Merge branch 'libbpf: Fixed various checkpatch issues'")
and contains the following patches:
NOTE: The doc added by 0028 contains a high-level overview and might be a
good place to start.
0001-rhashtable-Allow-rhashtable-to-be-used-from-irq-safe.patch
0002-cgroup-Implement-cgroup_show_cftypes.patch
0003-BPF-Add-prog-to-bpf_struct_ops-check_member.patch
0004-sched-Allow-sched_cgroup_fork-to-fail-and-introduce-.patch
0005-sched-Add-sched_class-reweight_task.patch
0006-sched-Add-sched_class-switching_to-and-expose-check_.patch
0007-sched-Factor-out-cgroup-weight-conversion-functions.patch
0008-sched-Expose-css_tg-and-__setscheduler_prio-in-kerne.patch
0009-sched-Enumerate-CPU-cgroup-file-types.patch
0010-sched-Add-reason-to-sched_class-rq_-on-off-line.patch
0011-sched-Add-reason-to-sched_move_task.patch
0012-sched-Add-normal_policy.patch
0013-sched_ext-Add-boilerplate-for-extensible-scheduler-c.patch
0014-sched_ext-Implement-BPF-extensible-scheduler-class.patch
0015-sched_ext-TEMPORARY-Add-temporary-workaround-kfunc-h.patch
0016-sched_ext-Add-scx_example_dummy-and-scx_example_qmap.patch
0017-sched_ext-Add-sysrq-S-which-disables-the-BPF-schedul.patch
0018-sched_ext-Implement-runnable-task-stall-watchdog.patch
0019-sched_ext-Allow-BPF-schedulers-to-disallow-specific-.patch
0020-sched_ext-Allow-BPF-schedulers-to-switch-all-eligibl.patch
0021-sched_ext-Implement-scx_bpf_kick_cpu-and-task-preemp.patch
0022-sched_ext-Add-task-state-tracking-operations.patch
0023-sched_ext-Implement-tickless-support.patch
0024-sched_ext-Add-cgroup-support.patch
0025-sched_ext-Implement-SCX_KICK_WAIT.patch
0026-sched_ext-Implement-sched_ext_ops.cpu_acquire-releas.patch
0027-sched_ext-Implement-sched_ext_ops.cpu_online-offline.patch
0028-sched_ext-Add-Documentation-scheduler-sched-ext.rst.patch
0029-sched_ext-Add-a-basic-userland-vruntime-scheduler.patch
0030-BPF-TEMPORARY-Nerf-BTF-scalar-value-check.patch
0031-sched_ext-Add-a-rust-userspace-hybrid-example-schedu.patch
0001-0003: Misc prep.
0004-0012: Scheduler prep.
0013-0016: sched_ext core implementation and a couple of example BPF schedulers.
0017-0020: Utility features including safety mechanisms and switch-all.
0021-0023: Kicking and preempting other CPUs, task state transition tracking
and tickless support. Demonstrated with an example central
scheduler which makes all scheduling decisions on one CPU.
0024-0026: cgroup support and the ability to wait for other CPUs after
kicking them. Demonstrated with an example pair scheduler which
guarantees that a hyperthread pair always executes tasks from the
same cgroup at any given time.
0027 : Add CPU hotplug callbacks.
0028 : Add documentation.
0029-0031: Add two example schedulers. One demonstrating deferring most
scheduling decisions to userland. The other demonstrating a
hybrid approach where load balancing decisions are made by
userspace written in rust.
0015 and 0030 are temporary patches to work around missing BPF features.
0014 and 0023 also contain such workarounds.
The patchset is also available in the following git branch:
https://github.com/htejun/sched_ext sched_ext
diffstat follows.
Documentation/scheduler/index.rst | 1
Documentation/scheduler/sched-ext.rst | 230 +++
drivers/tty/sysrq.c | 1
include/asm-generic/vmlinux.lds.h | 1
include/linux/bpf.h | 3
include/linux/cgroup-defs.h | 8
include/linux/cgroup.h | 1
include/linux/rhashtable.h | 51
include/linux/sched.h | 5
include/linux/sched/ext.h | 651 ++++++++
include/linux/sched/task.h | 3
include/uapi/linux/sched.h | 1
init/Kconfig | 5
init/init_task.c | 12
kernel/Kconfig.preempt | 4
kernel/bpf/bpf_struct_ops_types.h | 4
kernel/bpf/btf.c | 5
kernel/bpf/verifier.c | 2
kernel/cgroup/cgroup.c | 97 +
kernel/fork.c | 17
kernel/sched/autogroup.c | 4
kernel/sched/build_policy.c | 5
kernel/sched/core.c | 298 +++-
kernel/sched/deadline.c | 4
kernel/sched/debug.c | 6
kernel/sched/ext.c | 3710 ++++++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/ext.h | 252 +++
kernel/sched/fair.c | 9
kernel/sched/idle.c | 2
kernel/sched/rt.c | 4
kernel/sched/sched.h | 127 +
kernel/sched/topology.c | 4
lib/rhashtable.c | 16
net/ipv4/bpf_tcp_ca.c | 3
tools/sched_ext/.gitignore | 8
tools/sched_ext/Makefile | 211 ++
tools/sched_ext/atropos/.gitignore | 3
tools/sched_ext/atropos/Cargo.toml | 34
tools/sched_ext/atropos/build.rs | 70
tools/sched_ext/atropos/rustfmt.toml | 8
tools/sched_ext/atropos/src/bpf/atropos.bpf.c | 501 ++++++
tools/sched_ext/atropos/src/bpf/atropos.h | 38
tools/sched_ext/atropos/src/main.rs | 648 ++++++++
tools/sched_ext/atropos/src/oss/atropos_sys.rs | 10
tools/sched_ext/atropos/src/oss/mod.rs | 29
tools/sched_ext/atropos/src/util.rs | 24
tools/sched_ext/gnu/stubs.h | 1
tools/sched_ext/scx_common.bpf.h | 120 +
tools/sched_ext/scx_example_central.bpf.c | 377 +++++
tools/sched_ext/scx_example_central.c | 92 +
tools/sched_ext/scx_example_dummy.bpf.c | 67
tools/sched_ext/scx_example_dummy.c | 97 +
tools/sched_ext/scx_example_pair.bpf.c | 645 ++++++++
tools/sched_ext/scx_example_pair.c | 143 +
tools/sched_ext/scx_example_pair.h | 10
tools/sched_ext/scx_example_qmap.bpf.c | 288 +++
tools/sched_ext/scx_example_qmap.c | 101 +
tools/sched_ext/scx_example_userland.bpf.c | 265 +++
tools/sched_ext/scx_example_userland.c | 403 +++++
tools/sched_ext/scx_example_userland_common.h | 19
tools/sched_ext/user_exit_info.h | 50
61 files changed, 9672 insertions(+), 136 deletions(-)
Thanks.
From: David Vernet <[email protected]>
This patch adds a new scx_example_userland BPF scheduler that implements a
fairly unsophisticated sorted-list vruntime scheduler in userland to
demonstrate how most scheduling decisions can be delegated to userland. The
scheduler doesn't implement load balancing, and treats all tasks as part of
a single domain.
Signed-off-by: David Vernet <[email protected]>
Reviewed-by: Tejun Heo <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
---
tools/sched_ext/.gitignore | 1 +
tools/sched_ext/Makefile | 10 +-
tools/sched_ext/scx_example_userland.bpf.c | 265 ++++++++++++
tools/sched_ext/scx_example_userland.c | 403 ++++++++++++++++++
tools/sched_ext/scx_example_userland_common.h | 19 +
5 files changed, 696 insertions(+), 2 deletions(-)
create mode 100644 tools/sched_ext/scx_example_userland.bpf.c
create mode 100644 tools/sched_ext/scx_example_userland.c
create mode 100644 tools/sched_ext/scx_example_userland_common.h
diff --git a/tools/sched_ext/.gitignore b/tools/sched_ext/.gitignore
index ebc34dcf925b..75a536dfebfc 100644
--- a/tools/sched_ext/.gitignore
+++ b/tools/sched_ext/.gitignore
@@ -2,6 +2,7 @@ scx_example_dummy
scx_example_qmap
scx_example_central
scx_example_pair
+scx_example_userland
*.skel.h
*.subskel.h
/tools/
diff --git a/tools/sched_ext/Makefile b/tools/sched_ext/Makefile
index 45ab39139afc..563b54333ab1 100644
--- a/tools/sched_ext/Makefile
+++ b/tools/sched_ext/Makefile
@@ -114,7 +114,8 @@ BPF_CFLAGS = -g -D__TARGET_ARCH_$(SRCARCH) \
-Wno-compare-distinct-pointer-types \
-O2 -mcpu=v3
-all: scx_example_dummy scx_example_qmap scx_example_central scx_example_pair
+all: scx_example_dummy scx_example_qmap scx_example_central scx_example_pair \
+ scx_example_userland
# sort removes libbpf duplicates when not cross-building
MAKE_DIRS := $(sort $(BUILD_DIR)/libbpf $(HOST_BUILD_DIR)/libbpf \
@@ -181,11 +182,16 @@ scx_example_pair: scx_example_pair.c scx_example_pair.skel.h user_exit_info.h
$(CC) $(CFLAGS) -c $< -o $@.o
$(CC) -o $@ $@.o $(HOST_BPFOBJ) $(LDFLAGS)
+scx_example_userland: scx_example_userland.c scx_example_userland.skel.h \
+ scx_example_userland_common.h user_exit_info.h
+ $(CC) $(CFLAGS) -c $< -o $@.o
+ $(CC) -o $@ $@.o $(HOST_BPFOBJ) $(LDFLAGS)
+
clean:
rm -rf $(SCRATCH_DIR) $(HOST_SCRATCH_DIR)
rm -f *.o *.bpf.o *.skel.h *.subskel.h
rm -f scx_example_dummy scx_example_qmap scx_example_central \
- scx_example_pair
+ scx_example_pair scx_example_userland
.PHONY: all clean
diff --git a/tools/sched_ext/scx_example_userland.bpf.c b/tools/sched_ext/scx_example_userland.bpf.c
new file mode 100644
index 000000000000..b0af532b4db0
--- /dev/null
+++ b/tools/sched_ext/scx_example_userland.bpf.c
@@ -0,0 +1,265 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * A minimal userland scheduler.
+ *
+ * In terms of scheduling, this provides two different types of behaviors:
+ * 1. A global FIFO scheduling order for _any_ tasks that have CPU affinity.
+ * All such tasks are direct-dispatched from the kernel, and are never
+ * enqueued in user space.
+ * 2. A primitive vruntime scheduler that is implemented in user space, for all
+ * other tasks.
+ *
+ * Some parts of this example user space scheduler could be implemented more
+ * efficiently using more complex and sophisticated data structures. For
+ * example, rather than using BPF_MAP_TYPE_QUEUE's,
+ * BPF_MAP_TYPE_{USER_}RINGBUF's could be used for exchanging messages between
+ * user space and kernel space. Similarly, we use a simple vruntime-sorted list
+ * in user space, but an rbtree could be used instead.
+ *
+ * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2022 Tejun Heo <[email protected]>
+ * Copyright (c) 2022 David Vernet <[email protected]>
+ */
+#include <string.h>
+#include "scx_common.bpf.h"
+#include "scx_example_userland_common.h"
+
+char _license[] SEC("license") = "GPL";
+
+const volatile bool switch_all;
+const volatile u32 num_possible_cpus;
+const volatile s32 usersched_pid;
+
+/* Stats that are printed by user space. */
+u64 nr_failed_enqueues, nr_kernel_enqueues, nr_user_enqueues;
+
+struct user_exit_info uei;
+
+/*
+ * Whether the user space scheduler needs to be scheduled due to a task being
+ * enqueued in user space.
+ */
+static bool usersched_needed;
+
+/*
+ * The map containing tasks that are enqueued in user space from the kernel.
+ *
+ * This map is drained by the user space scheduler.
+ */
+struct {
+ __uint(type, BPF_MAP_TYPE_QUEUE);
+ __uint(max_entries, USERLAND_MAX_TASKS);
+ __type(value, struct scx_userland_enqueued_task);
+} enqueued SEC(".maps");
+
+/*
+ * The map containing tasks that are dispatched to the kernel from user space.
+ *
+ * Drained by the kernel in userland_dispatch().
+ */
+struct {
+ __uint(type, BPF_MAP_TYPE_QUEUE);
+ __uint(max_entries, USERLAND_MAX_TASKS);
+ __type(value, s32);
+} dispatched SEC(".maps");
+
+/* Per-task scheduling context */
+struct task_ctx {
+ bool force_local; /* Dispatch directly to local DSQ */
+};
+
+/* Map that contains task-local storage. */
+struct {
+ __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
+ __uint(map_flags, BPF_F_NO_PREALLOC);
+ __type(key, int);
+ __type(value, struct task_ctx);
+} task_ctx_stor SEC(".maps");
+
+static bool is_usersched_task(const struct task_struct *p)
+{
+ return p->pid == usersched_pid;
+}
+
+static bool keep_in_kernel(const struct task_struct *p)
+{
+ return p->nr_cpus_allowed < num_possible_cpus;
+}
+
+static struct task_struct *usersched_task(void)
+{
+ struct task_struct *p;
+
+ p = scx_bpf_find_task_by_pid(usersched_pid);
+ /*
+ * Should never happen -- the usersched task should always be managed
+ * by sched_ext.
+ */
+ if (!p)
+ scx_bpf_error("Failed to find usersched task %d", usersched_pid);
+
+ /*
+ * While p should never be NULL, have logic to return current so the
+ * caller doesn't have to bother with checking for NULL.
+ */
+ return p ?: bpf_get_current_task_btf();
+}
+
+s32 BPF_STRUCT_OPS(userland_select_cpu, struct task_struct *p,
+ s32 prev_cpu, u64 wake_flags)
+{
+ if (keep_in_kernel(p)) {
+ s32 cpu;
+ struct task_ctx *tctx;
+
+ tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
+ if (!tctx) {
+ scx_bpf_error("Failed to look up task-local storage for %s", p->comm);
+ return -ESRCH;
+ }
+
+ if (p->nr_cpus_allowed == 1 ||
+ scx_bpf_test_and_clear_cpu_idle(prev_cpu)) {
+ tctx->force_local = true;
+ return prev_cpu;
+ }
+
+ cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr);
+ if (cpu >= 0) {
+ tctx->force_local = true;
+ return cpu;
+ }
+ }
+
+ return prev_cpu;
+}
+
+static void dispatch_user_scheduler(void)
+{
+ usersched_needed = false;
+ scx_bpf_dispatch(usersched_task(), SCX_DSQ_GLOBAL, SCX_SLICE_DFL, 0);
+}
+
+static void enqueue_task_in_user_space(struct task_struct *p, u64 enq_flags)
+{
+ struct scx_userland_enqueued_task task;
+
+ memset(&task, 0, sizeof(task));
+ task.pid = p->pid;
+ task.sum_exec_runtime = p->se.sum_exec_runtime;
+ task.weight = p->scx.weight;
+
+ if (bpf_map_push_elem(&enqueued, &task, 0)) {
+ /*
+ * If we fail to enqueue the task in user space, put it
+ * directly on the global DSQ.
+ */
+ __sync_fetch_and_add(&nr_failed_enqueues, 1);
+ scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
+ } else {
+ __sync_fetch_and_add(&nr_user_enqueues, 1);
+ usersched_needed = true;
+ }
+}
+
+void BPF_STRUCT_OPS(userland_enqueue, struct task_struct *p, u64 enq_flags)
+{
+ if (keep_in_kernel(p)) {
+ u64 dsq_id = SCX_DSQ_GLOBAL;
+ struct task_ctx *tctx;
+
+ tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
+ if (!tctx) {
+ scx_bpf_error("Failed to lookup task ctx for %s", p->comm);
+ return;
+ }
+
+ if (tctx->force_local)
+ dsq_id = SCX_DSQ_LOCAL;
+ tctx->force_local = false;
+ scx_bpf_dispatch(p, dsq_id, SCX_SLICE_DFL, enq_flags);
+ __sync_fetch_and_add(&nr_kernel_enqueues, 1);
+ return;
+ } else if (!is_usersched_task(p)) {
+ enqueue_task_in_user_space(p, enq_flags);
+ }
+}
+
+static int drain_dispatch_q_loopfn(u32 idx, void *data)
+{
+ s32 cpu = *(s32 *)data;
+ s32 pid;
+ struct task_struct *p;
+
+ if (bpf_map_pop_elem(&dispatched, &pid))
+ return 1;
+
+ /*
+ * The task could have exited by the time we get around to dispatching
+ * it. Treat this as a normal occurrence, and simply move onto the next
+ * iteration.
+ */
+ p = scx_bpf_find_task_by_pid(pid);
+ if (!p)
+ return 0;
+
+ scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, 0);
+ return 0;
+}
+
+void BPF_STRUCT_OPS(userland_dispatch, s32 cpu, struct task_struct *prev)
+{
+ if (usersched_needed)
+ dispatch_user_scheduler();
+
+ /* XXX: Use an iterator when it's available. */
+ bpf_loop(4096, drain_dispatch_q_loopfn, &cpu, 0);
+}
+
+s32 BPF_STRUCT_OPS(userland_prep_enable, struct task_struct *p,
+ struct scx_enable_args *args)
+{
+ if (bpf_task_storage_get(&task_ctx_stor, p, 0,
+ BPF_LOCAL_STORAGE_GET_F_CREATE))
+ return 0;
+ else
+ return -ENOMEM;
+}
+
+s32 BPF_STRUCT_OPS(userland_init)
+{
+ int ret;
+
+ if (num_possible_cpus == 0) {
+ scx_bpf_error("User scheduler # CPUs uninitialized (%d)",
+ num_possible_cpus);
+ return -EINVAL;
+ }
+
+ if (usersched_pid <= 0) {
+ scx_bpf_error("User scheduler pid uninitialized (%d)",
+ usersched_pid);
+ return -EINVAL;
+ }
+
+ if (switch_all)
+ scx_bpf_switch_all();
+ return 0;
+}
+
+void BPF_STRUCT_OPS(userland_exit, struct scx_exit_info *ei)
+{
+ uei_record(&uei, ei);
+}
+
+SEC(".struct_ops")
+struct sched_ext_ops userland_ops = {
+ .select_cpu = (void *)userland_select_cpu,
+ .enqueue = (void *)userland_enqueue,
+ .dispatch = (void *)userland_dispatch,
+ .prep_enable = (void *)userland_prep_enable,
+ .init = (void *)userland_init,
+ .exit = (void *)userland_exit,
+ .timeout_ms = 3000,
+ .name = "userland",
+};
diff --git a/tools/sched_ext/scx_example_userland.c b/tools/sched_ext/scx_example_userland.c
new file mode 100644
index 000000000000..4ddd257b9e42
--- /dev/null
+++ b/tools/sched_ext/scx_example_userland.c
@@ -0,0 +1,403 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * A demo sched_ext user space scheduler which provides vruntime semantics
+ * using a simple ordered-list implementation.
+ *
+ * Each CPU in the system resides in a single, global domain. This precludes
+ * the need to do any load balancing between domains. The scheduler could
+ * easily be extended to support multiple domains, with load balancing
+ * happening in user space.
+ *
+ * Any task which has any CPU affinity is scheduled entirely in BPF. This
+ * program only schedules tasks which may run on any CPU.
+ *
+ * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2022 Tejun Heo <[email protected]>
+ * Copyright (c) 2022 David Vernet <[email protected]>
+ */
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <unistd.h>
+#include <sched.h>
+#include <signal.h>
+#include <assert.h>
+#include <libgen.h>
+#include <pthread.h>
+#include <bpf/bpf.h>
+#include <sys/mman.h>
+#include <sys/queue.h>
+#include <sys/syscall.h>
+
+#include "user_exit_info.h"
+#include "scx_example_userland_common.h"
+#include "scx_example_userland.skel.h"
+
+const char help_fmt[] =
+"A minimal userland sched_ext scheduler.\n"
+"\n"
+"See the top-level comment in .bpf.c for more details.\n"
+"\n"
+"Usage: %s [-a] [-b BATCH] [-h]\n"
+"\n"
+" -a Switch all tasks\n"
+" -b The number of tasks to batch when dispatching.\n"
+" Defaults to 8\n"
+" -h Display this help and exit\n";
+
+/* Defined in UAPI */
+#define SCHED_EXT 7
+
+/* Number of tasks to batch when dispatching from user space to the kernel. */
+static __u32 batch_size = 8;
+
+static volatile int exit_req;
+static int enqueued_fd, dispatched_fd;
+
+static struct scx_example_userland *skel;
+static struct bpf_link *ops_link;
+
+/* Stats collected in user space. */
+static __u64 nr_vruntime_enqueues, nr_vruntime_dispatches;
+
+/* The data structure containing tasks that are enqueued in user space. */
+struct enqueued_task {
+ LIST_ENTRY(enqueued_task) entries;
+ __u64 sum_exec_runtime;
+ double vruntime;
+};
+
+/*
+ * Use a vruntime-sorted list to store tasks. This could easily be extended to
+ * a more optimal data structure, such as an rbtree as is done in CFS. We
+ * currently elect to use a sorted list to simplify the example for
+ * illustrative purposes.
+ */
+LIST_HEAD(listhead, enqueued_task);
+
+/*
+ * A vruntime-sorted list of tasks. The head of the list contains the task with
+ * the lowest vruntime. That is, the task that has the "highest" claim to be
+ * scheduled.
+ */
+static struct listhead vruntime_head = LIST_HEAD_INITIALIZER(vruntime_head);
+
+/*
+ * The statically allocated array of tasks. We use a statically allocated list
+ * here to avoid having to allocate on the enqueue path, which could cause a
+ * deadlock. A more substantive user space scheduler could e.g. provide a hook
+ * for newly enabled tasks that are passed to the scheduler from the
+ * .prep_enable() callback to allow the scheduler to allocate on safe paths.
+ */
+struct enqueued_task tasks[USERLAND_MAX_TASKS];
+
+static double min_vruntime;
+
+static void sigint_handler(int userland)
+{
+ exit_req = 1;
+}
+
+static __u32 task_pid(const struct enqueued_task *task)
+{
+ return ((uintptr_t)task - (uintptr_t)tasks) / sizeof(*task);
+}
+
+static int dispatch_task(s32 pid)
+{
+ int err;
+
+ err = bpf_map_update_elem(dispatched_fd, NULL, &pid, 0);
+ if (err) {
+ fprintf(stderr, "Failed to dispatch task %d\n", pid);
+ exit_req = 1;
+ } else {
+ nr_vruntime_dispatches++;
+ }
+
+ return err;
+}
+
+static struct enqueued_task *get_enqueued_task(__s32 pid)
+{
+ if (pid < 0 || pid >= USERLAND_MAX_TASKS)
+ return NULL;
+
+ return &tasks[pid];
+}
+
+static double calc_vruntime_delta(__u64 weight, __u64 delta)
+{
+ double weight_f = (double)weight / 100.0;
+ double delta_f = (double)delta;
+
+ return delta_f / weight_f;
+}
+
+static void update_enqueued(struct enqueued_task *enqueued, const struct scx_userland_enqueued_task *bpf_task)
+{
+ __u64 delta;
+
+ delta = bpf_task->sum_exec_runtime - enqueued->sum_exec_runtime;
+
+ enqueued->vruntime += calc_vruntime_delta(bpf_task->weight, delta);
+ if (min_vruntime > enqueued->vruntime)
+ enqueued->vruntime = min_vruntime;
+ enqueued->sum_exec_runtime = bpf_task->sum_exec_runtime;
+}
+
+static int vruntime_enqueue(const struct scx_userland_enqueued_task *bpf_task)
+{
+ struct enqueued_task *curr, *enqueued, *prev;
+
+ curr = get_enqueued_task(bpf_task->pid);
+ if (!curr)
+ return ENOENT;
+
+ update_enqueued(curr, bpf_task);
+ nr_vruntime_enqueues++;
+
+ /*
+ * Enqueue the task in a vruntime-sorted list. A more optimal data
+ * structure such as an rbtree could easily be used as well. We elect
+ * to use a list here simply because it's less code, and thus the
+ * example is less convoluted and better serves to illustrate what a
+ * user space scheduler could look like.
+ */
+
+ if (LIST_EMPTY(&vruntime_head)) {
+ LIST_INSERT_HEAD(&vruntime_head, curr, entries);
+ return 0;
+ }
+
+ LIST_FOREACH(enqueued, &vruntime_head, entries) {
+ if (curr->vruntime <= enqueued->vruntime) {
+ LIST_INSERT_BEFORE(enqueued, curr, entries);
+ return 0;
+ }
+ prev = enqueued;
+ }
+
+ LIST_INSERT_AFTER(prev, curr, entries);
+
+ return 0;
+}
+
+static void drain_enqueued_map(void)
+{
+ while (1) {
+ struct scx_userland_enqueued_task task;
+ int err;
+
+ if (bpf_map_lookup_and_delete_elem(enqueued_fd, NULL, &task))
+ return;
+
+ err = vruntime_enqueue(&task);
+ if (err) {
+ fprintf(stderr, "Failed to enqueue task %d: %s\n",
+ task.pid, strerror(err));
+ exit_req = 1;
+ return;
+ }
+ }
+}
+
+static void dispatch_batch(void)
+{
+ __u32 i;
+
+ for (i = 0; i < batch_size; i++) {
+ struct enqueued_task *task;
+ int err;
+ __s32 pid;
+
+ task = LIST_FIRST(&vruntime_head);
+ if (!task)
+ return;
+
+ min_vruntime = task->vruntime;
+ pid = task_pid(task);
+ LIST_REMOVE(task, entries);
+ err = dispatch_task(pid);
+ if (err) {
+ fprintf(stderr, "Failed to dispatch task %d in %u\n",
+ pid, i);
+ return;
+ }
+ }
+}
+
+static void *run_stats_printer(void *arg)
+{
+ while (!exit_req) {
+ __u64 nr_failed_enqueues, nr_kernel_enqueues, nr_user_enqueues, total;
+
+ nr_failed_enqueues = skel->bss->nr_failed_enqueues;
+ nr_kernel_enqueues = skel->bss->nr_kernel_enqueues;
+ nr_user_enqueues = skel->bss->nr_user_enqueues;
+ total = nr_failed_enqueues + nr_kernel_enqueues + nr_user_enqueues;
+
+ printf("o-----------------------o\n");
+ printf("| BPF ENQUEUES |\n");
+ printf("|-----------------------|\n");
+ printf("| kern: %10llu |\n", nr_kernel_enqueues);
+ printf("| user: %10llu |\n", nr_user_enqueues);
+ printf("| failed: %10llu |\n", nr_failed_enqueues);
+ printf("| -------------------- |\n");
+ printf("| total: %10llu |\n", total);
+ printf("| |\n");
+ printf("|-----------------------|\n");
+ printf("| VRUNTIME / USER |\n");
+ printf("|-----------------------|\n");
+ printf("| enq: %10llu |\n", nr_vruntime_enqueues);
+ printf("| disp: %10llu |\n", nr_vruntime_dispatches);
+ printf("o-----------------------o\n");
+ printf("\n\n");
+ sleep(1);
+ }
+
+ return NULL;
+}
+
+static int spawn_stats_thread(void)
+{
+ pthread_t stats_printer;
+
+ return pthread_create(&stats_printer, NULL, run_stats_printer, NULL);
+}
+
+static int bootstrap(int argc, char **argv)
+{
+ int err;
+ int opt;
+ struct sched_param sched_param = {
+ .sched_priority = sched_get_priority_max(SCHED_EXT),
+ };
+ bool switch_all = false;
+
+ signal(SIGINT, sigint_handler);
+ signal(SIGTERM, sigint_handler);
+ libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
+
+ /*
+ * Enforce that the user scheduler task is managed by sched_ext. The
+ * task eagerly drains the list of enqueued tasks in its main work
+ * loop, and then yields the CPU. The BPF scheduler only schedules the
+ * user space scheduler task when at least one other task in the system
+ * needs to be scheduled.
+ */
+ err = syscall(__NR_sched_setscheduler, getpid(), SCHED_EXT, &sched_param);
+ if (err) {
+ fprintf(stderr, "Failed to set scheduler to SCHED_EXT: %s\n", strerror(errno));
+ return errno;
+ }
+
+ while ((opt = getopt(argc, argv, "ahb:")) != -1) {
+ switch (opt) {
+ case 'a':
+ switch_all = true;
+ break;
+ case 'b':
+ batch_size = strtoul(optarg, NULL, 0);
+ break;
+ default:
+ fprintf(stderr, help_fmt, basename(argv[0]));
+ exit(opt != 'h');
+ }
+ }
+
+ /*
+ * It's not always safe to allocate in a user space scheduler, as an
+ * enqueued task could hold a lock that we require in order to be able
+ * to allocate.
+ */
+ err = mlockall(MCL_CURRENT | MCL_FUTURE);
+ if (err) {
+ fprintf(stderr, "Failed to prefault and lock address space: %s\n",
+ strerror(errno));
+ return errno;
+ }
+
+ skel = scx_example_userland__open();
+ if (!skel) {
+ fprintf(stderr, "Failed to open scheduler: %s\n", strerror(errno));
+ return errno;
+ }
+ skel->rodata->num_possible_cpus = libbpf_num_possible_cpus();
+ assert(skel->rodata->num_possible_cpus > 0);
+ skel->rodata->usersched_pid = getpid();
+ assert(skel->rodata->usersched_pid > 0);
+ skel->rodata->switch_all = switch_all;
+
+ err = scx_example_userland__load(skel);
+ if (err) {
+ fprintf(stderr, "Failed to load scheduler: %s\n", strerror(-err));
+ goto destroy_skel;
+ }
+
+ enqueued_fd = bpf_map__fd(skel->maps.enqueued);
+ dispatched_fd = bpf_map__fd(skel->maps.dispatched);
+ assert(enqueued_fd > 0);
+ assert(dispatched_fd > 0);
+
+ err = spawn_stats_thread();
+ if (err) {
+ fprintf(stderr, "Failed to spawn stats thread: %s\n", strerror(err));
+ goto destroy_skel;
+ }
+
+ ops_link = bpf_map__attach_struct_ops(skel->maps.userland_ops);
+ if (!ops_link) {
+ fprintf(stderr, "Failed to attach struct ops: %s\n", strerror(errno));
+ err = errno;
+ goto destroy_skel;
+ }
+
+ return 0;
+
+destroy_skel:
+ scx_example_userland__destroy(skel);
+ exit_req = 1;
+ return err;
+}
+
+static void sched_main_loop(void)
+{
+ while (!exit_req) {
+ /*
+ * Perform the following work in the main user space scheduler
+ * loop:
+ *
+ * 1. Drain all tasks from the enqueued map, and enqueue them
+ * to the vruntime sorted list.
+ *
+ * 2. Dispatch a batch of tasks from the vruntime sorted list
+ * down to the kernel.
+ *
+ * 3. Yield the CPU back to the system. The BPF scheduler will
+ * reschedule the user space scheduler once another task has
+ * been enqueued to user space.
+ */
+ drain_enqueued_map();
+ dispatch_batch();
+ sched_yield();
+ }
+}
+
+int main(int argc, char **argv)
+{
+ int err;
+
+ err = bootstrap(argc, argv);
+ if (err) {
+ fprintf(stderr, "Failed to bootstrap scheduler: %s\n", strerror(err));
+ return err;
+ }
+
+ sched_main_loop();
+
+ exit_req = 1;
+ bpf_link__destroy(ops_link);
+ uei_print(&skel->bss->uei);
+ scx_example_userland__destroy(skel);
+ return 0;
+}
diff --git a/tools/sched_ext/scx_example_userland_common.h b/tools/sched_ext/scx_example_userland_common.h
new file mode 100644
index 000000000000..639c6809c5ff
--- /dev/null
+++ b/tools/sched_ext/scx_example_userland_common.h
@@ -0,0 +1,19 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Meta, Inc */
+
+#ifndef __SCX_USERLAND_COMMON_H
+#define __SCX_USERLAND_COMMON_H
+
+#define USERLAND_MAX_TASKS 8192
+
+/*
+ * An instance of a task that has been enqueued by the kernel for consumption
+ * by a user space global scheduler thread.
+ */
+struct scx_userland_enqueued_task {
+ __s32 pid;
+ u64 sum_exec_runtime;
+ u64 weight;
+};
+
+#endif // __SCX_USERLAND_COMMON_H
--
2.38.1
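
As a rough model of the handoff in the scx_example_userland patch above, the
sketch below uses two small fixed-capacity FIFOs in place of the "enqueued"
and "dispatched" BPF_MAP_TYPE_QUEUE maps. It is plain user space C with
hypothetical demo_* names and a deliberately tiny capacity. It leaves out the
vruntime sort and only shows the two behaviors that matter for the handoff: a
failed push is counted so that the caller can fall back to the global DSQ, and
each loop iteration moves at most one batch toward the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_QUEUE_CAP 4	/* hypothetical; stands in for USERLAND_MAX_TASKS */

/* Fixed-capacity FIFO standing in for a BPF_MAP_TYPE_QUEUE map. */
struct demo_queue {
	int32_t pids[DEMO_QUEUE_CAP];
	size_t head;	/* index of the oldest entry */
	size_t len;	/* number of queued entries */
};

/* Counters modeled on nr_user_enqueues and nr_failed_enqueues. */
struct demo_stats {
	uint64_t nr_user_enqueues;
	uint64_t nr_failed_enqueues;
};

/* Push a pid; returns false when the queue is full. */
bool demo_push(struct demo_queue *q, int32_t pid)
{
	if (q->len == DEMO_QUEUE_CAP)
		return false;
	q->pids[(q->head + q->len) % DEMO_QUEUE_CAP] = pid;
	q->len++;
	return true;
}

/* Pop the oldest pid; returns false when the queue is empty. */
bool demo_pop(struct demo_queue *q, int32_t *pid)
{
	if (q->len == 0)
		return false;
	*pid = q->pids[q->head];
	q->head = (q->head + 1) % DEMO_QUEUE_CAP;
	q->len--;
	return true;
}

/*
 * Kernel-side enqueue: hand the task to user space if there is room.
 * On failure the caller is expected to dispatch the task itself.
 */
bool demo_enqueue(struct demo_queue *enqueued, struct demo_stats *st,
		  int32_t pid)
{
	if (!demo_push(enqueued, pid)) {
		st->nr_failed_enqueues++;
		return false;
	}
	st->nr_user_enqueues++;
	return true;
}

/* Move up to batch_size pids to the dispatched queue; returns the count. */
size_t demo_dispatch_batch(struct demo_queue *enqueued,
			   struct demo_queue *dispatched, size_t batch_size)
{
	size_t n = 0;
	int32_t pid;

	while (n < batch_size && dispatched->len < DEMO_QUEUE_CAP &&
	       demo_pop(enqueued, &pid)) {
		bool ok = demo_push(dispatched, pid);

		assert(ok);	/* room was checked in the loop condition */
		(void)ok;
		n++;
	}
	return n;
}
```

Bounding both queues keeps the enqueue path free of allocation, which is the
same reason the example scheduler uses a statically sized task array.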
rhashtable currently does only bh-safe synchronization, which makes it
impossible to use from irq-safe contexts. Switch it to irq-safe
synchronization to remove the restriction.
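
To illustrate why the lock helpers need a flags argument for this, here is a
small user space model. It is not kernel code: the mock_* names are
hypothetical and the interrupt state is just a global boolean. The point it
tries to show is that the unlock side has to restore whatever state the caller
had on entry instead of unconditionally re-enabling, so the saved state must
travel from the lock call to the matching unlock call.

```c
#include <assert.h>
#include <stdbool.h>

/* Mock CPU interrupt state; purely illustrative. */
bool mock_irqs_enabled = true;

/* Model of a save-and-disable primitive: remember the state, then disable. */
void mock_irq_save(unsigned long *flags)
{
	*flags = mock_irqs_enabled;
	mock_irqs_enabled = false;
}

/* Model of the matching restore: return to whatever state was saved. */
void mock_irq_restore(const unsigned long *flags)
{
	mock_irqs_enabled = (*flags != 0);
}

/* Bucket lock in the shape of the new helpers: flags passed by pointer. */
void mock_bucket_lock(unsigned long *flags)
{
	mock_irq_save(flags);
	/* the per-bucket bit spinlock would be taken here */
}

void mock_bucket_unlock(unsigned long *flags)
{
	/* the per-bucket bit spinlock would be released here */
	mock_irq_restore(flags);
}

/* Walk through the two cases a caller can be in. */
void mock_demo(void)
{
	unsigned long outer, inner;

	/* Case 1: caller has interrupts enabled. */
	mock_bucket_lock(&inner);
	assert(!mock_irqs_enabled);
	mock_bucket_unlock(&inner);
	assert(mock_irqs_enabled);

	/* Case 2: caller already runs with interrupts disabled. */
	mock_irq_save(&outer);
	mock_bucket_lock(&inner);
	mock_bucket_unlock(&inner);
	assert(!mock_irqs_enabled);	/* still disabled for the caller */
	mock_irq_restore(&outer);
	assert(mock_irqs_enabled);
}
```

In the second case an unlock that simply re-enabled interrupts would break the
caller's critical section, which is what the per-call flags avoid.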
Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
include/linux/rhashtable.h | 51 ++++++++++++++++++++++----------------
lib/rhashtable.c | 16 +++++++-----
2 files changed, 39 insertions(+), 28 deletions(-)
diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 68dab3e08aad..785fba3464f2 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -324,28 +324,31 @@ static inline struct rhash_lock_head __rcu **rht_bucket_insert(
*/
static inline void rht_lock(struct bucket_table *tbl,
- struct rhash_lock_head __rcu **bkt)
+ struct rhash_lock_head __rcu **bkt,
+ unsigned long *flags)
{
- local_bh_disable();
+ local_irq_save(*flags);
bit_spin_lock(0, (unsigned long *)bkt);
lock_map_acquire(&tbl->dep_map);
}
static inline void rht_lock_nested(struct bucket_table *tbl,
struct rhash_lock_head __rcu **bucket,
- unsigned int subclass)
+ unsigned int subclass,
+ unsigned long *flags)
{
- local_bh_disable();
+ local_irq_save(*flags);
bit_spin_lock(0, (unsigned long *)bucket);
lock_acquire_exclusive(&tbl->dep_map, subclass, 0, NULL, _THIS_IP_);
}
static inline void rht_unlock(struct bucket_table *tbl,
- struct rhash_lock_head __rcu **bkt)
+ struct rhash_lock_head __rcu **bkt,
+ unsigned long *flags)
{
lock_map_release(&tbl->dep_map);
bit_spin_unlock(0, (unsigned long *)bkt);
- local_bh_enable();
+ local_irq_restore(*flags);
}
static inline struct rhash_head *__rht_ptr(
@@ -393,7 +396,8 @@ static inline void rht_assign_locked(struct rhash_lock_head __rcu **bkt,
static inline void rht_assign_unlock(struct bucket_table *tbl,
struct rhash_lock_head __rcu **bkt,
- struct rhash_head *obj)
+ struct rhash_head *obj,
+ unsigned long *flags)
{
if (rht_is_a_nulls(obj))
obj = NULL;
@@ -401,7 +405,7 @@ static inline void rht_assign_unlock(struct bucket_table *tbl,
rcu_assign_pointer(*bkt, (void *)obj);
preempt_enable();
__release(bitlock);
- local_bh_enable();
+ local_irq_restore(*flags);
}
/**
@@ -706,6 +710,7 @@ static inline void *__rhashtable_insert_fast(
struct rhash_head __rcu **pprev;
struct bucket_table *tbl;
struct rhash_head *head;
+ unsigned long flags;
unsigned int hash;
int elasticity;
void *data;
@@ -720,11 +725,11 @@ static inline void *__rhashtable_insert_fast(
if (!bkt)
goto out;
pprev = NULL;
- rht_lock(tbl, bkt);
+ rht_lock(tbl, bkt, &flags);
if (unlikely(rcu_access_pointer(tbl->future_tbl))) {
slow_path:
- rht_unlock(tbl, bkt);
+ rht_unlock(tbl, bkt, &flags);
rcu_read_unlock();
return rhashtable_insert_slow(ht, key, obj);
}
@@ -756,9 +761,9 @@ static inline void *__rhashtable_insert_fast(
RCU_INIT_POINTER(list->rhead.next, head);
if (pprev) {
rcu_assign_pointer(*pprev, obj);
- rht_unlock(tbl, bkt);
+ rht_unlock(tbl, bkt, &flags);
} else
- rht_assign_unlock(tbl, bkt, obj);
+ rht_assign_unlock(tbl, bkt, obj, &flags);
data = NULL;
goto out;
}
@@ -785,7 +790,7 @@ static inline void *__rhashtable_insert_fast(
}
atomic_inc(&ht->nelems);
- rht_assign_unlock(tbl, bkt, obj);
+ rht_assign_unlock(tbl, bkt, obj, &flags);
if (rht_grow_above_75(ht, tbl))
schedule_work(&ht->run_work);
@@ -797,7 +802,7 @@ static inline void *__rhashtable_insert_fast(
return data;
out_unlock:
- rht_unlock(tbl, bkt);
+ rht_unlock(tbl, bkt, &flags);
goto out;
}
@@ -991,6 +996,7 @@ static inline int __rhashtable_remove_fast_one(
struct rhash_lock_head __rcu **bkt;
struct rhash_head __rcu **pprev;
struct rhash_head *he;
+ unsigned long flags;
unsigned int hash;
int err = -ENOENT;
@@ -999,7 +1005,7 @@ static inline int __rhashtable_remove_fast_one(
if (!bkt)
return -ENOENT;
pprev = NULL;
- rht_lock(tbl, bkt);
+ rht_lock(tbl, bkt, &flags);
rht_for_each_from(he, rht_ptr(bkt, tbl, hash), tbl, hash) {
struct rhlist_head *list;
@@ -1043,14 +1049,14 @@ static inline int __rhashtable_remove_fast_one(
if (pprev) {
rcu_assign_pointer(*pprev, obj);
- rht_unlock(tbl, bkt);
+ rht_unlock(tbl, bkt, &flags);
} else {
- rht_assign_unlock(tbl, bkt, obj);
+ rht_assign_unlock(tbl, bkt, obj, &flags);
}
goto unlocked;
}
- rht_unlock(tbl, bkt);
+ rht_unlock(tbl, bkt, &flags);
unlocked:
if (err > 0) {
atomic_dec(&ht->nelems);
@@ -1143,6 +1149,7 @@ static inline int __rhashtable_replace_fast(
struct rhash_lock_head __rcu **bkt;
struct rhash_head __rcu **pprev;
struct rhash_head *he;
+ unsigned long flags;
unsigned int hash;
int err = -ENOENT;
@@ -1158,7 +1165,7 @@ static inline int __rhashtable_replace_fast(
return -ENOENT;
pprev = NULL;
- rht_lock(tbl, bkt);
+ rht_lock(tbl, bkt, &flags);
rht_for_each_from(he, rht_ptr(bkt, tbl, hash), tbl, hash) {
if (he != obj_old) {
@@ -1169,15 +1176,15 @@ static inline int __rhashtable_replace_fast(
rcu_assign_pointer(obj_new->next, obj_old->next);
if (pprev) {
rcu_assign_pointer(*pprev, obj_new);
- rht_unlock(tbl, bkt);
+ rht_unlock(tbl, bkt, &flags);
} else {
- rht_assign_unlock(tbl, bkt, obj_new);
+ rht_assign_unlock(tbl, bkt, obj_new, &flags);
}
err = 0;
goto unlocked;
}
- rht_unlock(tbl, bkt);
+ rht_unlock(tbl, bkt, &flags);
unlocked:
return err;
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index e12bbfb240b8..9781572b2f31 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -231,6 +231,7 @@ static int rhashtable_rehash_one(struct rhashtable *ht,
struct rhash_head *head, *next, *entry;
struct rhash_head __rcu **pprev = NULL;
unsigned int new_hash;
+ unsigned long flags;
if (new_tbl->nest)
goto out;
@@ -253,13 +254,14 @@ static int rhashtable_rehash_one(struct rhashtable *ht,
new_hash = head_hashfn(ht, new_tbl, entry);
- rht_lock_nested(new_tbl, &new_tbl->buckets[new_hash], SINGLE_DEPTH_NESTING);
+ rht_lock_nested(new_tbl, &new_tbl->buckets[new_hash],
+ SINGLE_DEPTH_NESTING, &flags);
head = rht_ptr(new_tbl->buckets + new_hash, new_tbl, new_hash);
RCU_INIT_POINTER(entry->next, head);
- rht_assign_unlock(new_tbl, &new_tbl->buckets[new_hash], entry);
+ rht_assign_unlock(new_tbl, &new_tbl->buckets[new_hash], entry, &flags);
if (pprev)
rcu_assign_pointer(*pprev, next);
@@ -276,18 +278,19 @@ static int rhashtable_rehash_chain(struct rhashtable *ht,
{
struct bucket_table *old_tbl = rht_dereference(ht->tbl, ht);
struct rhash_lock_head __rcu **bkt = rht_bucket_var(old_tbl, old_hash);
+ unsigned long flags;
int err;
if (!bkt)
return 0;
- rht_lock(old_tbl, bkt);
+ rht_lock(old_tbl, bkt, &flags);
while (!(err = rhashtable_rehash_one(ht, bkt, old_hash)))
;
if (err == -ENOENT)
err = 0;
- rht_unlock(old_tbl, bkt);
+ rht_unlock(old_tbl, bkt, &flags);
return err;
}
@@ -590,6 +593,7 @@ static void *rhashtable_try_insert(struct rhashtable *ht, const void *key,
struct bucket_table *new_tbl;
struct bucket_table *tbl;
struct rhash_lock_head __rcu **bkt;
+ unsigned long flags;
unsigned int hash;
void *data;
@@ -607,7 +611,7 @@ static void *rhashtable_try_insert(struct rhashtable *ht, const void *key,
new_tbl = rht_dereference_rcu(tbl->future_tbl, ht);
data = ERR_PTR(-EAGAIN);
} else {
- rht_lock(tbl, bkt);
+ rht_lock(tbl, bkt, &flags);
data = rhashtable_lookup_one(ht, bkt, tbl,
hash, key, obj);
new_tbl = rhashtable_insert_one(ht, bkt, tbl,
@@ -615,7 +619,7 @@ static void *rhashtable_try_insert(struct rhashtable *ht, const void *key,
if (PTR_ERR(new_tbl) != -EEXIST)
data = ERR_CAST(new_tbl);
- rht_unlock(tbl, bkt);
+ rht_unlock(tbl, bkt, &flags);
}
} while (!IS_ERR_OR_NULL(new_tbl));
--
2.38.1
From: David Vernet <[email protected]>
The most common and critical way that a BPF scheduler can misbehave is by
failing to run runnable tasks for too long. This patch implements a
watchdog.
* All tasks record when they become runnable.
* A watchdog work periodically scans all runnable tasks. If any task has
stayed runnable for too long, the BPF scheduler is aborted.
* scheduler_tick() monitors whether the watchdog itself is stuck. If so, the
BPF scheduler is aborted.
Because the watchdog scans only the tasks which are currently runnable, and
usually does so very infrequently, the overhead should be negligible.
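
The three checks listed above can be modeled in user space roughly as follows.
This is only a sketch under stated assumptions: the demo_* names are
hypothetical, time is a plain millisecond counter instead of jiffies, and
treating a timeout of 0 as "unset" and rejecting values above the 30 second
cap is an assumption of the sketch, not something this message specifies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_TIMEOUT_MS 30000u	/* 30 second cap on timeout_ms */

/* Hypothetical per-task record: when the task last became runnable. */
struct demo_runnable {
	int32_t pid;
	uint64_t runnable_at_ms;
};

/* Pick the effective timeout: 0 selects the maximum, too large is refused. */
bool demo_resolve_timeout(uint32_t requested_ms, uint32_t *effective_ms)
{
	assert(effective_ms != NULL);
	if (requested_ms > DEMO_MAX_TIMEOUT_MS)
		return false;
	*effective_ms = requested_ms ? requested_ms : DEMO_MAX_TIMEOUT_MS;
	return true;
}

/* Return the first task runnable for longer than timeout_ms, or NULL. */
const struct demo_runnable *
demo_watchdog_scan(const struct demo_runnable *tasks, size_t nr,
		   uint64_t now_ms, uint32_t timeout_ms)
{
	for (size_t i = 0; i < nr; i++) {
		uint64_t at = tasks[i].runnable_at_ms;

		if (now_ms >= at && now_ms - at > timeout_ms)
			return &tasks[i];
	}
	return NULL;
}

/* Tick-side check: has the watchdog itself failed to run for too long? */
bool demo_watchdog_stuck(uint64_t last_scan_ms, uint64_t now_ms,
			 uint32_t timeout_ms)
{
	return now_ms >= last_scan_ms && now_ms - last_scan_ms > timeout_ms;
}
```

Only runnable tasks appear in the scanned array, which is what keeps the cost
of a scan proportional to the number of waiting tasks and not to the total
number of tasks.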
scx_example_qmap is updated so that it can be told to stall user and/or
kernel tasks.
A detected task stall looks like the following:
sched_ext: BPF scheduler "qmap" errored, disabling
sched_ext: runnable task stall (dbus-daemon[953] failed to run for 6.478s)
scx_check_timeout_workfn+0x10e/0x1b0
process_one_work+0x287/0x560
worker_thread+0x234/0x420
kthread+0xe9/0x100
ret_from_fork+0x1f/0x30
A detected watchdog stall:
sched_ext: BPF scheduler "qmap" errored, disabling
sched_ext: runnable task stall (watchdog failed to check in for 5.001s)
scheduler_tick+0x2eb/0x340
update_process_times+0x7a/0x90
tick_sched_timer+0xd8/0x130
__hrtimer_run_queues+0x178/0x3b0
hrtimer_interrupt+0xfc/0x390
__sysvec_apic_timer_interrupt+0xb7/0x2b0
sysvec_apic_timer_interrupt+0x90/0xb0
asm_sysvec_apic_timer_interrupt+0x1b/0x20
default_idle+0x14/0x20
arch_cpu_idle+0xf/0x20
default_idle_call+0x50/0x90
do_idle+0xe8/0x240
cpu_startup_entry+0x1d/0x20
kernel_init+0x0/0x190
start_kernel+0x0/0x392
start_kernel+0x324/0x392
x86_64_start_reservations+0x2a/0x2c
x86_64_start_kernel+0x104/0x109
secondary_startup_64_no_verify+0xce/0xdb
Note that this patch exposes scx_ops_error[_type]() in kernel/sched/ext.h to
inline scx_notify_sched_tick().
Signed-off-by: David Vernet <[email protected]>
Reviewed-by: Tejun Heo <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
include/linux/sched/ext.h | 14 +++
init/init_task.c | 2 +
kernel/sched/core.c | 3 +
kernel/sched/ext.c | 133 +++++++++++++++++++++++--
kernel/sched/ext.h | 26 +++++
kernel/sched/sched.h | 1 +
tools/sched_ext/scx_example_qmap.bpf.c | 12 +++
tools/sched_ext/scx_example_qmap.c | 14 ++-
8 files changed, 193 insertions(+), 12 deletions(-)
diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index 1ec8be53057f..1a57945abea0 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -59,6 +59,7 @@ enum scx_exit_type {
SCX_EXIT_ERROR = 1024, /* runtime error, error msg contains details */
SCX_EXIT_ERROR_BPF, /* ERROR but triggered through scx_bpf_error() */
+ SCX_EXIT_ERROR_STALL, /* watchdog detected stalled runnable tasks */
};
/*
@@ -323,6 +324,15 @@ struct sched_ext_ops {
*/
u64 flags;
+ /**
+ * timeout_ms - The maximum amount of time, in milliseconds, that a
+ * runnable task should be able to wait before being scheduled. The
+ * maximum timeout may not exceed the default timeout of 30 seconds.
+ *
+ * Defaults to the maximum allowed timeout value of 30 seconds.
+ */
+ u32 timeout_ms;
+
/**
* name - BPF scheduler's name
*
@@ -357,6 +367,8 @@ enum scx_ent_flags {
SCX_TASK_OPS_PREPPED = 1 << 3, /* prepared for BPF scheduler enable */
SCX_TASK_OPS_ENABLED = 1 << 4, /* task has BPF scheduler enabled */
+ SCX_TASK_WATCHDOG_RESET = 1 << 5, /* task watchdog counter should be reset */
+
SCX_TASK_CURSOR = 1 << 6, /* iteration cursor, not a task */
};
@@ -367,11 +379,13 @@ enum scx_ent_flags {
struct sched_ext_entity {
struct scx_dispatch_q *dsq;
struct list_head dsq_node;
+ struct list_head watchdog_node;
u32 flags; /* protected by rq lock */
u32 weight;
s32 sticky_cpu;
s32 holding_cpu;
atomic64_t ops_state;
+ unsigned long runnable_at;
/* BPF scheduler modifiable fields */
diff --git a/init/init_task.c b/init/init_task.c
index bdbc663107bf..913194aab623 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -106,9 +106,11 @@ struct task_struct init_task
#ifdef CONFIG_SCHED_CLASS_EXT
.scx = {
.dsq_node = LIST_HEAD_INIT(init_task.scx.dsq_node),
+ .watchdog_node = LIST_HEAD_INIT(init_task.scx.watchdog_node),
.sticky_cpu = -1,
.holding_cpu = -1,
.ops_state = ATOMIC_INIT(0),
+ .runnable_at = INITIAL_JIFFIES,
.slice = SCX_SLICE_DFL,
},
#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index dc499f18573a..39d9ccb64f40 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4366,11 +4366,13 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
#ifdef CONFIG_SCHED_CLASS_EXT
p->scx.dsq = NULL;
INIT_LIST_HEAD(&p->scx.dsq_node);
+ INIT_LIST_HEAD(&p->scx.watchdog_node);
p->scx.flags = 0;
p->scx.weight = 0;
p->scx.sticky_cpu = -1;
p->scx.holding_cpu = -1;
atomic64_set(&p->scx.ops_state, 0);
+ p->scx.runnable_at = INITIAL_JIFFIES;
p->scx.slice = SCX_SLICE_DFL;
#endif
@@ -5514,6 +5516,7 @@ void scheduler_tick(void)
if (sched_feat(LATENCY_WARN) && resched_latency)
resched_latency_warn(cpu, resched_latency);
+ scx_notify_sched_tick();
perf_event_task_tick();
#ifdef CONFIG_SMP
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index a0057a8447cb..030175f2b1d6 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -9,6 +9,7 @@
enum scx_internal_consts {
SCX_NR_ONLINE_OPS = SCX_OP_IDX(init),
SCX_DSP_DFL_MAX_BATCH = 32,
+ SCX_MAX_RUNNABLE_TIMEOUT = 30 * MSEC_PER_SEC,
};
enum scx_ops_enable_state {
@@ -87,6 +88,23 @@ static struct scx_exit_info scx_exit_info;
static atomic64_t scx_nr_rejected = ATOMIC64_INIT(0);
+/*
+ * The maximum amount of time that a task may be runnable without being
+ * scheduled on a CPU. If this timeout is exceeded, it will trigger
+ * scx_ops_error().
+ */
+unsigned long task_runnable_timeout_ms;
+
+/*
+ * The last time the delayed work was run. This delayed work relies on
+ * ksoftirqd being able to run to service timer interrupts, so it's possible
+ * that this work itself could get wedged. To account for this, we check that
+ * it's not stalled in the timer tick, and trigger an error if it is.
+ */
+unsigned long last_timeout_check = INITIAL_JIFFIES;
+
+static struct delayed_work check_timeout_work;
+
/* idle tracking */
#ifdef CONFIG_SMP
#ifdef CONFIG_CPUMASK_OFFSTACK
@@ -148,10 +166,6 @@ static DEFINE_PER_CPU(struct consume_ctx, consume_ctx);
void scx_bpf_dispatch(struct task_struct *p, u64 dsq_id, u64 slice,
u64 enq_flags);
-__printf(2, 3) static void scx_ops_error_type(enum scx_exit_type type,
- const char *fmt, ...);
-#define scx_ops_error(fmt, args...) \
- scx_ops_error_type(SCX_EXIT_ERROR, fmt, ##args)
struct scx_task_iter {
struct sched_ext_entity cursor;
@@ -597,6 +611,27 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
dispatch_enqueue(&scx_dsq_global, p, enq_flags);
}
+static bool watchdog_task_watched(const struct task_struct *p)
+{
+ return !list_empty(&p->scx.watchdog_node);
+}
+
+static void watchdog_watch_task(struct rq *rq, struct task_struct *p)
+{
+ lockdep_assert_rq_held(rq);
+ if (p->scx.flags & SCX_TASK_WATCHDOG_RESET)
+ p->scx.runnable_at = jiffies;
+ p->scx.flags &= ~SCX_TASK_WATCHDOG_RESET;
+ list_add_tail(&p->scx.watchdog_node, &rq->scx.watchdog_list);
+}
+
+static void watchdog_unwatch_task(struct task_struct *p, bool reset_timeout)
+{
+ list_del_init(&p->scx.watchdog_node);
+ if (reset_timeout)
+ p->scx.flags |= SCX_TASK_WATCHDOG_RESET;
+}
+
static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags)
{
int sticky_cpu = p->scx.sticky_cpu;
@@ -613,9 +648,12 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags
if (unlikely(enq_flags & ENQUEUE_RESTORE) && task_current(rq, p))
sticky_cpu = cpu_of(rq);
- if (p->scx.flags & SCX_TASK_QUEUED)
+ if (p->scx.flags & SCX_TASK_QUEUED) {
+ WARN_ON_ONCE(!watchdog_task_watched(p));
return;
+ }
+ watchdog_watch_task(rq, p);
p->scx.flags |= SCX_TASK_QUEUED;
rq->scx.nr_running++;
add_nr_running(rq, 1);
@@ -628,8 +666,12 @@ static void dequeue_task_scx(struct rq *rq, struct task_struct *p, int deq_flags
struct scx_rq *scx_rq = &rq->scx;
u64 opss;
- if (!(p->scx.flags & SCX_TASK_QUEUED))
+ if (!(p->scx.flags & SCX_TASK_QUEUED)) {
+ WARN_ON_ONCE(watchdog_task_watched(p));
return;
+ }
+
+ watchdog_unwatch_task(p, false);
/* acquire ensures that we see the preceding updates on QUEUED */
opss = atomic64_read_acquire(&p->scx.ops_state);
@@ -1168,6 +1210,8 @@ static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first)
}
p->se.exec_start = rq_clock_task(rq);
+
+ watchdog_unwatch_task(p, true);
}
static void put_prev_task_scx(struct rq *rq, struct task_struct *p)
@@ -1180,11 +1224,14 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p)
*/
if (p->scx.flags & SCX_TASK_BAL_KEEP) {
p->scx.flags &= ~SCX_TASK_BAL_KEEP;
+ watchdog_watch_task(rq, p);
dispatch_enqueue(&rq->scx.local_dsq, p, SCX_ENQ_HEAD);
return;
}
if (p->scx.flags & SCX_TASK_QUEUED) {
+ watchdog_watch_task(rq, p);
+
/*
* If @p has slice left and balance_scx() didn't tag it for
* keeping, @p is getting preempted by a higher priority
@@ -1405,6 +1452,48 @@ static void reset_idle_masks(void) {}
#endif /* CONFIG_SMP */
+static bool check_rq_for_timeouts(struct rq *rq)
+{
+ struct task_struct *p;
+ unsigned long flags;
+ bool timed_out = false;
+ unsigned long timeout = msecs_to_jiffies(task_runnable_timeout_ms);
+
+ raw_spin_rq_lock_irqsave(rq, flags);
+ list_for_each_entry(p, &rq->scx.watchdog_list, scx.watchdog_node) {
+ unsigned long last_runnable = p->scx.runnable_at;
+
+ if (unlikely(time_after(jiffies, last_runnable + timeout))) {
+ u32 dur_ms = jiffies_to_msecs(jiffies - last_runnable);
+
+ scx_ops_error_type(SCX_EXIT_ERROR_STALL,
+ "%s[%d] failed to run for %u.%03us",
+ p->comm, p->pid,
+ dur_ms / 1000, dur_ms % 1000);
+ timed_out = true;
+ break;
+ }
+ }
+ raw_spin_rq_unlock_irqrestore(rq, flags);
+
+ return timed_out;
+}
+
+static void scx_check_timeout_workfn(struct work_struct *work)
+{
+ int cpu;
+
+ last_timeout_check = jiffies;
+ for_each_online_cpu(cpu) {
+ if (unlikely(check_rq_for_timeouts(cpu_rq(cpu))))
+ break;
+
+ cond_resched();
+ }
+ queue_delayed_work(system_unbound_wq, to_delayed_work(work),
+ msecs_to_jiffies(task_runnable_timeout_ms / 2));
+}
+
static void task_tick_scx(struct rq *rq, struct task_struct *curr, int queued)
{
update_curr_scx(rq);
@@ -1430,7 +1519,7 @@ static int scx_ops_prepare_task(struct task_struct *p, struct task_group *tg)
}
}
- p->scx.flags |= SCX_TASK_OPS_PREPPED;
+ p->scx.flags |= (SCX_TASK_OPS_PREPPED | SCX_TASK_WATCHDOG_RESET);
return 0;
}
@@ -1767,6 +1856,8 @@ static void scx_ops_disable_workfn(struct kthread_work *work)
break;
}
+ cancel_delayed_work_sync(&check_timeout_work);
+
switch (type) {
case SCX_EXIT_UNREG:
reason = "BPF scheduler unregistered";
@@ -1780,6 +1871,9 @@ static void scx_ops_disable_workfn(struct kthread_work *work)
case SCX_EXIT_ERROR_BPF:
reason = "scx_bpf_error";
break;
+ case SCX_EXIT_ERROR_STALL:
+ reason = "runnable task stall";
+ break;
default:
reason = "<UNKNOWN>";
}
@@ -1954,8 +2048,8 @@ static void scx_ops_error_irq_workfn(struct irq_work *irq_work)
static DEFINE_IRQ_WORK(scx_ops_error_irq_work, scx_ops_error_irq_workfn);
-__printf(2, 3) static void scx_ops_error_type(enum scx_exit_type type,
- const char *fmt, ...)
+__printf(2, 3) void scx_ops_error_type(enum scx_exit_type type,
+ const char *fmt, ...)
{
struct scx_exit_info *ei = &scx_exit_info;
int none = SCX_EXIT_NONE;
@@ -2059,6 +2153,20 @@ static int scx_ops_enable(struct sched_ext_ops *ops)
goto err_disable;
}
+ task_runnable_timeout_ms = SCX_MAX_RUNNABLE_TIMEOUT;
+ if (ops->timeout_ms) {
+ /*
+ * Excessively large timeouts should have been rejected in
+ * bpf_scx_init_member().
+ */
+ WARN_ON_ONCE(ops->timeout_ms > SCX_MAX_RUNNABLE_TIMEOUT);
+ task_runnable_timeout_ms = ops->timeout_ms;
+ }
+
+ last_timeout_check = jiffies;
+ queue_delayed_work(system_unbound_wq, &check_timeout_work,
+ msecs_to_jiffies(task_runnable_timeout_ms / 2));
+
/*
* Lock out forks before opening the floodgate so that they don't wander
* into the operations prematurely.
@@ -2316,6 +2424,11 @@ static int bpf_scx_init_member(const struct btf_type *t,
if (ret == 0)
return -EINVAL;
return 1;
+ case offsetof(struct sched_ext_ops, timeout_ms):
+ if (*(u32 *)(udata + moff) > SCX_MAX_RUNNABLE_TIMEOUT)
+ return -E2BIG;
+ ops->timeout_ms = *(u32 *)(udata + moff);
+ return 1;
}
return 0;
@@ -2416,10 +2529,12 @@ void __init init_sched_ext_class(void)
struct rq *rq = cpu_rq(cpu);
init_dsq(&rq->scx.local_dsq, SCX_DSQ_LOCAL);
+ INIT_LIST_HEAD(&rq->scx.watchdog_list);
rq->scx.nr_running = 0;
}
register_sysrq_key('S', &sysrq_sched_ext_reset_op);
+ INIT_DELAYED_WORK(&check_timeout_work, scx_check_timeout_workfn);
}
diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h
index 6d5669481274..bda1d9c11486 100644
--- a/kernel/sched/ext.h
+++ b/kernel/sched/ext.h
@@ -67,6 +67,8 @@ void sched_enq_and_set_task(struct sched_enq_and_set_ctx *ctx);
extern const struct sched_class ext_sched_class;
extern const struct bpf_verifier_ops bpf_sched_ext_verifier_ops;
extern const struct file_operations sched_ext_fops;
+extern unsigned long task_runnable_timeout_ms;
+extern unsigned long last_timeout_check;
DECLARE_STATIC_KEY_FALSE(__scx_ops_enabled);
#define scx_enabled() static_branch_unlikely(&__scx_ops_enabled)
@@ -79,6 +81,29 @@ void scx_cancel_fork(struct task_struct *p);
int balance_scx(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
void init_sched_ext_class(void);
+__printf(2, 3) void scx_ops_error_type(enum scx_exit_type type,
+ const char *fmt, ...);
+#define scx_ops_error(fmt, args...) \
+ scx_ops_error_type(SCX_EXIT_ERROR, fmt, ##args)
+
+static inline void scx_notify_sched_tick(void)
+{
+ unsigned long last_check, timeout;
+
+ if (!scx_enabled())
+ return;
+
+ last_check = last_timeout_check;
+ timeout = msecs_to_jiffies(task_runnable_timeout_ms);
+ if (unlikely(time_after(jiffies, last_check + timeout))) {
+ u32 dur_ms = jiffies_to_msecs(jiffies - last_check);
+
+ scx_ops_error_type(SCX_EXIT_ERROR_STALL,
+ "watchdog failed to check in for %u.%03us",
+ dur_ms / 1000, dur_ms % 1000);
+ }
+}
+
static inline const struct sched_class *next_active_class(const struct sched_class *class)
{
class++;
@@ -112,6 +137,7 @@ static inline void scx_cancel_fork(struct task_struct *p) {}
static inline int balance_scx(struct rq *rq, struct task_struct *prev,
struct rq_flags *rf) { return 0; }
static inline void init_sched_ext_class(void) {}
+static inline void scx_notify_sched_tick(void) {}
#define for_each_active_class for_each_class
#define for_balance_class_range for_class_range
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 18a80d5b542b..40692fa7cb3c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -692,6 +692,7 @@ struct cfs_rq {
#ifdef CONFIG_SCHED_CLASS_EXT
struct scx_rq {
struct scx_dispatch_q local_dsq;
+ struct list_head watchdog_list;
u64 ops_qseq;
u32 nr_running;
#ifdef CONFIG_SMP
diff --git a/tools/sched_ext/scx_example_qmap.bpf.c b/tools/sched_ext/scx_example_qmap.bpf.c
index 742c866d2a8e..9e0b6519c8a4 100644
--- a/tools/sched_ext/scx_example_qmap.bpf.c
+++ b/tools/sched_ext/scx_example_qmap.bpf.c
@@ -22,6 +22,8 @@
char _license[] SEC("license") = "GPL";
const volatile u64 slice_ns = SCX_SLICE_DFL;
+const volatile u32 stall_user_nth;
+const volatile u32 stall_kernel_nth;
u32 test_error_cnt;
@@ -106,6 +108,15 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
u32 pid = p->pid;
int idx;
void *ring;
+ static u32 user_cnt, kernel_cnt;
+
+ if (p->flags & PF_KTHREAD) {
+ if (stall_kernel_nth && !(++kernel_cnt % stall_kernel_nth))
+ return;
+ } else {
+ if (stall_user_nth && !(++user_cnt % stall_user_nth))
+ return;
+ }
if (test_error_cnt && !--test_error_cnt)
scx_bpf_error("test triggering error");
@@ -224,5 +235,6 @@ struct sched_ext_ops qmap_ops = {
.dispatch = (void *)qmap_dispatch,
.prep_enable = (void *)qmap_prep_enable,
.exit = (void *)qmap_exit,
+ .timeout_ms = 5000U,
.name = "qmap",
};
diff --git a/tools/sched_ext/scx_example_qmap.c b/tools/sched_ext/scx_example_qmap.c
index c5a208533404..34c764c38e19 100644
--- a/tools/sched_ext/scx_example_qmap.c
+++ b/tools/sched_ext/scx_example_qmap.c
@@ -20,10 +20,12 @@ const char help_fmt[] =
"\n"
"See the top-level comment in .bpf.c for more details.\n"
"\n"
-"Usage: %s [-s SLICE_US] [-e COUNT]\n"
+"Usage: %s [-s SLICE_US] [-e COUNT] [-t COUNT] [-T COUNT]\n"
"\n"
" -s SLICE_US Override slice duration\n"
" -e COUNT Trigger scx_bpf_error() after COUNT enqueues\n"
+" -t COUNT Stall every COUNT'th user thread\n"
+" -T COUNT Stall every COUNT'th kernel thread\n"
" -h Display this help and exit\n";
static volatile int exit_req;
@@ -47,13 +49,19 @@ int main(int argc, char **argv)
skel = scx_example_qmap__open();
assert(skel);
- while ((opt = getopt(argc, argv, "hs:e:tTd:")) != -1) {
+ while ((opt = getopt(argc, argv, "hs:e:t:T:d:")) != -1) {
switch (opt) {
case 's':
skel->rodata->slice_ns = strtoull(optarg, NULL, 0) * 1000;
break;
case 'e':
- skel->bss->test_error_cnt = strtoull(optarg, NULL, 0);
+ skel->bss->test_error_cnt = strtoul(optarg, NULL, 0);
+ break;
+ case 't':
+ skel->rodata->stall_user_nth = strtoul(optarg, NULL, 0);
+ break;
+ case 'T':
+ skel->rodata->stall_kernel_nth = strtoul(optarg, NULL, 0);
break;
default:
fprintf(stderr, help_fmt, basename(argv[0]));
--
2.38.1
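The stall check that the watchdog patch above performs in
check_rq_for_timeouts() can be illustrated in userspace. The following is a
minimal sketch under stated assumptions, not the kernel code:
sketch_time_after() imitates the wraparound-safe comparison of the kernel's
time_after() macro, while struct sketch_task and sketch_find_stalled() are
hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-task watchdog state (not a kernel type). */
struct sketch_task {
	int pid;
	unsigned long runnable_at;	/* tick count when the task became runnable */
};

/*
 * Wraparound-safe "a is after b", in the manner of the kernel's time_after():
 * the unsigned difference is reinterpreted as a signed value.
 */
static int sketch_time_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/*
 * Return the index of the first task that has been runnable for longer than
 * @timeout ticks at time @now, or -1 if no task has stalled.
 */
static int sketch_find_stalled(const struct sketch_task *tasks, size_t nr,
			       unsigned long now, unsigned long timeout)
{
	size_t i;

	assert(tasks != NULL || nr == 0);

	for (i = 0; i < nr; i++) {
		if (sketch_time_after(now, tasks[i].runnable_at + timeout))
			return (int)i;
	}
	return -1;
}
```

As in the patch, the scan stops at the first stalled task, because a single
stall is enough to abort the BPF scheduler.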
Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
Documentation/scheduler/index.rst | 1 +
Documentation/scheduler/sched-ext.rst | 230 ++++++++++++++++++++++++++
include/linux/sched/ext.h | 2 +
kernel/sched/ext.c | 2 +
kernel/sched/ext.h | 2 +
5 files changed, 237 insertions(+)
create mode 100644 Documentation/scheduler/sched-ext.rst
diff --git a/Documentation/scheduler/index.rst b/Documentation/scheduler/index.rst
index b430d856056a..8a27a9967284 100644
--- a/Documentation/scheduler/index.rst
+++ b/Documentation/scheduler/index.rst
@@ -18,6 +18,7 @@ Linux Scheduler
sched-nice-design
sched-rt-group
sched-stats
+ sched-ext
sched-debug
text_files
diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
new file mode 100644
index 000000000000..81f78e05a6c2
--- /dev/null
+++ b/Documentation/scheduler/sched-ext.rst
@@ -0,0 +1,230 @@
+==========================
+Extensible Scheduler Class
+==========================
+
+sched_ext is a scheduler class whose behavior can be defined by a set of BPF
+programs - the BPF scheduler.
+
+* sched_ext exports a full scheduling interface so that any scheduling
+ algorithm can be implemented on top.
+
+* The BPF scheduler can group CPUs however it sees fit and schedule them
+ together, as tasks aren't tied to specific CPUs at the time of wakeup.
+
+* The BPF scheduler can be turned on and off dynamically anytime.
+
+* The system integrity is maintained no matter what the BPF scheduler does.
+ The default scheduling behavior is restored anytime an error is detected,
+ a runnable task stalls, or on sysrq-S.
+
+Switching to and from sched_ext
+===============================
+
+``CONFIG_SCHED_CLASS_EXT`` is the config option to enable sched_ext and
+``tools/sched_ext`` contains the example schedulers.
+
+sched_ext is used only when the BPF scheduler is loaded and running.
+
+If a task explicitly sets its scheduling policy to ``SCHED_EXT``, it will be
+treated as ``SCHED_NORMAL`` and scheduled by CFS until the BPF scheduler is
+loaded. On load, such tasks will be switched to and scheduled by sched_ext.
+
+The BPF scheduler can choose to schedule all normal and lower class tasks by
+calling ``scx_bpf_switch_all()`` from its ``init()`` operation. In this
+case, all ``SCHED_NORMAL``, ``SCHED_BATCH``, ``SCHED_IDLE`` and
+``SCHED_EXT`` tasks are scheduled by sched_ext. In the example schedulers,
+this mode can be selected with the ``-a`` option.
+
+Terminating the sched_ext scheduler program, triggering sysrq-S, or
+detection of any internal error including stalled runnable tasks aborts the
+BPF scheduler and reverts all tasks back to CFS.
+
+.. code-block:: none
+
+ # make -j16 -C tools/sched_ext
+ # tools/sched_ext/scx_example_dummy -a
+ local=0 global=3
+ local=5 global=24
+ local=9 global=44
+ local=13 global=56
+ local=17 global=72
+ ^CEXIT: BPF scheduler unregistered
+
+If ``CONFIG_SCHED_DEBUG`` is set, the current status of the BPF scheduler
+and whether a given task is on sched_ext can be determined as follows:
+
+.. code-block:: none
+
+ # cat /sys/kernel/debug/sched/ext
+ ops : dummy
+ enabled : 1
+ switching_all : 1
+ switched_all : 1
+ enable_state : enabled
+
+ # grep ext /proc/self/sched
+ ext.enabled : 1
+
+The Basics
+==========
+
+Userspace can implement an arbitrary BPF scheduler by loading a set of BPF
+programs that implement ``struct sched_ext_ops``. The only mandatory field
+is ``.name`` which must be a valid BPF object name. All operations are
+optional. The following modified excerpt is from
+``tools/sched_ext/scx_example_dummy.bpf.c`` showing a minimal global FIFO
+scheduler.
+
+.. code-block:: c
+
+ s32 BPF_STRUCT_OPS(dummy_init)
+ {
+ if (switch_all)
+ scx_bpf_switch_all();
+ return 0;
+ }
+
+ void BPF_STRUCT_OPS(dummy_enqueue, struct task_struct *p, u64 enq_flags)
+ {
+ if (enq_flags & SCX_ENQ_LOCAL)
+ scx_bpf_dispatch(p, SCX_DSQ_LOCAL, enq_flags);
+ else
+ scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, enq_flags);
+ }
+
+ void BPF_STRUCT_OPS(dummy_exit, struct scx_exit_info *ei)
+ {
+ exit_type = ei->type;
+ }
+
+ SEC(".struct_ops")
+ struct sched_ext_ops dummy_ops = {
+ .enqueue = (void *)dummy_enqueue,
+ .init = (void *)dummy_init,
+ .exit = (void *)dummy_exit,
+ .name = "dummy",
+ };
+
+Dispatch Queues
+---------------
+
+To match the impedance between the scheduler core and the BPF scheduler,
+sched_ext uses simple FIFOs called dsq's (dispatch queues). By default,
+there is one global FIFO (``SCX_DSQ_GLOBAL``), and one local dsq per CPU
+(``SCX_DSQ_LOCAL``). The BPF scheduler can manage an arbitrary number of
+dsq's using ``scx_bpf_create_dsq()`` and ``scx_bpf_destroy_dsq()``.
+
+A task is always *dispatch*ed to a dsq for execution. The task starts
+execution when a CPU *consume*s the task from the dsq.
+
+Internally, a CPU only executes tasks which are running on its local dsq,
+and the ``.consume()`` operation is in fact a transfer of a task from a
+remote dsq to the CPU's local dsq. A CPU therefore only consumes from other
+dsq's when its local dsq is empty, and dispatching a task to a local dsq
+will cause it to be executed before the CPU attempts to consume tasks which
+were previously dispatched to other dsq's.
+
+Scheduling Cycle
+----------------
+
+The following briefly shows how a waking task is scheduled and executed.
+
+1. When a task is waking up, ``.select_cpu()`` is the first operation
+ invoked. This serves two purposes: it provides a CPU selection
+ optimization hint, and it wakes up the selected CPU if it is idle.
+
+ The CPU selected by ``.select_cpu()`` is an optimization hint and not
+ binding. The actual decision is made at the last step of scheduling.
+ However, there is a small performance gain if the CPU ``.select_cpu()``
+ returns matches the CPU the task eventually runs on.
+
+ A side-effect of selecting a CPU is waking it up from idle. While a BPF
+ scheduler can wake up any CPU using the ``scx_bpf_kick_cpu()`` helper,
+ using ``.select_cpu()`` judiciously can be simpler and more efficient.
+
+ Note that the scheduler core will ignore an invalid CPU selection, for
+ example, if it's outside the allowed cpumask of the task.
+
+2. Once the target CPU is selected, ``.enqueue()`` is invoked. It can make
+ one of the following decisions:
+
+ * Immediately dispatch the task to either the global or local dsq by
+ calling ``scx_bpf_dispatch()`` with ``SCX_DSQ_GLOBAL`` or
+ ``SCX_DSQ_LOCAL``, respectively.
+
+ * Immediately dispatch the task to a user-created dsq by calling
+ ``scx_bpf_dispatch()`` with a dsq ID which is smaller than 2^63.
+
+ * Queue the task on the BPF side.
+
+3. When a CPU is ready to schedule, it first looks at its local dsq. If
+ empty, it invokes ``.consume()`` which should make one or more
+ ``scx_bpf_consume()`` calls to consume tasks from dsq's. If a
+ ``scx_bpf_consume()`` call succeeds, the CPU has the next task to run and
+ ``.consume()`` can return.
+
+ If ``.consume()`` is not implemented, the built-in ``SCX_DSQ_GLOBAL`` dsq
+ is consumed by default.
+
+4. If there's still no task to run, ``.dispatch()`` is invoked which should
+ make one or more ``scx_bpf_dispatch()`` calls to dispatch tasks from the
+ BPF scheduler to one of the dsq's. If one or more tasks have been
+ dispatched, go back to the previous consumption step.
+
+5. If there's still no task to run, ``.consume_final()`` is invoked.
+ ``.consume_final()`` is equivalent to ``.consume()``, but is invoked
+ right before the CPU goes idle. This provides schedulers with a hook that
+ can be used to implement, e.g., more aggressive work stealing from remote
+ dsq's.
+
+Note that the BPF scheduler can always choose to dispatch tasks immediately
+in ``.enqueue()`` as illustrated in the above dummy example. In such a case,
+there's no need to implement ``.dispatch()`` as a task is never queued on
+the BPF side.
+
+Where to Look
+=============
+
+* ``include/linux/sched/ext.h`` defines the core data structures, ops table
+ and constants.
+
+* ``kernel/sched/ext.c`` contains sched_ext core implementation and helpers.
+ The functions prefixed with ``scx_bpf_`` can be called from the BPF
+ scheduler.
+
+* ``tools/sched_ext/`` hosts example BPF scheduler implementations.
+
+ * ``scx_example_dummy[.bpf].c``: Minimal global FIFO scheduler example
+ using a custom dsq.
+
+ * ``scx_example_qmap[.bpf].c``: A multi-level FIFO scheduler supporting
+ five levels of priority implemented with ``BPF_MAP_TYPE_QUEUE``.
+
+ABI Instability
+===============
+
+The APIs provided by sched_ext to BPF scheduler programs have no stability
+guarantees. This includes the ops table callbacks and constants defined in
+``include/linux/sched/ext.h``, as well as the ``scx_bpf_`` kfuncs defined in
+``kernel/sched/ext.c``.
+
+While we will attempt to provide a relatively stable API surface when
+possible, these APIs are subject to change without warning between kernel
+versions.
+
+Caveats
+=======
+
+* The current implementation isn't safe in that the BPF scheduler can crash
+ the kernel.
+
+ * Unsafe cpumask helpers should be replaced by proper generic BPF helpers.
+
+ * Currently, all kfunc helpers can be called by any operation as BPF
+ doesn't yet support filtering kfunc calls per struct_ops operation. Some
+ helpers are context sensitive and should be restricted accordingly.
+
+ * Timers used by the BPF scheduler should be shut down when aborting.
+
+* There are a couple of BPF hacks which are still needed even for sched_ext
+ proper. They should be removed in the near future.
diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index d9f941e23011..49eda3adeecf 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -1,5 +1,7 @@
/* SPDX-License-Identifier: GPL-2.0 */
/*
+ * BPF extensible scheduler class: Documentation/scheduler/sched-ext.rst
+ *
* Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
* Copyright (c) 2022 Tejun Heo <[email protected]>
* Copyright (c) 2022 David Vernet <[email protected]>
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index aab9ae13b88f..a28144220501 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1,5 +1,7 @@
/* SPDX-License-Identifier: GPL-2.0 */
/*
+ * BPF extensible scheduler class: Documentation/scheduler/sched-ext.rst
+ *
* Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
* Copyright (c) 2022 Tejun Heo <[email protected]>
* Copyright (c) 2022 David Vernet <[email protected]>
diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h
index e1eaaba3d4c7..b97dbb840ac9 100644
--- a/kernel/sched/ext.h
+++ b/kernel/sched/ext.h
@@ -1,5 +1,7 @@
/* SPDX-License-Identifier: GPL-2.0 */
/*
+ * BPF extensible scheduler class: Documentation/scheduler/sched-ext.rst
+ *
* Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
* Copyright (c) 2022 Tejun Heo <[email protected]>
* Copyright (c) 2022 David Vernet <[email protected]>
--
2.38.1
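The scheduling cycle described in Documentation/scheduler/sched-ext.rst above
(local dsq first, then consume, then dispatch and consume again) can be
sketched as a small userspace simulation. This is only an illustration under
stated assumptions: struct sketch_dsq and the sketch_*() functions are
hypothetical, tasks are plain integers, the BPF-side queue is modelled as
another FIFO, and the ``.consume_final()`` step is omitted.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_DSQ_CAP 16

/* Hypothetical fixed-size FIFO standing in for a dsq; holds task IDs. */
struct sketch_dsq {
	int tasks[SKETCH_DSQ_CAP];
	size_t head, len;
};

static void sketch_push(struct sketch_dsq *q, int task)
{
	assert(q->len < SKETCH_DSQ_CAP);
	q->tasks[(q->head + q->len) % SKETCH_DSQ_CAP] = task;
	q->len++;
}

/* Pop the oldest task, or return -1 if the queue is empty. */
static int sketch_pop(struct sketch_dsq *q)
{
	int task;

	if (!q->len)
		return -1;
	task = q->tasks[q->head];
	q->head = (q->head + 1) % SKETCH_DSQ_CAP;
	q->len--;
	return task;
}

/* Move one task from @src to @dst; returns 1 if a task was moved. */
static int sketch_move(struct sketch_dsq *src, struct sketch_dsq *dst)
{
	int task = sketch_pop(src);

	if (task < 0)
		return 0;
	sketch_push(dst, task);
	return 1;
}

/*
 * Pick the next task for one CPU: local dsq first, then consume from the
 * global dsq, then dispatch from the BPF-side queue and consume again.
 * Returns -1 if the CPU would go idle.
 */
static int sketch_pick_next(struct sketch_dsq *local, struct sketch_dsq *global,
			    struct sketch_dsq *bpf_queue)
{
	if (!local->len && !sketch_move(global, local)) {
		/* Nothing to consume: let the "BPF scheduler" dispatch. */
		if (sketch_move(bpf_queue, global))
			sketch_move(global, local);
	}
	return sketch_pop(local);
}
```

With task 1 on the local dsq, task 2 on the global dsq and task 3 queued on
the BPF side, successive sketch_pick_next() calls return the tasks in that
order, which matches the priority the document gives to the local dsq.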
When a task switches to a new sched_class, the prev and new classes are
notified through ->switched_from() and ->switched_to(), respectively, after
the switching is done. However, a new sched_class needs to prepare the task
state before it is enqueued on the new class for the first time.
This patch adds ->switching_to() which is called during sched_class switch
through check_class_changing() before the task is restored and exposes
check_class_changing/changed() in kernel/sched/sched.h.
This is a prep patch and doesn't cause any behavior changes. The new
operation and exposed functions aren't used yet.
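The intended ordering (prepare through ->switching_to() before the first
enqueue on the new class) can be illustrated with a minimal userspace sketch.
Everything named sketch_* below is hypothetical and greatly simplified; in
particular, the real code dequeues the task first and holds the pi_lock and
rq lock throughout, which this sketch does not model.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_task;

/* Hypothetical, cut-down analogue of struct sched_class. */
struct sketch_class {
	void (*switching_to)(struct sketch_task *p);	/* optional, may be NULL */
	void (*enqueue)(struct sketch_task *p);
};

struct sketch_task {
	const struct sketch_class *cls;
	int prepared;		/* set once the new class has prepared the task */
	int enqueued_prepared;	/* did enqueue() observe the prepared state? */
};

/* Follows the shape of check_class_changing() from the patch. */
static void sketch_check_class_changing(struct sketch_task *p,
					const struct sketch_class *prev_class)
{
	if (prev_class != p->cls && p->cls->switching_to)
		p->cls->switching_to(p);
}

/* Switch @p to @new_class: prepare first, then enqueue on the new class. */
static void sketch_change_class(struct sketch_task *p,
				const struct sketch_class *new_class)
{
	const struct sketch_class *prev_class = p->cls;

	assert(new_class != NULL && new_class->enqueue != NULL);

	p->cls = new_class;
	sketch_check_class_changing(p, prev_class);
	p->cls->enqueue(p);
}

/* Example callbacks for a class that needs per-task preparation. */
static void sketch_ext_switching_to(struct sketch_task *p)
{
	p->prepared = 1;
}

static void sketch_ext_enqueue(struct sketch_task *p)
{
	p->enqueued_prepared = p->prepared;
}

static void sketch_fair_enqueue(struct sketch_task *p)
{
	(void)p;	/* nothing to record for the plain class */
}

static const struct sketch_class sketch_fair_class = {
	.switching_to = NULL,
	.enqueue = sketch_fair_enqueue,
};

static const struct sketch_class sketch_ext_class = {
	.switching_to = sketch_ext_switching_to,
	.enqueue = sketch_ext_enqueue,
};
```

Switching a task from sketch_fair_class to sketch_ext_class leaves
enqueued_prepared set, i.e. the first enqueue on the new class already sees
the state set up by the switching_to callback.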
Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
kernel/sched/core.c | 20 +++++++++++++++++---
kernel/sched/sched.h | 7 +++++++
2 files changed, 24 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 70ec74dbb45a..d2247e8144e3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2151,6 +2151,17 @@ inline int task_curr(const struct task_struct *p)
return cpu_curr(task_cpu(p)) == p;
}
+/*
+ * ->switching_to() is called with the pi_lock and rq_lock held and must not
+ * mess with locking.
+ */
+void check_class_changing(struct rq *rq, struct task_struct *p,
+ const struct sched_class *prev_class)
+{
+ if (prev_class != p->sched_class && p->sched_class->switching_to)
+ p->sched_class->switching_to(rq, p);
+}
+
/*
* switched_from, switched_to and prio_changed must _NOT_ drop rq->lock,
* use the balance_callback list if you want balancing.
@@ -2158,9 +2169,9 @@ inline int task_curr(const struct task_struct *p)
* this means any call to check_class_changed() must be followed by a call to
* balance_callback().
*/
-static inline void check_class_changed(struct rq *rq, struct task_struct *p,
- const struct sched_class *prev_class,
- int oldprio)
+void check_class_changed(struct rq *rq, struct task_struct *p,
+ const struct sched_class *prev_class,
+ int oldprio)
{
if (prev_class != p->sched_class) {
if (prev_class->switched_from)
@@ -6974,6 +6985,7 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
}
__setscheduler_prio(p, prio);
+ check_class_changing(rq, p, prev_class);
if (queued)
enqueue_task(rq, p, queue_flag);
@@ -7603,6 +7615,8 @@ static int __sched_setscheduler(struct task_struct *p,
}
__setscheduler_uclamp(p, attr);
+ check_class_changing(rq, p, prev_class);
+
if (queued) {
/*
* We enqueue to tail when the priority of a task is
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 08799b2a566e..3f98773d66dd 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2191,6 +2191,7 @@ struct sched_class {
* cannot assume the switched_from/switched_to pair is serialized by
* rq->lock. They are however serialized by p->pi_lock.
*/
+ void (*switching_to) (struct rq *this_rq, struct task_struct *task);
void (*switched_from)(struct rq *this_rq, struct task_struct *task);
void (*switched_to) (struct rq *this_rq, struct task_struct *task);
void (*reweight_task)(struct rq *this_rq, struct task_struct *task,
@@ -2427,6 +2428,12 @@ static inline void sub_nr_running(struct rq *rq, unsigned count)
extern void activate_task(struct rq *rq, struct task_struct *p, int flags);
extern void deactivate_task(struct rq *rq, struct task_struct *p, int flags);
+extern void check_class_changing(struct rq *rq, struct task_struct *p,
+ const struct sched_class *prev_class);
+extern void check_class_changed(struct rq *rq, struct task_struct *p,
+ const struct sched_class *prev_class,
+ int oldprio);
+
extern void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags);
#ifdef CONFIG_PREEMPT_RT
--
2.38.1
It's often useful to wake up and/or trigger a reschedule on other CPUs. This
patch adds the scx_bpf_kick_cpu() kfunc helper that the BPF scheduler can
call to kick the target CPU into the scheduling path.
As a sched_ext task relinquishes its CPU only after its slice is depleted,
this patch also adds SCX_KICK_PREEMPT and SCX_ENQ_PREEMPT which clear the
slice of the target CPU's current task to guarantee that sched_ext's
scheduling path runs on the CPU.
This patch also adds a new example scheduler, scx_example_central, which
demonstrates central scheduling where one CPU is responsible for making all
scheduling decisions in the system. The central CPU queues tasks on the
appropriate local dsq's and preempts the worker CPUs. The worker CPUs in
turn preempt the central CPU when they need tasks to run.
Currently, every CPU depends on its own tick to expire the current task. A
follow-up patch implementing tickless support for sched_ext will allow the
worker CPUs to go full tickless so that they can run completely undisturbed.
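The difference between a plain kick and a preempting kick can be shown with a
small userspace model. This is a sketch under stated assumptions rather than
the kernel implementation: struct sketch_cpu, SKETCH_KICK_PREEMPT (including
its value) and the sketch_*() functions are hypothetical, and the "keeps the
CPU while the slice is non-zero" rule is a simplification of the behaviour
described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_KICK_PREEMPT (1u << 0)	/* hypothetical flag value */

/* Hypothetical per-CPU state; not a kernel type. */
struct sketch_cpu {
	bool curr_is_ext;		/* current task belongs to the ext class */
	unsigned long long curr_slice;	/* remaining slice of the current task, ns */
	bool need_resched;		/* CPU was asked to enter the scheduling path */
};

/*
 * Kick @cpu into rescheduling. With SKETCH_KICK_PREEMPT, also clear the slice
 * of the current task if it is an ext-class task.
 */
static void sketch_kick_cpu(struct sketch_cpu *cpu, unsigned int flags)
{
	assert(cpu != NULL);

	if ((flags & SKETCH_KICK_PREEMPT) && cpu->curr_is_ext)
		cpu->curr_slice = 0;
	cpu->need_resched = true;
}

/* Simplified rule: an ext task keeps its CPU only while it has slice left. */
static bool sketch_curr_keeps_cpu(const struct sketch_cpu *cpu)
{
	return cpu->curr_is_ext && cpu->curr_slice > 0;
}
```

A kick without the preempt flag only requests a reschedule, so a task with
slice left keeps running in this model; a kick with the flag clears the slice
first, so the task gives up the CPU at the next scheduling point.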
Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
include/linux/sched/ext.h | 4 +
kernel/sched/ext.c | 88 ++++++++-
kernel/sched/ext.h | 12 ++
kernel/sched/sched.h | 1 +
tools/sched_ext/.gitignore | 1 +
tools/sched_ext/Makefile | 8 +-
tools/sched_ext/scx_common.bpf.h | 1 +
tools/sched_ext/scx_example_central.bpf.c | 229 ++++++++++++++++++++++
tools/sched_ext/scx_example_central.c | 91 +++++++++
9 files changed, 430 insertions(+), 5 deletions(-)
create mode 100644 tools/sched_ext/scx_example_central.bpf.c
create mode 100644 tools/sched_ext/scx_example_central.c
diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index 82dcbecfcfb9..6e25c3431bb4 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -394,6 +394,10 @@ struct sched_ext_entity {
* scx_bpf_dispatch() but can also be modified directly by the BPF
* scheduler. Automatically decreased by SCX as the task executes. On
* depletion, a scheduling event is triggered.
+ *
+ * This value is cleared to zero if the task is preempted by
+ * %SCX_KICK_PREEMPT and shouldn't be used to determine how long the
+ * task ran. Use p->se.sum_exec_runtime instead.
*/
u64 slice;
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index ba0a7a9ea5f2..4a98047a06bc 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -404,7 +404,7 @@ static void dispatch_enqueue(struct scx_dispatch_q *dsq, struct task_struct *p,
}
}
- if (enq_flags & SCX_ENQ_HEAD)
+ if (enq_flags & (SCX_ENQ_HEAD | SCX_ENQ_PREEMPT))
list_add(&p->scx.dsq_node, &dsq->fifo);
else
list_add_tail(&p->scx.dsq_node, &dsq->fifo);
@@ -420,8 +420,16 @@ static void dispatch_enqueue(struct scx_dispatch_q *dsq, struct task_struct *p,
if (is_local) {
struct rq *rq = container_of(dsq, struct rq, scx.local_dsq);
+ bool preempt = false;
- if (sched_class_above(&ext_sched_class, rq->curr->sched_class))
+ if ((enq_flags & SCX_ENQ_PREEMPT) && p != rq->curr &&
+ rq->curr->sched_class == &ext_sched_class) {
+ rq->curr->scx.slice = 0;
+ preempt = true;
+ }
+
+ if (preempt || sched_class_above(&ext_sched_class,
+ rq->curr->sched_class))
resched_curr(rq);
} else {
raw_spin_unlock(&dsq->lock);
@@ -1707,7 +1715,9 @@ int scx_check_setscheduler(struct task_struct *p, int policy)
* Omitted operations:
*
* - check_preempt_curr: NOOP as it isn't useful in the wakeup path because the
- * task isn't tied to the CPU at that point.
+ * task isn't tied to the CPU at that point. Preemption is implemented by
+ * resetting the victim task's slice to 0 and triggering reschedule on the
+ * target CPU.
*
* - migrate_task_rq: Unncessary as task to cpu mapping is transient.
*
@@ -2564,6 +2574,34 @@ static const struct sysrq_key_op sysrq_sched_ext_reset_op = {
.enable_mask = SYSRQ_ENABLE_RTNICE,
};
+#ifdef CONFIG_SMP
+static void kick_cpus_irq_workfn(struct irq_work *irq_work)
+{
+ struct rq *this_rq = this_rq();
+ int this_cpu = cpu_of(this_rq);
+ int cpu;
+
+ for_each_cpu(cpu, this_rq->scx.cpus_to_kick) {
+ struct rq *rq = cpu_rq(cpu);
+ unsigned long flags;
+
+ raw_spin_rq_lock_irqsave(rq, flags);
+
+ if (cpu_online(cpu) || cpu == this_cpu) {
+ if (cpumask_test_cpu(cpu, this_rq->scx.cpus_to_preempt) &&
+ rq->curr->sched_class == &ext_sched_class)
+ rq->curr->scx.slice = 0;
+ resched_curr(rq);
+ }
+
+ raw_spin_rq_unlock_irqrestore(rq, flags);
+ }
+
+ cpumask_clear(this_rq->scx.cpus_to_kick);
+ cpumask_clear(this_rq->scx.cpus_to_preempt);
+}
+#endif
+
void __init init_sched_ext_class(void)
{
int cpu;
@@ -2587,6 +2625,11 @@ void __init init_sched_ext_class(void)
init_dsq(&rq->scx.local_dsq, SCX_DSQ_LOCAL);
INIT_LIST_HEAD(&rq->scx.watchdog_list);
rq->scx.nr_running = 0;
+#ifdef CONFIG_SMP
+ BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_kick, GFP_KERNEL));
+ BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_preempt, GFP_KERNEL));
+ init_irq_work(&rq->scx.kick_cpus_irq_work, kick_cpus_irq_workfn);
+#endif
}
register_sysrq_key('S', &sysrq_sched_ext_reset_op);
@@ -2772,6 +2815,44 @@ static const struct btf_kfunc_id_set scx_kfunc_set_consume = {
.set = &scx_kfunc_ids_consume,
};
+/**
+ * scx_bpf_kick_cpu - Trigger reschedule on a CPU
+ * @cpu: cpu to kick
+ * @flags: SCX_KICK_* flags
+ *
+ * Kick @cpu into rescheduling. This can be used to wake up an idle CPU or
+ * trigger rescheduling on a busy CPU. This can be called from any online
+ * scx_ops operation and the actual kicking is performed asynchronously through
+ * an irq work.
+ */
+void scx_bpf_kick_cpu(s32 cpu, u64 flags)
+{
+ if (!ops_cpu_valid(cpu)) {
+ scx_ops_error("invalid cpu %d", cpu);
+ return;
+ }
+#ifdef CONFIG_SMP
+ {
+ struct rq *rq;
+
+ preempt_disable();
+ rq = this_rq();
+
+ /*
+ * Actual kicking is bounced to kick_cpus_irq_workfn() to avoid
+ * nesting rq locks. We can probably be smarter and avoid
+ * bouncing if called from ops which don't hold a rq lock.
+ */
+ cpumask_set_cpu(cpu, rq->scx.cpus_to_kick);
+ if (flags & SCX_KICK_PREEMPT)
+ cpumask_set_cpu(cpu, rq->scx.cpus_to_preempt);
+
+ irq_work_queue(&rq->scx.kick_cpus_irq_work);
+ preempt_enable();
+ }
+#endif
+}
+
/**
* scx_bpf_dsq_nr_queued - Return the number of queued tasks
* @dsq_id: id of the dsq
@@ -2845,6 +2926,7 @@ s32 scx_bpf_pick_idle_cpu(const struct cpumask *cpus_allowed)
}
BTF_SET8_START(scx_kfunc_ids_online)
+BTF_ID_FLAGS(func, scx_bpf_kick_cpu)
BTF_ID_FLAGS(func, scx_bpf_dsq_nr_queued)
BTF_ID_FLAGS(func, scx_bpf_test_and_clear_cpu_idle)
BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu)
diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h
index 142dce30764d..3597b7b5829e 100644
--- a/kernel/sched/ext.h
+++ b/kernel/sched/ext.h
@@ -19,6 +19,14 @@ enum scx_enq_flags {
/* high 32bits are ext specific flags */
+ /*
+ * Set the following to trigger preemption when calling
+ * scx_bpf_dispatch() with a local dsq as the target. The slice of the
+ * current task is cleared to zero and the CPU is kicked into the
+ * scheduling path. Implies %SCX_ENQ_HEAD.
+ */
+ SCX_ENQ_PREEMPT = 1LLU << 32,
+
/*
* The task being enqueued is the only task available for the cpu. By
* default, ext core keeps executing such tasks but when
@@ -51,6 +59,10 @@ enum scx_deq_flags {
SCX_DEQ_SLEEP = DEQUEUE_SLEEP,
};
+enum scx_kick_flags {
+ SCX_KICK_PREEMPT = 1LLU << 0, /* force scheduling on the CPU */
+};
+
#ifdef CONFIG_SCHED_CLASS_EXT
struct sched_enq_and_set_ctx {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 40692fa7cb3c..0d8b52c52e2b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -698,6 +698,7 @@ struct scx_rq {
#ifdef CONFIG_SMP
cpumask_var_t cpus_to_kick;
cpumask_var_t cpus_to_preempt;
+ struct irq_work kick_cpus_irq_work;
#endif
};
#endif /* CONFIG_SCHED_CLASS_EXT */
diff --git a/tools/sched_ext/.gitignore b/tools/sched_ext/.gitignore
index 6734f7fd9324..389f0e5b0970 100644
--- a/tools/sched_ext/.gitignore
+++ b/tools/sched_ext/.gitignore
@@ -1,5 +1,6 @@
scx_example_dummy
scx_example_qmap
+scx_example_central
*.skel.h
*.subskel.h
/tools/
diff --git a/tools/sched_ext/Makefile b/tools/sched_ext/Makefile
index db99745e566d..d406b7586e08 100644
--- a/tools/sched_ext/Makefile
+++ b/tools/sched_ext/Makefile
@@ -114,7 +114,7 @@ BPF_CFLAGS = -g -D__TARGET_ARCH_$(SRCARCH) \
-Wno-compare-distinct-pointer-types \
-O2 -mcpu=v3
-all: scx_example_dummy scx_example_qmap
+all: scx_example_dummy scx_example_qmap scx_example_central
# sort removes libbpf duplicates when not cross-building
MAKE_DIRS := $(sort $(BUILD_DIR)/libbpf $(HOST_BUILD_DIR)/libbpf \
@@ -173,10 +173,14 @@ scx_example_qmap: scx_example_qmap.c scx_example_qmap.skel.h user_exit_info.h
$(CC) $(CFLAGS) -c $< -o $@.o
$(CC) -o $@ $@.o $(HOST_BPFOBJ) $(LDFLAGS)
+scx_example_central: scx_example_central.c scx_example_central.skel.h user_exit_info.h
+ $(CC) $(CFLAGS) -c $< -o $@.o
+ $(CC) -o $@ $@.o $(HOST_BPFOBJ) $(LDFLAGS)
+
clean:
rm -rf $(SCRATCH_DIR) $(HOST_SCRATCH_DIR)
rm -f *.o *.bpf.o *.skel.h *.subskel.h
- rm -f scx_example_dummy scx_example_qmap
+ rm -f scx_example_dummy scx_example_qmap scx_example_central
.PHONY: all clean
diff --git a/tools/sched_ext/scx_common.bpf.h b/tools/sched_ext/scx_common.bpf.h
index 212cb934db2d..dc4d3f7b461f 100644
--- a/tools/sched_ext/scx_common.bpf.h
+++ b/tools/sched_ext/scx_common.bpf.h
@@ -53,6 +53,7 @@ extern s32 scx_bpf_create_dsq(u64 dsq_id, s32 node) __ksym;
extern bool scx_bpf_consume(u64 dsq_id) __ksym;
extern u32 scx_bpf_dispatch_nr_slots(void) __ksym;
extern void scx_bpf_dispatch(struct task_struct *p, u64 dsq_id, u64 slice, u64 enq_flags) __ksym;
+extern void scx_bpf_kick_cpu(s32 cpu, u64 flags) __ksym;
extern s32 scx_bpf_dsq_nr_queued(u64 dsq_id) __ksym;
extern bool scx_bpf_test_and_clear_cpu_idle(s32 cpu) __ksym;
extern s32 scx_bpf_pick_idle_cpu(const cpumask_t *cpus_allowed) __ksym;
diff --git a/tools/sched_ext/scx_example_central.bpf.c b/tools/sched_ext/scx_example_central.bpf.c
new file mode 100644
index 000000000000..f53ed4baf92d
--- /dev/null
+++ b/tools/sched_ext/scx_example_central.bpf.c
@@ -0,0 +1,229 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * A central FIFO sched_ext scheduler which demonstrates the following:
+ *
+ * a. Making all scheduling decisions from one CPU:
+ *
+ * The central CPU is the only one making scheduling decisions. All other
+ * CPUs kick the central CPU when they run out of tasks to run.
+ *
+ * There is one global BPF queue and the central CPU schedules all CPUs by
+ * dispatching from the global queue to each CPU's local dsq from dispatch().
+ * This isn't the most straightforward design; e.g. it would be easier to
+ * bounce through per-CPU BPF queues. The current design is chosen to
+ * maximally utilize and verify various scx mechanisms such as LOCAL_ON
+ * dispatching and consume_final().
+ *
+ * b. Preemption
+ *
+ * SCX_KICK_PREEMPT is used to trigger scheduling and make CPUs move on to
+ * their next tasks.
+ *
+ * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2022 Tejun Heo <[email protected]>
+ * Copyright (c) 2022 David Vernet <[email protected]>
+ */
+#include "scx_common.bpf.h"
+
+char _license[] SEC("license") = "GPL";
+
+enum {
+ FALLBACK_DSQ_ID = 0,
+ MAX_CPUS = 4096,
+ MS_TO_NS = 1000LLU * 1000,
+ TIMER_INTERVAL_NS = 1 * MS_TO_NS,
+};
+
+const volatile bool switch_all;
+const volatile s32 central_cpu;
+const volatile u32 nr_cpu_ids;
+
+u64 nr_total, nr_locals, nr_queued, nr_lost_pids;
+u64 nr_dispatches, nr_mismatches, nr_overflows;
+
+struct user_exit_info uei;
+
+struct {
+ __uint(type, BPF_MAP_TYPE_QUEUE);
+ __uint(max_entries, 4096);
+ __type(value, s32);
+} central_q SEC(".maps");
+
+/* can't use percpu map due to bad lookups */
+static bool cpu_gimme_task[MAX_CPUS];
+
+struct central_timer {
+ struct bpf_timer timer;
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __uint(max_entries, 1);
+ __type(key, u32);
+ __type(value, struct central_timer);
+} central_timer SEC(".maps");
+
+static bool vtime_before(u64 a, u64 b)
+{
+ return (s64)(a - b) < 0;
+}
+
+s32 BPF_STRUCT_OPS(central_select_cpu, struct task_struct *p,
+ s32 prev_cpu, u64 wake_flags)
+{
+ /*
+ * Steer wakeups to the central CPU as much as possible to avoid
+ * disturbing other CPUs. It's safe to blindly return the central cpu as
+ * select_cpu() is a hint and if @p can't be on it, the kernel will
+ * automatically pick a fallback CPU.
+ */
+ return central_cpu;
+}
+
+void BPF_STRUCT_OPS(central_enqueue, struct task_struct *p, u64 enq_flags)
+{
+ s32 pid = p->pid;
+
+ __sync_fetch_and_add(&nr_total, 1);
+
+ if (bpf_map_push_elem(&central_q, &pid, 0)) {
+ __sync_fetch_and_add(&nr_overflows, 1);
+ scx_bpf_dispatch(p, FALLBACK_DSQ_ID, SCX_SLICE_DFL, enq_flags);
+ return;
+ }
+
+ __sync_fetch_and_add(&nr_queued, 1);
+
+ if (!scx_bpf_task_running(p))
+ scx_bpf_kick_cpu(central_cpu, SCX_KICK_PREEMPT);
+}
+
+static int dispatch_a_task_loopfn(u32 idx, void *data)
+{
+ s32 cpu = *(s32 *)data;
+ s32 pid;
+ struct task_struct *p;
+ bool *gimme;
+
+ if (bpf_map_pop_elem(&central_q, &pid))
+ return 1;
+
+ __sync_fetch_and_sub(&nr_queued, 1);
+
+ p = scx_bpf_find_task_by_pid(pid);
+ if (!p) {
+ __sync_fetch_and_add(&nr_lost_pids, 1);
+ return 0;
+ }
+
+ /*
+ * If we can't run the task at the top, do the dumb thing and bounce it
+ * to the fallback dsq.
+ */
+ if (!scx_bpf_cpumask_test_cpu(cpu, p->cpus_ptr)) {
+ __sync_fetch_and_add(&nr_mismatches, 1);
+ scx_bpf_dispatch(p, FALLBACK_DSQ_ID, SCX_SLICE_DFL, 0);
+ return 0;
+ }
+
+ /* dispatch to the local and mark that @cpu doesn't need more tasks */
+ scx_bpf_dispatch(p, SCX_DSQ_LOCAL_ON | cpu, SCX_SLICE_DFL, 0);
+
+ if (cpu != central_cpu)
+ scx_bpf_kick_cpu(cpu, 0);
+
+ gimme = MEMBER_VPTR(cpu_gimme_task, [cpu]);
+ if (gimme)
+ *gimme = false;
+
+ return 1;
+}
+
+static int dispatch_to_one_cpu_loopfn(u32 idx, void *data)
+{
+ s32 cpu = idx;
+
+ if (cpu >= 0 && cpu < MAX_CPUS) {
+ bool *gimme = MEMBER_VPTR(cpu_gimme_task, [cpu]);
+ if (gimme && !*gimme)
+ return 0;
+ }
+
+ bpf_loop(1 << 23, dispatch_a_task_loopfn, &cpu, 0);
+ return 0;
+}
+
+void BPF_STRUCT_OPS(central_dispatch, s32 cpu, struct task_struct *prev)
+{
+ /* if out of tasks to run, gimme more */
+ if (!scx_bpf_dsq_nr_queued(FALLBACK_DSQ_ID)) {
+ bool *gimme = MEMBER_VPTR(cpu_gimme_task, [cpu]);
+ if (gimme)
+ *gimme = true;
+ }
+
+ if (cpu == central_cpu) {
+ /* we're the scheduling CPU, dispatch for every CPU */
+ __sync_fetch_and_add(&nr_dispatches, 1);
+ bpf_loop(nr_cpu_ids, dispatch_to_one_cpu_loopfn, NULL, 0);
+ } else {
+ /*
+ * Force dispatch on the scheduling CPU so that it finds a task
+ * to run for us.
+ */
+ scx_bpf_kick_cpu(central_cpu, SCX_KICK_PREEMPT);
+ }
+}
+
+void BPF_STRUCT_OPS(central_consume, s32 cpu)
+{
+ /*
+ * When preempted, we want the central CPU to always run dispatch() as
+ * soon as possible so that it can schedule other CPUs. Don't consume
+ * the fallback dsq if central.
+ */
+ if (cpu != central_cpu)
+ scx_bpf_consume(FALLBACK_DSQ_ID);
+}
+
+void BPF_STRUCT_OPS(central_consume_final, s32 cpu)
+{
+ /*
+ * Now that the central CPU has dispatched, we can let it consume the
+ * fallback dsq.
+ */
+ if (cpu == central_cpu)
+ scx_bpf_consume(FALLBACK_DSQ_ID);
+}
+
+int BPF_STRUCT_OPS(central_init)
+{
+ if (switch_all)
+ scx_bpf_switch_all();
+
+ return scx_bpf_create_dsq(FALLBACK_DSQ_ID, -1);
+}
+
+void BPF_STRUCT_OPS(central_exit, struct scx_exit_info *ei)
+{
+ uei_record(&uei, ei);
+}
+
+SEC(".struct_ops")
+struct sched_ext_ops central_ops = {
+ /*
+ * We are offloading all scheduling decisions to the central CPU and
+ * thus being the last task on a given CPU doesn't mean anything
+ * special. Enqueue the last tasks like any other tasks.
+ */
+ .flags = SCX_OPS_ENQ_LAST,
+
+ .select_cpu = (void *)central_select_cpu,
+ .enqueue = (void *)central_enqueue,
+ .dispatch = (void *)central_dispatch,
+ .consume = (void *)central_consume,
+ .consume_final = (void *)central_consume_final,
+ .init = (void *)central_init,
+ .exit = (void *)central_exit,
+ .name = "central",
+};
diff --git a/tools/sched_ext/scx_example_central.c b/tools/sched_ext/scx_example_central.c
new file mode 100644
index 000000000000..c85e84459c58
--- /dev/null
+++ b/tools/sched_ext/scx_example_central.c
@@ -0,0 +1,91 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2022 Tejun Heo <[email protected]>
+ * Copyright (c) 2022 David Vernet <[email protected]>
+ */
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <unistd.h>
+#include <signal.h>
+#include <assert.h>
+#include <libgen.h>
+#include <bpf/bpf.h>
+#include "user_exit_info.h"
+#include "scx_example_central.skel.h"
+
+const char help_fmt[] =
+"A central FIFO sched_ext scheduler.\n"
+"\n"
+"See the top-level comment in .bpf.c for more details.\n"
+"\n"
+"Usage: %s [-a] [-c CPU]\n"
+"\n"
+" -a Switch all tasks\n"
+" -c CPU Override the central CPU (default: 0)\n"
+" -h Display this help and exit\n";
+
+static volatile int exit_req;
+
+static void sigint_handler(int dummy)
+{
+ exit_req = 1;
+}
+
+int main(int argc, char **argv)
+{
+ struct scx_example_central *skel;
+ struct bpf_link *link;
+ u64 seq = 0;
+ s32 opt;
+
+ signal(SIGINT, sigint_handler);
+ signal(SIGTERM, sigint_handler);
+
+ libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
+
+ skel = scx_example_central__open();
+ assert(skel);
+
+ skel->rodata->central_cpu = 0;
+ skel->rodata->nr_cpu_ids = libbpf_num_possible_cpus();
+
+ while ((opt = getopt(argc, argv, "ahc:")) != -1) {
+ switch (opt) {
+ case 'a':
+ skel->rodata->switch_all = true;
+ break;
+ case 'c':
+ skel->rodata->central_cpu = strtoul(optarg, NULL, 0);
+ break;
+ default:
+ fprintf(stderr, help_fmt, basename(argv[0]));
+ return opt != 'h';
+ }
+ }
+
+ assert(!scx_example_central__load(skel));
+
+ link = bpf_map__attach_struct_ops(skel->maps.central_ops);
+ assert(link);
+
+ while (!exit_req && !uei_exited(&skel->bss->uei)) {
+ printf("[SEQ %lu]\n", seq++);
+ printf("total:%10lu local:%10lu queued:%10lu lost:%10lu\n",
+ skel->bss->nr_total,
+ skel->bss->nr_locals,
+ skel->bss->nr_queued,
+ skel->bss->nr_lost_pids);
+ printf(" dispatch:%10lu mismatch:%10lu overflow:%10lu\n",
+ skel->bss->nr_dispatches,
+ skel->bss->nr_mismatches,
+ skel->bss->nr_overflows);
+ fflush(stdout);
+ sleep(1);
+ }
+
+ bpf_link__destroy(link);
+ uei_print(&skel->bss->uei);
+ scx_example_central__destroy(skel);
+ return 0;
+}
--
2.38.1
sched_move_task() can be called for both cgroup and autogroup moves. Add a
parameter to distinguish the two cases. This will be used by a new
sched_class to track cgroup migrations.
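As a rough illustration of how a sched_class might consume the new
argument, the userspace sketch below counts cgroup migrations and ignores
autogroup moves. The enum mirrors the one added by this patch; struct
move_stats and track_move() are hypothetical names used only for this
example and are not part of the series.

```c
#include <assert.h>

/* Mirrors the enum added to kernel/sched/sched.h by this patch. */
enum sched_move_task_reason {
	SCHED_MOVE_TASK_CGROUP,
	SCHED_MOVE_TASK_AUTOGROUP,
};

/* Hypothetical bookkeeping, for illustration only. */
struct move_stats {
	unsigned long cgroup_moves;
	unsigned long ignored_moves;
};

/*
 * Hypothetical hook: only cgroup moves are treated as migrations worth
 * tracking; autogroup moves are merely counted as ignored.
 */
void track_move(struct move_stats *st, enum sched_move_task_reason reason)
{
	assert(st);

	if (reason == SCHED_MOVE_TASK_CGROUP)
		st->cgroup_moves++;
	else
		st->ignored_moves++;
}
```

Without the reason argument, such a hook would have no way to tell the
two callers apart.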
Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
kernel/sched/autogroup.c | 4 ++--
kernel/sched/core.c | 4 ++--
kernel/sched/sched.h | 7 ++++++-
3 files changed, 10 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/autogroup.c b/kernel/sched/autogroup.c
index 991fc9002535..2be1b10ce93e 100644
--- a/kernel/sched/autogroup.c
+++ b/kernel/sched/autogroup.c
@@ -151,7 +151,7 @@ void sched_autogroup_exit_task(struct task_struct *p)
* see this thread after that: we can no longer use signal->autogroup.
* See the PF_EXITING check in task_wants_autogroup().
*/
- sched_move_task(p);
+ sched_move_task(p, SCHED_MOVE_TASK_AUTOGROUP);
}
static void
@@ -183,7 +183,7 @@ autogroup_move_group(struct task_struct *p, struct autogroup *ag)
* sched_autogroup_exit_task().
*/
for_each_thread(p, t)
- sched_move_task(t);
+ sched_move_task(t, SCHED_MOVE_TASK_AUTOGROUP);
unlock_task_sighand(p, &flags);
autogroup_kref_put(prev);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0699b49b1a21..9c5bfeeb30ba 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10210,7 +10210,7 @@ static void sched_change_group(struct task_struct *tsk)
* now. This function just updates tsk->se.cfs_rq and tsk->se.parent to reflect
* its new group.
*/
-void sched_move_task(struct task_struct *tsk)
+void sched_move_task(struct task_struct *tsk, enum sched_move_task_reason reason)
{
int queued, running, queue_flags =
DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
@@ -10321,7 +10321,7 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
struct cgroup_subsys_state *css;
cgroup_taskset_for_each(task, css, tset)
- sched_move_task(task);
+ sched_move_task(task, SCHED_MOVE_TASK_CGROUP);
}
#ifdef CONFIG_UCLAMP_TASK_GROUP
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3c6ea8296ae4..ef8da88e677c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -506,7 +506,12 @@ extern void sched_online_group(struct task_group *tg,
extern void sched_destroy_group(struct task_group *tg);
extern void sched_release_group(struct task_group *tg);
-extern void sched_move_task(struct task_struct *tsk);
+enum sched_move_task_reason {
+ SCHED_MOVE_TASK_CGROUP,
+ SCHED_MOVE_TASK_AUTOGROUP,
+};
+extern void sched_move_task(struct task_struct *tsk,
+ enum sched_move_task_reason reason);
#ifdef CONFIG_FAIR_GROUP_SCHED
extern int sched_group_set_shares(struct task_group *tg, unsigned long shares);
--
2.38.1
Add normal_policy(), which wraps the test for %SCHED_NORMAL. No behavior
change. This will be used by a new sched_class to expand what's considered
a normal policy.
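The following userspace sketch shows why the wrapper is convenient: once
every caller goes through normal_policy(), widening the definition is a
one-line change. The DEMO_* constants are stand-ins (the first two mirror
the usual Linux policy numbers), and DEMO_SCHED_EXTRA and
normal_policy_expanded() are hypothetical and only illustrate what an
expanded test might look like.

```c
/* Stand-in policy numbers; DEMO_SCHED_EXTRA is purely hypothetical. */
enum {
	DEMO_SCHED_NORMAL = 0,
	DEMO_SCHED_BATCH  = 3,
	DEMO_SCHED_EXTRA  = 100,
};

/* Same shape as the helper added by this patch. */
int normal_policy(int policy)
{
	return policy == DEMO_SCHED_NORMAL;
}

/* fair_policy() expressed on top of normal_policy(), as in the patch. */
int fair_policy(int policy)
{
	return normal_policy(policy) || policy == DEMO_SCHED_BATCH;
}

/* Hypothetical widened variant that also accepts the extra policy. */
int normal_policy_expanded(int policy)
{
	return normal_policy(policy) || policy == DEMO_SCHED_EXTRA;
}
```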
Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
kernel/sched/fair.c | 2 +-
kernel/sched/sched.h | 8 +++++++-
2 files changed, 8 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6cad7d07186b..1d71e74cb0ff 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7441,7 +7441,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
* Batch and idle tasks do not preempt non-idle tasks (their preemption
* is driven by the tick):
*/
- if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
+ if (unlikely(!normal_policy(p->policy)) || !sched_feat(WAKEUP_PREEMPTION))
return;
find_matching_se(&se, &pse);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ef8da88e677c..0741827e3541 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -182,9 +182,15 @@ static inline int idle_policy(int policy)
{
return policy == SCHED_IDLE;
}
+
+static inline int normal_policy(int policy)
+{
+ return policy == SCHED_NORMAL;
+}
+
static inline int fair_policy(int policy)
{
- return policy == SCHED_NORMAL || policy == SCHED_BATCH;
+ return normal_policy(policy) || policy == SCHED_BATCH;
}
static inline int rt_policy(int policy)
--
2.38.1
->rq_{on|off}line are called either during CPU hotplug or cpuset partition
updates. Let's add an argument to distinguish the two cases. The argument
will be used by a new sched_class to track CPU hotplug events.
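A minimal userspace sketch of how a callback could use the argument is
shown below. The enum mirrors the one added by this patch; struct demo_rq
and the demo_rq_online()/demo_rq_offline() functions are hypothetical and
only model a class that cares about hotplug but not about topology
updates.

```c
#include <assert.h>

/* Mirrors the enum added to kernel/sched/sched.h by this patch. */
enum rq_onoff_reason {
	RQ_ONOFF_HOTPLUG,	/* CPU is going on/offline */
	RQ_ONOFF_TOPOLOGY,	/* sched domain topology update */
};

/* Hypothetical stand-in for a runqueue; not the kernel's struct rq. */
struct demo_rq {
	int cpu;
	int hotplug_online;	/* updated only on hotplug events */
	unsigned int nr_topology_events;
};

/* Hypothetical callback: record hotplug, just count topology updates. */
void demo_rq_online(struct demo_rq *rq, enum rq_onoff_reason reason)
{
	assert(rq);

	if (reason == RQ_ONOFF_HOTPLUG)
		rq->hotplug_online = 1;
	else
		rq->nr_topology_events++;
}

void demo_rq_offline(struct demo_rq *rq, enum rq_onoff_reason reason)
{
	assert(rq);

	if (reason == RQ_ONOFF_HOTPLUG)
		rq->hotplug_online = 0;
	else
		rq->nr_topology_events++;
}
```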
Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
kernel/sched/core.c | 12 ++++++------
kernel/sched/deadline.c | 4 ++--
kernel/sched/fair.c | 4 ++--
kernel/sched/rt.c | 4 ++--
kernel/sched/sched.h | 13 +++++++++----
kernel/sched/topology.c | 4 ++--
6 files changed, 23 insertions(+), 18 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 735e94bc7dbb..0699b49b1a21 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9286,7 +9286,7 @@ static inline void balance_hotplug_wait(void)
#endif /* CONFIG_HOTPLUG_CPU */
-void set_rq_online(struct rq *rq)
+void set_rq_online(struct rq *rq, enum rq_onoff_reason reason)
{
if (!rq->online) {
const struct sched_class *class;
@@ -9296,19 +9296,19 @@ void set_rq_online(struct rq *rq)
for_each_class(class) {
if (class->rq_online)
- class->rq_online(rq);
+ class->rq_online(rq, reason);
}
}
}
-void set_rq_offline(struct rq *rq)
+void set_rq_offline(struct rq *rq, enum rq_onoff_reason reason)
{
if (rq->online) {
const struct sched_class *class;
for_each_class(class) {
if (class->rq_offline)
- class->rq_offline(rq);
+ class->rq_offline(rq, reason);
}
cpumask_clear_cpu(rq->cpu, rq->rd->online);
@@ -9404,7 +9404,7 @@ int sched_cpu_activate(unsigned int cpu)
rq_lock_irqsave(rq, &rf);
if (rq->rd) {
BUG_ON(!cpumask_test_cpu(cpu, rq->rd->span));
- set_rq_online(rq);
+ set_rq_online(rq, RQ_ONOFF_HOTPLUG);
}
rq_unlock_irqrestore(rq, &rf);
@@ -9449,7 +9449,7 @@ int sched_cpu_deactivate(unsigned int cpu)
if (rq->rd) {
update_rq_clock(rq);
BUG_ON(!cpumask_test_cpu(cpu, rq->rd->span));
- set_rq_offline(rq);
+ set_rq_offline(rq, RQ_ONOFF_HOTPLUG);
}
rq_unlock_irqrestore(rq, &rf);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 9ae8f41e3372..f63e5d0c5fb1 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2519,7 +2519,7 @@ static void set_cpus_allowed_dl(struct task_struct *p,
}
/* Assumes rq->lock is held */
-static void rq_online_dl(struct rq *rq)
+static void rq_online_dl(struct rq *rq, enum rq_onoff_reason reason)
{
if (rq->dl.overloaded)
dl_set_overload(rq);
@@ -2530,7 +2530,7 @@ static void rq_online_dl(struct rq *rq)
}
/* Assumes rq->lock is held */
-static void rq_offline_dl(struct rq *rq)
+static void rq_offline_dl(struct rq *rq, enum rq_onoff_reason reason)
{
if (rq->dl.overloaded)
dl_clear_overload(rq);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 78263cef1ea8..6cad7d07186b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11436,14 +11436,14 @@ void trigger_load_balance(struct rq *rq)
nohz_balancer_kick(rq);
}
-static void rq_online_fair(struct rq *rq)
+static void rq_online_fair(struct rq *rq, enum rq_onoff_reason reason)
{
update_sysctl();
update_runtime_enabled(rq);
}
-static void rq_offline_fair(struct rq *rq)
+static void rq_offline_fair(struct rq *rq, enum rq_onoff_reason reason)
{
update_sysctl();
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index ed2a47e4ddae..0fb7ee087669 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2470,7 +2470,7 @@ static void task_woken_rt(struct rq *rq, struct task_struct *p)
}
/* Assumes rq->lock is held */
-static void rq_online_rt(struct rq *rq)
+static void rq_online_rt(struct rq *rq, enum rq_onoff_reason reason)
{
if (rq->rt.overloaded)
rt_set_overload(rq);
@@ -2481,7 +2481,7 @@ static void rq_online_rt(struct rq *rq)
}
/* Assumes rq->lock is held */
-static void rq_offline_rt(struct rq *rq)
+static void rq_offline_rt(struct rq *rq, enum rq_onoff_reason reason)
{
if (rq->rt.overloaded)
rt_clear_overload(rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 157eeabca5db..3c6ea8296ae4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2169,6 +2169,11 @@ extern const u32 sched_prio_to_wmult[40];
#define RETRY_TASK ((void *)-1UL)
+enum rq_onoff_reason {
+ RQ_ONOFF_HOTPLUG, /* CPU is going on/offline */
+ RQ_ONOFF_TOPOLOGY, /* sched domain topology update */
+};
+
struct sched_class {
#ifdef CONFIG_UCLAMP_TASK
@@ -2201,8 +2206,8 @@ struct sched_class {
const struct cpumask *newmask,
u32 flags);
- void (*rq_online)(struct rq *rq);
- void (*rq_offline)(struct rq *rq);
+ void (*rq_online)(struct rq *rq, enum rq_onoff_reason reason);
+ void (*rq_offline)(struct rq *rq, enum rq_onoff_reason reason);
struct rq *(*find_lock_rq)(struct task_struct *p, struct rq *rq);
#endif
@@ -2726,8 +2731,8 @@ static inline void double_rq_unlock(struct rq *rq1, struct rq *rq2)
raw_spin_rq_unlock(rq1);
}
-extern void set_rq_online (struct rq *rq);
-extern void set_rq_offline(struct rq *rq);
+extern void set_rq_online (struct rq *rq, enum rq_onoff_reason reason);
+extern void set_rq_offline(struct rq *rq, enum rq_onoff_reason reason);
extern bool sched_smp_initialized;
#else /* CONFIG_SMP */
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8739c2a5a54e..0e859bea1cb6 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -493,7 +493,7 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
old_rd = rq->rd;
if (cpumask_test_cpu(rq->cpu, old_rd->online))
- set_rq_offline(rq);
+ set_rq_offline(rq, RQ_ONOFF_TOPOLOGY);
cpumask_clear_cpu(rq->cpu, old_rd->span);
@@ -511,7 +511,7 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
cpumask_set_cpu(rq->cpu, rd->span);
if (cpumask_test_cpu(rq->cpu, cpu_active_mask))
- set_rq_online(rq);
+ set_rq_online(rq, RQ_ONOFF_TOPOLOGY);
raw_spin_rq_unlock_irqrestore(rq, flags);
--
2.38.1
From: Dan Schatzberg <[email protected]>
Atropos is a multi-domain BPF / userspace hybrid scheduler where the BPF
part does simple round robin in each domain and the userspace part
calculates the load factor of each domain and tells the BPF part how to load
balance the domains.
This scheduler demonstrates dividing scheduling logic between BPF and
userspace and using Rust to build the userspace part. An earlier variant of
this scheduler was used to balance across six domains, each representing a
chiplet in a six-chiplet AMD processor, and could match the performance of
a production setup using CFS.
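To give a feel for the kind of decision the userspace part has to make,
here is a deliberately simplified sketch in C. It is not the Atropos
algorithm (the real userspace part is written in Rust and is more
involved); busiest_domain() is a hypothetical helper that just picks the
domain furthest above the mean load as a candidate to move tasks away
from.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not taken from Atropos: given per-domain loads,
 * return the index of the domain that exceeds the mean load by the
 * largest amount, or -1 if no domain is above the mean.
 */
int busiest_domain(const double *load, size_t nr_doms)
{
	double sum = 0.0, mean, worst = 0.0;
	int busiest = -1;
	size_t i;

	if (!load || !nr_doms)
		return -1;

	for (i = 0; i < nr_doms; i++)
		sum += load[i];
	mean = sum / (double)nr_doms;

	for (i = 0; i < nr_doms; i++) {
		double excess = load[i] - mean;

		/* Track the largest strictly positive excess over the mean. */
		if (excess > worst) {
			worst = excess;
			busiest = (int)i;
		}
	}

	return busiest;
}
```

In a hybrid design like the one described here, a result of this kind
would be passed to the BPF side through a map rather than acted on
directly in userspace.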
Signed-off-by: Dan Schatzberg <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
---
tools/sched_ext/Makefile | 13 +-
tools/sched_ext/atropos/.gitignore | 3 +
tools/sched_ext/atropos/Cargo.toml | 34 +
tools/sched_ext/atropos/build.rs | 70 ++
tools/sched_ext/atropos/rustfmt.toml | 8 +
tools/sched_ext/atropos/src/bpf/atropos.bpf.c | 501 ++++++++++++++
tools/sched_ext/atropos/src/bpf/atropos.h | 38 +
tools/sched_ext/atropos/src/main.rs | 648 ++++++++++++++++++
.../sched_ext/atropos/src/oss/atropos_sys.rs | 10 +
tools/sched_ext/atropos/src/oss/mod.rs | 29 +
tools/sched_ext/atropos/src/util.rs | 24 +
11 files changed, 1376 insertions(+), 2 deletions(-)
create mode 100644 tools/sched_ext/atropos/.gitignore
create mode 100644 tools/sched_ext/atropos/Cargo.toml
create mode 100644 tools/sched_ext/atropos/build.rs
create mode 100644 tools/sched_ext/atropos/rustfmt.toml
create mode 100644 tools/sched_ext/atropos/src/bpf/atropos.bpf.c
create mode 100644 tools/sched_ext/atropos/src/bpf/atropos.h
create mode 100644 tools/sched_ext/atropos/src/main.rs
create mode 100644 tools/sched_ext/atropos/src/oss/atropos_sys.rs
create mode 100644 tools/sched_ext/atropos/src/oss/mod.rs
create mode 100644 tools/sched_ext/atropos/src/util.rs
diff --git a/tools/sched_ext/Makefile b/tools/sched_ext/Makefile
index 563b54333ab1..c8802cefb98a 100644
--- a/tools/sched_ext/Makefile
+++ b/tools/sched_ext/Makefile
@@ -84,6 +84,8 @@ CFLAGS += -g -O2 -rdynamic -pthread -Wall -Werror $(GENFLAGS) \
-I$(INCLUDE_DIR) -I$(GENDIR) -I$(LIBDIR) \
-I$(TOOLSINCDIR) -I$(APIDIR)
+CARGOFLAGS := --release
+
# Silence some warnings when compiled with clang
ifneq ($(LLVM),)
CFLAGS += -Wno-unused-command-line-argument
@@ -115,7 +117,7 @@ BPF_CFLAGS = -g -D__TARGET_ARCH_$(SRCARCH) \
-O2 -mcpu=v3
all: scx_example_dummy scx_example_qmap scx_example_central scx_example_pair \
- scx_example_userland
+ scx_example_userland atropos
# sort removes libbpf duplicates when not cross-building
MAKE_DIRS := $(sort $(BUILD_DIR)/libbpf $(HOST_BUILD_DIR)/libbpf \
@@ -187,13 +189,20 @@ scx_example_userland: scx_example_userland.c scx_example_userland.skel.h \
$(CC) $(CFLAGS) -c $< -o $@.o
$(CC) -o $@ $@.o $(HOST_BPFOBJ) $(LDFLAGS)
+atropos: export RUSTFLAGS = -C link-args=-lzstd
+atropos: export ATROPOS_CLANG = $(CLANG)
+atropos: export ATROPOS_BPF_CFLAGS = $(BPF_CFLAGS)
+atropos: $(INCLUDE_DIR)/vmlinux.h
+ cargo build --manifest-path=atropos/Cargo.toml $(CARGOFLAGS)
+
clean:
+ cargo clean --manifest-path=atropos/Cargo.toml
rm -rf $(SCRATCH_DIR) $(HOST_SCRATCH_DIR)
rm -f *.o *.bpf.o *.skel.h *.subskel.h
rm -f scx_example_dummy scx_example_qmap scx_example_central \
scx_example_pair scx_example_userland
-.PHONY: all clean
+.PHONY: all atropos clean
# delete failed targets
.DELETE_ON_ERROR:
diff --git a/tools/sched_ext/atropos/.gitignore b/tools/sched_ext/atropos/.gitignore
new file mode 100644
index 000000000000..186dba259ec2
--- /dev/null
+++ b/tools/sched_ext/atropos/.gitignore
@@ -0,0 +1,3 @@
+src/bpf/.output
+Cargo.lock
+target
diff --git a/tools/sched_ext/atropos/Cargo.toml b/tools/sched_ext/atropos/Cargo.toml
new file mode 100644
index 000000000000..67e543c457c4
--- /dev/null
+++ b/tools/sched_ext/atropos/Cargo.toml
@@ -0,0 +1,34 @@
+# @generated by autocargo
+
+[package]
+name = "atropos-bin"
+version = "0.5.0"
+authors = ["Dan Schatzberg <[email protected]>", "Meta"]
+edition = "2021"
+description = "Userspace scheduling with BPF"
+readme = "../README.md"
+repository = "https://github.com/facebookincubator/atropos"
+license = "Apache-2.0"
+
+[dependencies]
+anyhow = "1.0.65"
+bitvec = { version = "1.0", features = ["serde"] }
+clap = { version = "3.2.17", features = ["derive", "env", "regex", "unicode", "wrap_help"] }
+ctrlc = { version = "3.1", features = ["termination"] }
+fb_procfs = { git = "https://github.com/facebookincubator/below.git", rev = "f305730"}
+hex = "0.4.3"
+libbpf-rs = "0.19.1"
+libc = "0.2.137"
+slog = { version = "2.7", features = ["max_level_trace", "nested-values"] }
+slog-async = { version = "2.3", features = ["nested-values"] }
+slog-term = "2.8"
+
+[build-dependencies]
+bindgen = { version = "0.61.0", features = ["logging", "static"], default-features = false }
+libbpf-cargo = "0.13.0"
+
+[features]
+enable_backtrace = []
+
+[patch.crates-io]
+libbpf-sys = { git = "https://github.com/dschatzberg/libbpf-sys", rev = "0ef4011"}
diff --git a/tools/sched_ext/atropos/build.rs b/tools/sched_ext/atropos/build.rs
new file mode 100644
index 000000000000..26e792c5e17e
--- /dev/null
+++ b/tools/sched_ext/atropos/build.rs
@@ -0,0 +1,70 @@
+// Copyright (c) Meta Platforms, Inc. and affiliates.
+
+// This software may be used and distributed according to the terms of the
+// GNU General Public License version 2.
+extern crate bindgen;
+
+use std::env;
+use std::fs::create_dir_all;
+use std::path::Path;
+use std::path::PathBuf;
+
+use libbpf_cargo::SkeletonBuilder;
+
+const HEADER_PATH: &str = "src/bpf/atropos.h";
+
+fn bindgen_atropos() {
+ // Tell cargo to invalidate the built crate whenever the wrapper changes
+ println!("cargo:rerun-if-changed={}", HEADER_PATH);
+
+ // The bindgen::Builder is the main entry point
+ // to bindgen, and lets you build up options for
+ // the resulting bindings.
+ let bindings = bindgen::Builder::default()
+ // The input header we would like to generate
+ // bindings for.
+ .header(HEADER_PATH)
+ // Tell cargo to invalidate the built crate whenever any of the
+ // included header files changed.
+ .parse_callbacks(Box::new(bindgen::CargoCallbacks))
+ // Finish the builder and generate the bindings.
+ .generate()
+ // Unwrap the Result and panic on failure.
+ .expect("Unable to generate bindings");
+
+ // Write the bindings to the $OUT_DIR/atropos-sys.rs file.
+ let out_path = PathBuf::from(env::var("OUT_DIR").unwrap());
+ bindings
+ .write_to_file(out_path.join("atropos-sys.rs"))
+ .expect("Couldn't write bindings!");
+}
+
+fn gen_bpf_sched(name: &str) {
+ let bpf_cflags = env::var("ATROPOS_BPF_CFLAGS").unwrap();
+ let clang = env::var("ATROPOS_CLANG").unwrap();
+ eprintln!("{}", clang);
+ let outpath = format!("./src/bpf/.output/{}.skel.rs", name);
+ let skel = Path::new(&outpath);
+ let src = format!("./src/bpf/{}.bpf.c", name);
+ SkeletonBuilder::new()
+ .source(src.clone())
+ .clang(clang)
+ .clang_args(bpf_cflags)
+ .build_and_generate(&skel)
+ .unwrap();
+ println!("cargo:rerun-if-changed={}", src);
+}
+
+fn main() {
+ bindgen_atropos();
+ // It's unfortunate we cannot use `OUT_DIR` to store the generated skeleton.
+ // The generated skeleton contains compiler attributes that cannot be
+ // `include!()`ed via macro, and we cannot use the `#[path = "..."]` trick
+ // because you cannot yet `concat!(env!("OUT_DIR"), "/skel.rs")` inside the
+ // path attribute (see https://github.com/rust-lang/rust/pull/83366).
+ //
+ // However, there is hope! When the above feature stabilizes we can clean this
+ // all up.
+ create_dir_all("./src/bpf/.output").unwrap();
+ gen_bpf_sched("atropos");
+}
diff --git a/tools/sched_ext/atropos/rustfmt.toml b/tools/sched_ext/atropos/rustfmt.toml
new file mode 100644
index 000000000000..b7258ed0a8d8
--- /dev/null
+++ b/tools/sched_ext/atropos/rustfmt.toml
@@ -0,0 +1,8 @@
+# Get help on options with `rustfmt --help=config`
+# Please keep these in alphabetical order.
+edition = "2021"
+group_imports = "StdExternalCrate"
+imports_granularity = "Item"
+merge_derives = false
+use_field_init_shorthand = true
+version = "Two"
diff --git a/tools/sched_ext/atropos/src/bpf/atropos.bpf.c b/tools/sched_ext/atropos/src/bpf/atropos.bpf.c
new file mode 100644
index 000000000000..9e5164131de5
--- /dev/null
+++ b/tools/sched_ext/atropos/src/bpf/atropos.bpf.c
@@ -0,0 +1,501 @@
+// Copyright (c) Meta Platforms, Inc. and affiliates.
+
+// This software may be used and distributed according to the terms of the
+// GNU General Public License version 2.
+//
+// Atropos is a multi-domain BPF / userspace hybrid scheduler where the BPF
+// part does simple round robin in each domain and the userspace part
+// calculates the load factor of each domain and tells the BPF part how to load
+// balance the domains.
+//
+// Every task has an entry in the task_data map which lists which domain the
+// task belongs to. When a task first enters the system (atropos_enable), it
+// is round-robined to a domain.
+//
+// atropos_select_cpu is the primary scheduling logic, invoked when a task
+// becomes runnable. The lb_data map is populated by userspace to inform the BPF
+// scheduler that a task should be migrated to a new domain. Otherwise, the task
+// is scheduled in priority order as follows:
+// * The current core if the task was woken up synchronously and there are idle
+// cpus in the task's domain
+// * The previous core, if idle
+// * The pinned-to core if the task is pinned to a specific core
+// * Any idle cpu in the domain
+//
+// If none of the above conditions are met, then the task is enqueued to a
+// dispatch queue corresponding to the domain (atropos_enqueue).
+//
+// atropos_consume will attempt to consume a task from its domain's
+// corresponding dispatch queue (this occurs after scheduling any tasks directly
+// assigned to it due to the logic in atropos_select_cpu). If no task is found,
+// then greedy load stealing will attempt to find a task on another dispatch
+// queue to run.
+//
+// Load balancing is almost entirely handled by userspace. BPF populates the
+// task weight, dom mask and current dom in the task_data map and executes the
+// load balance based on userspace populating the lb_data map.
+#include "../../../scx_common.bpf.h"
+#include "atropos.h"
+
+#include <errno.h>
+#include <stdbool.h>
+#include <string.h>
+#include <bpf/bpf_core_read.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+
+/*
+ * const volatiles are set during initialization and treated as consts by the
+ * jit compiler.
+ */
+
+/*
+ * Domains and cpus
+ */
+const volatile __u32 nr_doms;
+const volatile __u32 nr_cpus;
+const volatile __u32 cpu_dom_id_map[MAX_CPUS];
+const volatile __u64 dom_cpumasks[MAX_DOMS][MAX_CPUS / 64];
+
+const volatile bool switch_all;
+const volatile __u64 greedy_threshold = (u64)-1;
+
+/* base slice duration */
+const volatile __u64 slice_us = 20000;
+
+/*
+ * Exit info
+ */
+int exit_type = SCX_EXIT_NONE;
+char exit_msg[SCX_EXIT_MSG_LEN];
+
+struct pcpu_ctx {
+ __u32 dom_rr_cur; /* used when scanning other doms */
+
+ /* libbpf-rs does not respect the alignment, so pad out the struct explicitly */
+ __u8 _padding[CACHELINE_SIZE - sizeof(u64)];
+} __attribute__((aligned(CACHELINE_SIZE)));
+
+struct pcpu_ctx pcpu_ctx[MAX_CPUS];
+
+/*
+ * Statistics
+ */
+struct {
+ __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+ __uint(key_size, sizeof(u32));
+ __uint(value_size, sizeof(u64));
+ __uint(max_entries, ATROPOS_NR_STATS);
+} stats SEC(".maps");
+
+static inline void stat_add(enum stat_idx idx, u64 addend)
+{
+ u32 idx_v = idx;
+
+ u64 *cnt_p = bpf_map_lookup_elem(&stats, &idx_v);
+ if (cnt_p)
+ (*cnt_p) += addend;
+}
+
+// Map pid -> task_ctx
+struct {
+ __uint(type, BPF_MAP_TYPE_HASH);
+ __type(key, pid_t);
+ __type(value, struct task_ctx);
+ __uint(max_entries, 1000000);
+ __uint(map_flags, 0);
+} task_data SEC(".maps");
+
+// This is populated from userspace to indicate which pids should be reassigned
+// to new doms
+struct {
+ __uint(type, BPF_MAP_TYPE_HASH);
+ __type(key, pid_t);
+ __type(value, u32);
+ __uint(max_entries, 1000);
+ __uint(map_flags, 0);
+} lb_data SEC(".maps");
+
+struct refresh_task_cpumask_loop_ctx {
+ struct task_struct *p;
+ struct task_ctx *ctx;
+};
+
+static int refresh_task_cpumask(u32 cpu, void *data)
+{
+ struct refresh_task_cpumask_loop_ctx *c = data;
+ struct task_struct *p = c->p;
+ struct task_ctx *ctx = c->ctx;
+ u64 mask = 1LLU << (cpu % 64);
+ const volatile __u64 *dptr;
+
+ dptr = MEMBER_VPTR(dom_cpumasks, [ctx->dom_id][cpu / 64]);
+ if (!dptr)
+ return 1;
+
+ if ((*dptr & mask) &&
+ scx_bpf_cpumask_test_cpu(cpu, p->cpus_ptr)) {
+ u64 *cptr = MEMBER_VPTR(ctx->cpumask, [cpu / 64]);
+ if (!cptr)
+ return 1;
+ *cptr |= mask;
+ } else {
+ u64 *cptr = MEMBER_VPTR(ctx->cpumask, [cpu / 64]);
+ if (!cptr)
+ return 1;
+ *cptr &= ~mask;
+ }
+
+ return 0;
+}
+
+static void task_set_dq(struct task_ctx *task_ctx, struct task_struct *p,
+ u32 dom_id)
+{
+ struct refresh_task_cpumask_loop_ctx lctx = {
+ .p = p,
+ .ctx = task_ctx,
+ };
+
+ task_ctx->dom_id = dom_id;
+ bpf_loop(nr_cpus, refresh_task_cpumask, &lctx, 0);
+}
+
+s32 BPF_STRUCT_OPS(atropos_select_cpu, struct task_struct *p, int prev_cpu,
+ u32 wake_flags)
+{
+ s32 cpu;
+
+ pid_t pid = p->pid;
+ struct task_ctx *task_ctx = bpf_map_lookup_elem(&task_data, &pid);
+ if (!task_ctx) {
+ stat_add(ATROPOS_STAT_TASK_GET_ERR, 1);
+ return prev_cpu;
+ }
+
+ bool load_balanced = false;
+ u32 *new_dom = bpf_map_lookup_elem(&lb_data, &pid);
+ if (new_dom && *new_dom != task_ctx->dom_id) {
+ task_set_dq(task_ctx, p, *new_dom);
+ stat_add(ATROPOS_STAT_LOAD_BALANCE, 1);
+ load_balanced = true;
+ }
+
+ /*
+ * If WAKE_SYNC and the machine isn't fully saturated, wake up @p to the
+ * local dq of the waker.
+ */
+ if (p->nr_cpus_allowed > 1 && (wake_flags & SCX_WAKE_SYNC)) {
+ struct task_struct *current = (void *)bpf_get_current_task();
+
+ if (!(BPF_CORE_READ(current, flags) & PF_EXITING) &&
+ task_ctx->dom_id < MAX_DOMS &&
+ scx_bpf_has_idle_cpus_among_untyped(
+ (unsigned long)dom_cpumasks[task_ctx->dom_id])) {
+ cpu = bpf_get_smp_processor_id();
+ if (scx_bpf_cpumask_test_cpu(cpu,
+ p->cpus_ptr)) {
+ stat_add(ATROPOS_STAT_WAKE_SYNC, 1);
+ goto local;
+ }
+ }
+ }
+
+ /* if the previous CPU is idle, dispatch directly to it */
+ if (!load_balanced) {
+ u8 prev_idle = scx_bpf_test_and_clear_cpu_idle(prev_cpu);
+ if (*(volatile u8 *)&prev_idle) {
+ stat_add(ATROPOS_STAT_PREV_IDLE, 1);
+ cpu = prev_cpu;
+ goto local;
+ }
+ }
+
+ /* If only one core is allowed, dispatch */
+ if (p->nr_cpus_allowed == 1) {
+ cpu = scx_bpf_cpumask_first_untyped(
+ (unsigned long)task_ctx->cpumask);
+ stat_add(ATROPOS_STAT_PINNED, 1);
+ goto local;
+ }
+
+ /* Find an idle cpu and just dispatch */
+ cpu = scx_bpf_pick_idle_cpu_untyped((unsigned long)task_ctx->cpumask);
+ if (cpu >= 0) {
+ stat_add(ATROPOS_STAT_DIRECT_DISPATCH, 1);
+ goto local;
+ }
+
+ return prev_cpu;
+
+local:
+ task_ctx->dispatch_local = true;
+ return cpu;
+}
+
+void BPF_STRUCT_OPS(atropos_enqueue, struct task_struct *p, u32 enq_flags)
+{
+ p->scx.slice = slice_us * 1000;
+
+ pid_t pid = p->pid;
+ struct task_ctx *task_ctx = bpf_map_lookup_elem(&task_data, &pid);
+ if (!task_ctx) {
+ stat_add(ATROPOS_STAT_TASK_GET_ERR, 1);
+ scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
+ return;
+ }
+
+ if (task_ctx->dispatch_local) {
+ task_ctx->dispatch_local = false;
+ scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, enq_flags);
+ return;
+ }
+
+ scx_bpf_dispatch(p, task_ctx->dom_id, SCX_SLICE_DFL, enq_flags);
+}
+
+static u32 cpu_to_dom_id(s32 cpu)
+{
+ if (nr_doms <= 1)
+ return 0;
+
+ if (cpu >= 0 && cpu < MAX_CPUS) {
+ u32 dom_id;
+
+ /*
+ * XXX - idk why the verifier thinks cpu_dom_id_map[cpu] is not
+ * safe here.
+ */
+ bpf_probe_read_kernel(&dom_id, sizeof(dom_id),
+ (const void *)&cpu_dom_id_map[cpu]);
+ return dom_id;
+ } else {
+ return MAX_DOMS;
+ }
+}
+
+static bool is_cpu_in_dom(u32 cpu, u32 dom_id)
+{
+ u64 mask = 0;
+
+ /*
+ * XXX - derefing two dimensional array triggers the verifier, use
+ * probe_read instead.
+ */
+ bpf_probe_read_kernel(&mask, sizeof(mask),
+ (const void *)&dom_cpumasks[dom_id][cpu / 64]);
+ return mask & (1LLU << (cpu % 64));
+}
+
+struct cpumask_intersects_domain_loop_ctx {
+ const struct cpumask *cpumask;
+ u32 dom_id;
+ bool ret;
+};
+
+static int cpumask_intersects_domain_loopfn(u32 idx, void *data)
+{
+ struct cpumask_intersects_domain_loop_ctx *lctx = data;
+
+ if (scx_bpf_cpumask_test_cpu(idx, lctx->cpumask) &&
+ is_cpu_in_dom(idx, lctx->dom_id)) {
+ lctx->ret = true;
+ return 1;
+ }
+ return 0;
+}
+
+static bool cpumask_intersects_domain(const struct cpumask *cpumask, u32 dom_id)
+{
+ struct cpumask_intersects_domain_loop_ctx lctx = {
+ .cpumask = cpumask,
+ .dom_id = dom_id,
+ .ret = false,
+ };
+
+ bpf_loop(nr_cpus, cpumask_intersects_domain_loopfn, &lctx, 0);
+ return lctx.ret;
+}
+
+static u32 dom_rr_next(s32 cpu)
+{
+ if (cpu >= 0 && cpu < MAX_CPUS) {
+ struct pcpu_ctx *pcpuc = &pcpu_ctx[cpu];
+ u32 dom_id = (pcpuc->dom_rr_cur + 1) % nr_doms;
+
+ if (dom_id == cpu_to_dom_id(cpu))
+ dom_id = (dom_id + 1) % nr_doms;
+
+ pcpuc->dom_rr_cur = dom_id;
+ return dom_id;
+ }
+ return 0;
+}
+
+static int greedy_loopfn(s32 idx, void *data)
+{
+ u32 dom_id = dom_rr_next(*(s32 *)data);
+
+ if (scx_bpf_dsq_nr_queued(dom_id) >= greedy_threshold &&
+ scx_bpf_consume(dom_id)) {
+ stat_add(ATROPOS_STAT_GREEDY, 1);
+ return 1;
+ }
+ return 0;
+}
+
+void BPF_STRUCT_OPS(atropos_consume, s32 cpu)
+{
+ u32 dom = cpu_to_dom_id(cpu);
+ if (scx_bpf_consume(dom)) {
+ stat_add(ATROPOS_STAT_DSQ_DISPATCH, 1);
+ return;
+ }
+
+ if (greedy_threshold != (u64)-1)
+ bpf_loop(nr_doms - 1, greedy_loopfn, &cpu, 0);
+}
+
+struct pick_task_domain_loop_ctx {
+ struct task_struct *p;
+ const struct cpumask *cpumask;
+ u64 dom_mask;
+ u32 dom_rr_base;
+ u32 dom_id;
+};
+
+static int pick_task_domain_loopfn(u32 idx, void *data)
+{
+ struct pick_task_domain_loop_ctx *lctx = data;
+ u32 dom_id = (lctx->dom_rr_base + idx) % nr_doms;
+
+ if (dom_id >= MAX_DOMS)
+ return 1;
+
+ if (cpumask_intersects_domain(lctx->cpumask, dom_id)) {
+ lctx->dom_mask |= 1LLU << dom_id;
+ if (lctx->dom_id == MAX_DOMS)
+ lctx->dom_id = dom_id;
+ }
+ return 0;
+}
+
+static u32 pick_task_domain(struct task_ctx *task_ctx, struct task_struct *p,
+ const struct cpumask *cpumask)
+{
+ struct pick_task_domain_loop_ctx lctx = {
+ .p = p,
+ .cpumask = cpumask,
+ .dom_id = MAX_DOMS,
+ };
+ s32 cpu = bpf_get_smp_processor_id();
+
+ if (cpu < 0 || cpu >= MAX_CPUS)
+ return MAX_DOMS;
+
+ lctx.dom_rr_base = ++(pcpu_ctx[cpu].dom_rr_cur);
+
+ bpf_loop(nr_doms, pick_task_domain_loopfn, &lctx, 0);
+ task_ctx->dom_mask = lctx.dom_mask;
+
+ return lctx.dom_id;
+}
+
+static void task_set_domain(struct task_ctx *task_ctx, struct task_struct *p,
+ const struct cpumask *cpumask)
+{
+ u32 dom_id = 0;
+
+ if (nr_doms > 1)
+ dom_id = pick_task_domain(task_ctx, p, cpumask);
+
+ task_set_dq(task_ctx, p, dom_id);
+}
+
+void BPF_STRUCT_OPS(atropos_set_cpumask, struct task_struct *p,
+ const struct cpumask *cpumask)
+{
+ pid_t pid = p->pid;
+ struct task_ctx *task_ctx = bpf_map_lookup_elem(&task_data, &pid);
+ if (!task_ctx) {
+ stat_add(ATROPOS_STAT_TASK_GET_ERR, 1);
+ return;
+ }
+
+ task_set_domain(task_ctx, p, cpumask);
+}
+
+void BPF_STRUCT_OPS(atropos_enable, struct task_struct *p)
+{
+ struct task_ctx task_ctx;
+ memset(&task_ctx, 0, sizeof(task_ctx));
+ task_ctx.weight = p->scx.weight;
+
+ task_set_domain(&task_ctx, p, p->cpus_ptr);
+ pid_t pid = p->pid;
+ long ret =
+ bpf_map_update_elem(&task_data, &pid, &task_ctx, BPF_NOEXIST);
+ if (ret) {
+ stat_add(ATROPOS_STAT_TASK_GET_ERR, 1);
+ return;
+ }
+}
+
+void BPF_STRUCT_OPS(atropos_disable, struct task_struct *p)
+{
+ pid_t pid = p->pid;
+ long ret = bpf_map_delete_elem(&task_data, &pid);
+ if (ret) {
+ stat_add(ATROPOS_STAT_TASK_GET_ERR, 1);
+ return;
+ }
+}
+
+int BPF_STRUCT_OPS(atropos_init)
+{
+ int ret;
+ u32 local_nr_doms = nr_doms;
+
+ bpf_printk("atropos init");
+
+ if (switch_all)
+ scx_bpf_switch_all();
+
+ // BPF verifier gets cranky if we don't bound this like so
+ if (local_nr_doms > MAX_DOMS)
+ local_nr_doms = MAX_DOMS;
+
+ for (u32 i = 0; i < local_nr_doms; ++i) {
+ ret = scx_bpf_create_dsq(i, -1);
+ if (ret < 0)
+ return ret;
+ }
+
+ for (u32 i = 0; i < nr_cpus; ++i) {
+ pcpu_ctx[i].dom_rr_cur = i;
+ }
+
+ return 0;
+}
+
+void BPF_STRUCT_OPS(atropos_exit, struct scx_exit_info *ei)
+{
+ bpf_probe_read_kernel_str(exit_msg, sizeof(exit_msg), ei->msg);
+ exit_type = ei->type;
+}
+
+SEC(".struct_ops")
+struct sched_ext_ops atropos = {
+ .select_cpu = (void *)atropos_select_cpu,
+ .enqueue = (void *)atropos_enqueue,
+ .consume = (void *)atropos_consume,
+ .set_cpumask = (void *)atropos_set_cpumask,
+ .enable = (void *)atropos_enable,
+ .disable = (void *)atropos_disable,
+ .init = (void *)atropos_init,
+ .exit = (void *)atropos_exit,
+ .flags = 0,
+ .name = "atropos",
+};
diff --git a/tools/sched_ext/atropos/src/bpf/atropos.h b/tools/sched_ext/atropos/src/bpf/atropos.h
new file mode 100644
index 000000000000..9a2bfee8b1d6
--- /dev/null
+++ b/tools/sched_ext/atropos/src/bpf/atropos.h
@@ -0,0 +1,38 @@
+// Copyright (c) Meta Platforms, Inc. and affiliates.
+
+// This software may be used and distributed according to the terms of the
+// GNU General Public License version 2.
+#ifndef __ATROPOS_H
+#define __ATROPOS_H
+
+#include <stdbool.h>
+
+#define MAX_CPUS 512
+#define MAX_DOMS 64 /* limited to avoid complex bitmask ops */
+#define CACHELINE_SIZE 64
+
+/* Statistics */
+enum stat_idx {
+ ATROPOS_STAT_TASK_GET_ERR,
+ ATROPOS_STAT_TASK_GET_ERR_ENABLE,
+ ATROPOS_STAT_CPUMASK_ERR,
+ ATROPOS_STAT_WAKE_SYNC,
+ ATROPOS_STAT_PREV_IDLE,
+ ATROPOS_STAT_PINNED,
+ ATROPOS_STAT_DIRECT_DISPATCH,
+ ATROPOS_STAT_DSQ_DISPATCH,
+ ATROPOS_STAT_GREEDY,
+ ATROPOS_STAT_LOAD_BALANCE,
+ ATROPOS_STAT_LAST_TASK,
+ ATROPOS_NR_STATS,
+};
+
+struct task_ctx {
+ unsigned long long dom_mask; /* the domains this task can run on */
+ unsigned long long cpumask[MAX_CPUS / 64];
+ unsigned int dom_id;
+ unsigned int weight;
+ bool dispatch_local;
+};
+
+#endif /* __ATROPOS_H */
diff --git a/tools/sched_ext/atropos/src/main.rs b/tools/sched_ext/atropos/src/main.rs
new file mode 100644
index 000000000000..b9ae312a562f
--- /dev/null
+++ b/tools/sched_ext/atropos/src/main.rs
@@ -0,0 +1,648 @@
+// Copyright (c) Meta Platforms, Inc. and affiliates.
+
+// This software may be used and distributed according to the terms of the
+// GNU General Public License version 2.
+#![deny(clippy::all)]
+use std::collections::BTreeMap;
+use std::ffi::CStr;
+use std::sync::atomic::AtomicBool;
+use std::sync::atomic::Ordering;
+use std::sync::Arc;
+
+use ::fb_procfs as procfs;
+use anyhow::anyhow;
+use anyhow::bail;
+use anyhow::Context;
+use bitvec::prelude::*;
+use clap::Parser;
+
+mod util;
+
+oss_shim!();
+
+/// Atropos is a multi-domain BPF / userspace hybrid scheduler where the BPF
+/// part does simple round robin in each domain and the userspace part
+/// calculates the load factor of each domain and tells the BPF part how to load
+/// balance the domains.
+
+/// This scheduler demonstrates dividing scheduling logic between BPF and
+/// userspace and using rust to build the userspace part. An earlier variant of
+/// this scheduler was used to balance across six domains, each representing a
+/// chiplet in a six-chiplet AMD processor, and could match the performance of
+/// a production setup using CFS.
+#[derive(Debug, Parser)]
+struct Opt {
+ /// Set the log level for more or less verbose output. --log-level=debug
+ /// will output libbpf verbose details
+ #[clap(short, long, default_value = "info")]
+ log_level: String,
+ /// Set the cpumask for a domain, provide multiple --cpumasks, one for each
+ /// domain. E.g. --cpumasks 0xff_00ff --cpumasks 0xff00 will create two
+ /// domains with the corresponding CPUs belonging to each domain. Each CPU
+ /// must belong to precisely one domain.
+ #[clap(short, long, required = true, min_values = 1)]
+ cpumasks: Vec<String>,
+ /// Switch all tasks to sched_ext. If not specified, only tasks which
+ /// have their scheduling policy set to SCHED_EXT using
+ /// sched_setscheduler(2) are switched.
+ #[clap(short, long, default_value = "false")]
+ all: bool,
+ /// Enable load balancing. Periodically userspace will calculate the load
+ /// factor of each domain and instruct BPF which processes to move.
+ #[clap(short, long, default_value = "true")]
+ load_balance: bool,
+ /// Enable greedy task stealing. When a domain is idle, a cpu will attempt
+ /// to steal tasks from a domain with at least greedy_threshold tasks
+ /// enqueued. These tasks aren't permanently stolen from the domain.
+ #[clap(short, long)]
+ greedy_threshold: Option<u64>,
+}
+
+type CpusetDqPair = (Vec<BitVec<u64, Lsb0>>, Vec<i32>);
+
+// Returns Vec of cpuset for each dq and a vec of dq for each cpu
+fn parse_cpusets(cpumasks: &[String]) -> anyhow::Result<CpusetDqPair> {
+ if cpumasks.len() > atropos_sys::MAX_DOMS as usize {
+ bail!(
+ "Number of requested DSQs ({}) is greater than MAX_DOMS ({})",
+ cpumasks.len(),
+ atropos_sys::MAX_DOMS
+ );
+ }
+ let num_cpus = libbpf_rs::num_possible_cpus()?;
+ if num_cpus > atropos_sys::MAX_CPUS as usize {
+ bail!(
+ "num_cpus ({}) is greater than MAX_CPUS ({})",
+ num_cpus,
+ atropos_sys::MAX_CPUS,
+ );
+ }
+ let mut cpus = vec![-1i32; num_cpus];
+ let mut cpusets = vec![bitvec![u64, Lsb0; 0; atropos_sys::MAX_CPUS as usize]; cpumasks.len()];
+ for (dq, cpumask) in cpumasks.iter().enumerate() {
+ let hex_str = {
+ let mut tmp_str = cpumask
+ .strip_prefix("0x")
+ .unwrap_or(cpumask)
+ .replace('_', "");
+ if tmp_str.len() % 2 != 0 {
+ tmp_str = "0".to_string() + &tmp_str;
+ }
+ tmp_str
+ };
+ let byte_vec = hex::decode(&hex_str)
+ .with_context(|| format!("Failed to parse cpumask: {}", cpumask))?;
+
+ for (index, &val) in byte_vec.iter().rev().enumerate() {
+ let mut v = val;
+ while v != 0 {
+ let lsb = v.trailing_zeros() as usize;
+ v &= !(1 << lsb);
+ let cpu = index * 8 + lsb;
+ if cpu >= num_cpus {
+ bail!(
+ concat!(
+ "Found cpu ({}) in cpumask ({}) which is larger",
+ " than the number of cpus on the machine ({})"
+ ),
+ cpu,
+ cpumask,
+ num_cpus
+ );
+ }
+ if cpus[cpu] != -1 {
+ bail!(
+ "Found cpu ({}) with dq ({}) but also in cpumask ({})",
+ cpu,
+ cpus[cpu],
+ cpumask
+ );
+ }
+ cpus[cpu] = dq as i32;
+ cpusets[dq].set(cpu, true);
+ }
+ }
+ cpusets[dq].set_uninitialized(false);
+ }
+
+ for (cpu, &dq) in cpus.iter().enumerate() {
+ if dq < 0 {
+ bail!(
+ "Cpu {} not assigned to any dq. Make sure it is covered by some --cpumasks argument.",
+ cpu
+ );
+ }
+ }
+
+ Ok((cpusets, cpus))
+}
+
+struct Sample {
+ total_cpu: procfs::CpuStat,
+}
+
+fn get_cpustats(reader: &mut procfs::ProcReader) -> anyhow::Result<Sample> {
+ let stat = reader.read_stat().context("Failed to read procfs")?;
+ Ok(Sample {
+ total_cpu: stat
+ .total_cpu
+ .ok_or_else(|| anyhow!("Could not read total cpu stat in proc"))?,
+ })
+}
+
+fn calculate_cpu_busy(prev: &procfs::CpuStat, next: &procfs::CpuStat) -> anyhow::Result<f64> {
+ match (prev, next) {
+ (
+ procfs::CpuStat {
+ user_usec: Some(prev_user),
+ nice_usec: Some(prev_nice),
+ system_usec: Some(prev_system),
+ idle_usec: Some(prev_idle),
+ iowait_usec: Some(prev_iowait),
+ irq_usec: Some(prev_irq),
+ softirq_usec: Some(prev_softirq),
+ stolen_usec: Some(prev_stolen),
+ guest_usec: _,
+ guest_nice_usec: _,
+ },
+ procfs::CpuStat {
+ user_usec: Some(curr_user),
+ nice_usec: Some(curr_nice),
+ system_usec: Some(curr_system),
+ idle_usec: Some(curr_idle),
+ iowait_usec: Some(curr_iowait),
+ irq_usec: Some(curr_irq),
+ softirq_usec: Some(curr_softirq),
+ stolen_usec: Some(curr_stolen),
+ guest_usec: _,
+ guest_nice_usec: _,
+ },
+ ) => {
+ let idle_usec = curr_idle - prev_idle;
+ let iowait_usec = curr_iowait - prev_iowait;
+ let user_usec = curr_user - prev_user;
+ let system_usec = curr_system - prev_system;
+ let nice_usec = curr_nice - prev_nice;
+ let irq_usec = curr_irq - prev_irq;
+ let softirq_usec = curr_softirq - prev_softirq;
+ let stolen_usec = curr_stolen - prev_stolen;
+
+ let busy_usec =
+ user_usec + system_usec + nice_usec + irq_usec + softirq_usec + stolen_usec;
+ let total_usec = idle_usec + busy_usec + iowait_usec;
+ Ok(busy_usec as f64 / total_usec as f64)
+ }
+ _ => {
+ bail!("Some procfs stats are not populated!");
+ }
+ }
+}
+
+fn calculate_pid_busy(
+ prev: &procfs::PidStat,
+ next: &procfs::PidStat,
+ dur: std::time::Duration,
+) -> anyhow::Result<f64> {
+ match (
+ (prev.user_usecs, prev.system_usecs),
+ (next.user_usecs, next.system_usecs),
+ ) {
+ ((Some(prev_user), Some(prev_system)), (Some(next_user), Some(next_system))) => {
+ if (next_user >= prev_user) && (next_system >= prev_system) {
+ let busy_usec = next_user + next_system - prev_user - prev_system;
+ Ok(busy_usec as f64 / dur.as_micros() as f64)
+ } else {
+ bail!("Pid usage values look wrong");
+ }
+ }
+ _ => {
+ bail!("Some procfs stats are not populated!");
+ }
+ }
+}
+
+struct PidInfo {
+ pub pid: i32,
+ pub dom: u32,
+ pub dom_mask: u64,
+}
+
+struct LoadInfo {
+ pids_by_milliload: BTreeMap<u64, PidInfo>,
+ pid_stats: BTreeMap<i32, procfs::PidStat>,
+ global_load_sum: f64,
+ dom_load: Vec<f64>,
+}
+
+// We calculate the load for each task and then each dom by enumerating all the
+// tasks in task_data and calculating their CPU util from procfs.
+
+// Given procfs reader, task data map, and pidstat from previous calculation,
+// return:
+// * a sorted map from milliload -> pid_data,
+// * a map from pid -> pidstat
+// * a vec of per-dom loads
+fn calculate_load(
+ proc_reader: &procfs::ProcReader,
+ task_data: &libbpf_rs::Map,
+ interval: std::time::Duration,
+ prev_pid_stat: &BTreeMap<i32, procfs::PidStat>,
+ nr_doms: usize,
+) -> anyhow::Result<LoadInfo> {
+ let mut ret = LoadInfo {
+ pids_by_milliload: BTreeMap::new(),
+ pid_stats: BTreeMap::new(),
+ global_load_sum: 0f64,
+ dom_load: vec![0f64; nr_doms],
+ };
+ for key in task_data.keys() {
+ if let Some(task_ctx_vec) = task_data
+ .lookup(&key, libbpf_rs::MapFlags::ANY)
+ .context("Failed to lookup task_data")?
+ {
+ let task_ctx =
+ unsafe { &*(task_ctx_vec.as_slice().as_ptr() as *const atropos_sys::task_ctx) };
+ let pid = i32::from_ne_bytes(
+ key.as_slice()
+ .try_into()
+ .context("Invalid key length in task_data map")?,
+ );
+ match proc_reader.read_tid_stat(pid as u32) {
+ Ok(stat) => {
+ ret.pid_stats.insert(pid, stat);
+ }
+ Err(procfs::Error::IoError(_, ref e))
+ if e.raw_os_error()
+ .map_or(false, |ec| ec == 2 || ec == 3 /* ENOENT or ESRCH */) =>
+ {
+ continue;
+ }
+ Err(e) => {
+ bail!(e);
+ }
+ }
+ let pid_load = match (prev_pid_stat.get(&pid), ret.pid_stats.get(&pid)) {
+ (Some(prev_pid_stat), Some(next_pid_stat)) => {
+ calculate_pid_busy(prev_pid_stat, next_pid_stat, interval)?
+ }
+ // If we don't have any utilization #s for the process, just skip it
+ _ => {
+ continue;
+ }
+ } * task_ctx.weight as f64;
+ if !pid_load.is_finite() || pid_load <= 0.0 {
+ continue;
+ }
+ ret.global_load_sum += pid_load;
+ ret.dom_load[task_ctx.dom_id as usize] += pid_load;
+ // Only record pids that are eligible for load balancing
+ if task_ctx.dom_mask == (1u64 << task_ctx.dom_id) {
+ continue;
+ }
+ ret.pids_by_milliload.insert(
+ (pid_load * 1000.0) as u64,
+ PidInfo {
+ pid,
+ dom: task_ctx.dom_id,
+ dom_mask: task_ctx.dom_mask,
+ },
+ );
+ }
+ }
+ Ok(ret)
+}
+
+#[derive(Copy, Clone, Default)]
+struct DomLoadBalanceInfo {
+ load_to_pull: f64,
+ load_to_give: f64,
+}
+
+#[derive(Default)]
+struct LoadBalanceInfo {
+ doms: Vec<DomLoadBalanceInfo>,
+ doms_with_load_to_pull: BTreeMap<u32, f64>,
+ doms_with_load_to_give: BTreeMap<u32, f64>,
+}
+
+// To balance dom loads we identify doms with lower and higher load than average
+fn calculate_dom_load_balance(global_load_avg: f64, dom_load: &[f64]) -> LoadBalanceInfo {
+ let mut ret = LoadBalanceInfo::default();
+ ret.doms.resize(dom_load.len(), Default::default());
+
+ const LOAD_IMBAL_HIGH_PCT: f64 = 0.10;
+ const LOAD_IMBAL_MAX_ADJ_PCT: f64 = 0.10;
+ let high = global_load_avg * LOAD_IMBAL_HIGH_PCT;
+ let adj_max = global_load_avg * LOAD_IMBAL_MAX_ADJ_PCT;
+
+ for (dom, dom_load) in dom_load.iter().enumerate() {
+ let mut imbal = dom_load - global_load_avg;
+
+ let mut dom_load_to_pull = 0f64;
+ let mut dom_load_to_give = 0f64;
+ if imbal >= 0f64 {
+ dom_load_to_give = imbal;
+ } else {
+ imbal = -imbal;
+ if imbal > high {
+ dom_load_to_pull = f64::min(imbal, adj_max);
+ }
+ }
+ ret.doms[dom].load_to_pull = dom_load_to_pull;
+ ret.doms[dom].load_to_give = dom_load_to_give;
+ if dom_load_to_pull > 0f64 {
+ ret.doms_with_load_to_pull
+ .insert(dom as u32, dom_load_to_pull);
+ }
+ if dom_load_to_give > 0f64 {
+ ret.doms_with_load_to_give
+ .insert(dom as u32, dom_load_to_give);
+ }
+ }
+ ret
+}
+
+fn clear_map(map: &mut libbpf_rs::Map) {
+ // XXX: libbpf_rs has a design flaw that makes it impossible to
+ // delete while iterating despite it being safe, so we alias it here
+ let deleter: &mut libbpf_rs::Map = unsafe { &mut *(map as *mut _) };
+ for key in map.keys() {
+ let _ = deleter.delete(&key);
+ }
+}
+
+// Actually execute the load balancing. Concretely this writes pid -> dom
+// entries into the lb_data map for bpf side to consume.
+//
+// The logic here is simple, greedily balance the heaviest load processes until
+// either we have no doms with load to give or no doms with load to pull.
+fn load_balance(
+ global_load_avg: f64,
+ lb_data: &mut libbpf_rs::Map,
+ pids_by_milliload: &BTreeMap<u64, PidInfo>,
+ mut doms_with_load_to_pull: BTreeMap<u32, f64>,
+ mut doms_with_load_to_give: BTreeMap<u32, f64>,
+) -> anyhow::Result<()> {
+ clear_map(lb_data);
+ const LOAD_IMBAL_MIN_ADJ_PCT: f64 = 0.01;
+ let adj_min = global_load_avg * LOAD_IMBAL_MIN_ADJ_PCT;
+ for (pid_milliload, pidinfo) in pids_by_milliload.iter().rev() {
+ if doms_with_load_to_give.is_empty() || doms_with_load_to_pull.is_empty() {
+ break;
+ }
+
+ let pid_load = *pid_milliload as f64 / 1000f64;
+ let mut remove_to_give = None;
+ let mut remove_to_pull = None;
+ if let Some(dom_imbal) = doms_with_load_to_give.get_mut(&pidinfo.dom) {
+ if *dom_imbal < pid_load {
+ continue;
+ }
+
+ for (new_dom, new_dom_imbal) in doms_with_load_to_pull.iter_mut() {
+ if (pidinfo.dom_mask & (1 << new_dom)) == 0 || *new_dom_imbal < pid_load {
+ continue;
+ }
+
+ *dom_imbal -= pid_load;
+ if *dom_imbal <= adj_min {
+ remove_to_give = Some(pidinfo.dom);
+ }
+ *new_dom_imbal -= pid_load;
+ if *new_dom_imbal <= adj_min {
+ remove_to_pull = Some(*new_dom);
+ }
+
+ lb_data
+ .update(
+ &(pidinfo.pid as libc::pid_t).to_ne_bytes(),
+ &new_dom.to_ne_bytes(),
+ libbpf_rs::MapFlags::NO_EXIST,
+ )
+ .context("Failed to update lb_data")?;
+ break;
+ }
+ }
+
+ remove_to_give.map(|dom| doms_with_load_to_give.remove(&dom));
+ remove_to_pull.map(|dom| doms_with_load_to_pull.remove(&dom));
+ }
+ Ok(())
+}
+
+fn print_stats(
+ logger: slog::Logger,
+ stats_map: &mut libbpf_rs::Map,
+ nr_doms: usize,
+ nr_cpus: usize,
+ cpu_busy: f64,
+ global_load_avg: f64,
+ dom_load: &[f64],
+ dom_lb_info: &[DomLoadBalanceInfo],
+) -> anyhow::Result<()> {
+ let stats = {
+ let mut stats: Vec<u64> = Vec::new();
+ let zero_vec = vec![vec![0u8; stats_map.value_size() as usize]; nr_cpus];
+ for stat in 0..atropos_sys::stat_idx_ATROPOS_NR_STATS {
+ let cpu_stat_vec = stats_map
+ .lookup_percpu(&(stat as u32).to_ne_bytes(), libbpf_rs::MapFlags::ANY)
+ .with_context(|| format!("Failed to lookup stat {}", stat))?
+ .expect("per-cpu stat should exist");
+ let sum = cpu_stat_vec
+ .iter()
+ .map(|val| {
+ u64::from_ne_bytes(
+ val.as_slice()
+ .try_into()
+ .expect("Invalid value length in stat map"),
+ )
+ })
+ .sum();
+ stats_map
+ .update_percpu(
+ &(stat as u32).to_ne_bytes(),
+ &zero_vec,
+ libbpf_rs::MapFlags::ANY,
+ )
+ .context("Failed to zero stat")?;
+ stats.push(sum);
+ }
+ stats
+ };
+ let mut total = 0;
+ total += stats[atropos_sys::stat_idx_ATROPOS_STAT_WAKE_SYNC as usize];
+ total += stats[atropos_sys::stat_idx_ATROPOS_STAT_PREV_IDLE as usize];
+ total += stats[atropos_sys::stat_idx_ATROPOS_STAT_PINNED as usize];
+ total += stats[atropos_sys::stat_idx_ATROPOS_STAT_DIRECT_DISPATCH as usize];
+ total += stats[atropos_sys::stat_idx_ATROPOS_STAT_DSQ_DISPATCH as usize];
+ total += stats[atropos_sys::stat_idx_ATROPOS_STAT_GREEDY as usize];
+ total += stats[atropos_sys::stat_idx_ATROPOS_STAT_LAST_TASK as usize];
+ slog::info!(logger, "cpu={:5.1}", cpu_busy * 100.0);
+ slog::info!(
+ logger,
+ "task_get_errs: {}, cpumask_errs: {}",
+ stats[atropos_sys::stat_idx_ATROPOS_STAT_TASK_GET_ERR as usize],
+ stats[atropos_sys::stat_idx_ATROPOS_STAT_CPUMASK_ERR as usize],
+ );
+ slog::info!(
+ logger,
+ "tot={:6} wake_sync={:4.1},prev_idle={:4.1},pinned={:4.1},direct={:4.1},dq={:4.1},greedy={:4.1}",
+ total,
+ stats[atropos_sys::stat_idx_ATROPOS_STAT_WAKE_SYNC as usize] as f64 / total as f64 * 100f64,
+ stats[atropos_sys::stat_idx_ATROPOS_STAT_PREV_IDLE as usize] as f64 / total as f64 * 100f64,
+ stats[atropos_sys::stat_idx_ATROPOS_STAT_PINNED as usize] as f64 / total as f64 * 100f64,
+ stats[atropos_sys::stat_idx_ATROPOS_STAT_DIRECT_DISPATCH as usize] as f64 / total as f64
+ * 100f64,
+ stats[atropos_sys::stat_idx_ATROPOS_STAT_DSQ_DISPATCH as usize] as f64 / total as f64
+ * 100f64,
+ stats[atropos_sys::stat_idx_ATROPOS_STAT_GREEDY as usize] as f64 / total as f64 * 100f64,
+ );
+
+ slog::info!(
+ logger,
+ "load_avg:{:.1}, load_balances={}",
+ global_load_avg,
+ stats[atropos_sys::stat_idx_ATROPOS_STAT_LOAD_BALANCE as usize]
+ );
+ for i in 0..nr_doms {
+ slog::info!(logger, "DOM[{:02}]", i);
+ slog::info!(
+ logger,
+ " load={:.1} to_pull={:.1},to_give={:.1}",
+ dom_load[i],
+ dom_lb_info[i].load_to_pull,
+ dom_lb_info[i].load_to_give,
+ );
+ }
+ Ok(())
+}
+
+pub fn run(
+ logger: slog::Logger,
+ debug: bool,
+ cpumasks: Vec<String>,
+ switch_all: bool,
+ balance_load: bool,
+ greedy_threshold: Option<u64>,
+) -> anyhow::Result<()> {
+ slog::info!(logger, "Atropos Scheduler Initialized");
+ let mut skel_builder = AtroposSkelBuilder::default();
+ skel_builder.obj_builder.debug(debug);
+ let mut skel = skel_builder.open().context("Failed to open BPF program")?;
+
+ let (cpusets, cpus) = parse_cpusets(&cpumasks)?;
+ let nr_doms = cpusets.len();
+ let nr_cpus = libbpf_rs::num_possible_cpus()?;
+ skel.rodata().nr_doms = nr_doms as u32;
+ skel.rodata().nr_cpus = nr_cpus as u32;
+
+ for (cpu, dom) in cpus.iter().enumerate() {
+ skel.rodata().cpu_dom_id_map[cpu] = *dom as u32;
+ }
+
+ for (dom, cpuset) in cpusets.iter().enumerate() {
+ let raw_cpuset_slice = cpuset.as_raw_slice();
+ let dom_cpumask_slice = &mut skel.rodata().dom_cpumasks[dom];
+ let (left, _) = dom_cpumask_slice.split_at_mut(raw_cpuset_slice.len());
+ left.clone_from_slice(cpuset.as_raw_slice());
+ slog::info!(logger, "dom {} cpumask {:X?}", dom, dom_cpumask_slice);
+ }
+
+ skel.rodata().switch_all = switch_all;
+
+ if let Some(greedy) = greedy_threshold {
+ skel.rodata().greedy_threshold = greedy;
+ }
+
+ let mut skel = skel.load().context("Failed to load BPF program")?;
+ skel.attach().context("Failed to attach BPF program")?;
+
+ let _structops = skel
+ .maps_mut()
+ .atropos()
+ .attach_struct_ops()
+ .context("Failed to attach atropos struct ops")?;
+ slog::info!(logger, "Atropos Scheduler Attached");
+ let shutdown = Arc::new(AtomicBool::new(false));
+ let shutdown_clone = shutdown.clone();
+ ctrlc::set_handler(move || {
+ shutdown_clone.store(true, Ordering::Relaxed);
+ })
+ .context("Error setting Ctrl-C handler")?;
+
+ let mut proc_reader = procfs::ProcReader::new();
+ let mut prev_sample = get_cpustats(&mut proc_reader)?;
+ let mut prev_pid_stat: BTreeMap<i32, procfs::PidStat> = BTreeMap::new();
+ while !shutdown.load(Ordering::Relaxed)
+ && unsafe { std::ptr::read_volatile(&skel.bss().exit_type as *const _) } == 0
+ {
+ let interval = std::time::Duration::from_secs(1);
+ std::thread::sleep(interval);
+ let now = std::time::SystemTime::now();
+ let next_sample = get_cpustats(&mut proc_reader)?;
+ let cpu_busy = calculate_cpu_busy(&prev_sample.total_cpu, &next_sample.total_cpu)?;
+ prev_sample = next_sample;
+ let load_info = calculate_load(
+ &proc_reader,
+ skel.maps().task_data(),
+ interval,
+ &prev_pid_stat,
+ nr_doms,
+ )?;
+ prev_pid_stat = load_info.pid_stats;
+
+ let global_load_avg = load_info.global_load_sum / nr_doms as f64;
+ let mut lb_info = calculate_dom_load_balance(global_load_avg, &load_info.dom_load);
+
+ let doms_with_load_to_pull = std::mem::take(&mut lb_info.doms_with_load_to_pull);
+ let doms_with_load_to_give = std::mem::take(&mut lb_info.doms_with_load_to_give);
+ if balance_load {
+ load_balance(
+ global_load_avg,
+ skel.maps_mut().lb_data(),
+ &load_info.pids_by_milliload,
+ doms_with_load_to_pull,
+ doms_with_load_to_give,
+ )?;
+ slog::info!(
+ logger,
+ "Load balancing took {:?}",
+ now.elapsed().context("Getting a duration failed")?
+ );
+ }
+ print_stats(
+ logger.clone(),
+ skel.maps_mut().stats(),
+ nr_doms,
+ nr_cpus,
+ cpu_busy,
+ global_load_avg,
+ &load_info.dom_load,
+ &lb_info.doms,
+ )?;
+ }
+ /* Report msg if EXT_OPS_EXIT_ERROR */
+ if skel.bss().exit_type == 2 {
+ let exit_msg_cstr = unsafe { CStr::from_ptr(skel.bss().exit_msg.as_ptr() as *const _) };
+ let exit_msg = exit_msg_cstr
+ .to_str()
+ .context("Failed to convert exit msg to string")?;
+ eprintln!("exit_type={} msg={}", skel.bss().exit_type, exit_msg);
+ }
+ Ok(())
+}
+
+fn main() -> anyhow::Result<()> {
+ let opts = Opt::parse();
+ let logger = setup_logger(&opts.log_level)?;
+ let debug = opts.log_level == "debug";
+
+ run(
+ logger,
+ debug,
+ opts.cpumasks,
+ opts.all,
+ opts.load_balance,
+ opts.greedy_threshold,
+ )
+}
diff --git a/tools/sched_ext/atropos/src/oss/atropos_sys.rs b/tools/sched_ext/atropos/src/oss/atropos_sys.rs
new file mode 100644
index 000000000000..bbeaf856d40e
--- /dev/null
+++ b/tools/sched_ext/atropos/src/oss/atropos_sys.rs
@@ -0,0 +1,10 @@
+// Copyright (c) Meta Platforms, Inc. and affiliates.
+
+// This software may be used and distributed according to the terms of the
+// GNU General Public License version 2.
+#![allow(non_upper_case_globals)]
+#![allow(non_camel_case_types)]
+#![allow(non_snake_case)]
+#![allow(dead_code)]
+
+include!(concat!(env!("OUT_DIR"), "/atropos-sys.rs"));
diff --git a/tools/sched_ext/atropos/src/oss/mod.rs b/tools/sched_ext/atropos/src/oss/mod.rs
new file mode 100644
index 000000000000..5afcf35f777d
--- /dev/null
+++ b/tools/sched_ext/atropos/src/oss/mod.rs
@@ -0,0 +1,29 @@
+// Copyright (c) Meta Platforms, Inc. and affiliates.
+
+// This software may be used and distributed according to the terms of the
+// GNU General Public License version 2.
+#[path = "../bpf/.output/atropos.skel.rs"]
+mod atropos;
+use std::str::FromStr;
+
+use anyhow::bail;
+pub use atropos::*;
+use slog::o;
+use slog::Drain;
+use slog::Level;
+
+pub mod atropos_sys;
+
+pub fn setup_logger(level: &str) -> anyhow::Result<slog::Logger> {
+ let log_level = match Level::from_str(level) {
+ Ok(l) => l,
+ Err(()) => bail!("Failed to parse \"{}\" as a log level", level),
+ };
+ let decorator = slog_term::TermDecorator::new().build();
+ let drain = slog_term::FullFormat::new(decorator).build().fuse();
+ let drain = slog_async::Async::new(drain)
+ .build()
+ .filter_level(log_level)
+ .fuse();
+ Ok(slog::Logger::root(drain, o!()))
+}
diff --git a/tools/sched_ext/atropos/src/util.rs b/tools/sched_ext/atropos/src/util.rs
new file mode 100644
index 000000000000..eae414c0919a
--- /dev/null
+++ b/tools/sched_ext/atropos/src/util.rs
@@ -0,0 +1,24 @@
+// Copyright (c) Meta Platforms, Inc. and affiliates.
+
+// This software may be used and distributed according to the terms of the
+// GNU General Public License version 2.
+
+// Shim between facebook types and open source types.
+//
+// The type interfaces and module hierarchy should be identical on
+// both "branches". And since we glob import, all the submodules in
+// this crate will inherit our name bindings and can use generic paths,
+// eg `crate::logging::setup(..)`.
+#[macro_export]
+macro_rules! oss_shim {
+ () => {
+ #[cfg(fbcode_build)]
+ mod facebook;
+ #[cfg(fbcode_build)]
+ use facebook::*;
+ #[cfg(not(fbcode_build))]
+ mod oss;
+ #[cfg(not(fbcode_build))]
+ use oss::*;
+ };
+}
--
2.38.1
Add sched_ext_ops operations to init/exit cgroups, and track task migrations
and config changes. Because different BPF schedulers may implement different
subsets of CPU control features, allow BPF schedulers to pick which cgroup
interface files to enable using SCX_OPS_CGROUP_KNOB_* flags. For now, only
the weight knobs are supported but adding more should be straightforward.
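
The knob selection described above can be sketched in plain user-space C. This
is only an illustration, loosely modeled on the flag handling this patch adds:
every DEMO_* name and demo_knob_mask() are hypothetical stand-ins, the cftype
enum is a simplified subset, and DEMO_OPS_ALL_FLAGS leaves out the non-cgroup
flags that the real SCX_OPS_ALL_FLAGS contains.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local copies for illustration; the real definitions live in the kernel. */
#define DEMO_OPS_CGROUP_KNOB_WEIGHT	(1ULL << 16)	/* cpu.weight */
/* Simplified: the real "all flags" mask also carries non-cgroup flags. */
#define DEMO_OPS_ALL_FLAGS		DEMO_OPS_CGROUP_KNOB_WEIGHT

/* Simplified subset of the cgroup interface file ids. */
enum demo_cftype_id {
	DEMO_CFTYPE_WEIGHT,
	DEMO_CFTYPE_WEIGHT_NICE,
	DEMO_CFTYPE_IDLE,
	DEMO_CFTYPE_CNT,
};

/* Return a bitmask of the interface files that should be visible. */
static uint32_t demo_knob_mask(bool switched_all, bool enabled,
			       uint64_t ops_flags)
{
	uint64_t knob_flags;
	uint32_t mask = 0;

	/* If the fair class is still in use, show every knob. */
	if (!switched_all)
		return (1U << DEMO_CFTYPE_CNT) - 1;

	/* With no scheduler loaded, show whatever a scheduler could enable. */
	knob_flags = enabled ? ops_flags : DEMO_OPS_ALL_FLAGS;

	if (knob_flags & DEMO_OPS_CGROUP_KNOB_WEIGHT)
		mask |= (1U << DEMO_CFTYPE_WEIGHT) |
			(1U << DEMO_CFTYPE_WEIGHT_NICE);

	return mask;
}
```

In this sketch, a scheduler that has taken over all tasks without setting the
weight flag hides both weight files, while an unloaded scheduler leaves them
visible so that configuration written in the meantime is remembered.
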
While a BPF scheduler is being enabled or disabled, relevant cgroup
operations are locked out using scx_cgroup_rwsem. This avoids situations
like task prep taking place while the task is being moved across cgroups,
making things easier for BPF schedulers.
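
As a rough user-space model of the migration path, the sketch below prepares
every task in a set, rolls back the already prepared ones if any preparation
fails, and keeps the read side of the lock held from preparation until the
move is committed or canceled. All demo_* names are hypothetical, the lock is
reduced to a plain counter, and cgroups are small non-zero integers; none of
this is kernel API.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these names are kernel APIs. */
struct demo_task {
	int moving_from;	/* source cgroup id, 0 = no move in flight */
	int cgroup;		/* current cgroup id (non-zero) */
	bool fail_prep;		/* make the demo prep step fail for this task */
};

static int demo_readers;	/* models the read side of the rwsem */
static int demo_cancels;	/* number of prep steps rolled back */

/* Models the prep callback: may fail, which aborts the migration. */
static int demo_prep_move(struct demo_task *t)
{
	return t->fail_prep ? -1 : 0;
}

/* Models the cancel callback: undo one successful prep. */
static void demo_cancel_move(struct demo_task *t)
{
	(void)t;
	demo_cancels++;
}

/* Prepare every task; on failure, undo the already prepared ones. */
static int demo_can_attach(struct demo_task *tasks, size_t nr)
{
	size_t i;
	int ret = 0;

	demo_readers++;		/* stays held until attach or cancel */

	for (i = 0; i < nr; i++) {
		ret = demo_prep_move(&tasks[i]);
		if (ret)
			goto err;
		tasks[i].moving_from = tasks[i].cgroup;
	}
	return 0;

err:
	/* Only tasks before the failing one have moving_from set. */
	for (i = 0; i < nr && tasks[i].moving_from; i++) {
		demo_cancel_move(&tasks[i]);
		tasks[i].moving_from = 0;
	}
	demo_readers--;
	return ret;
}

/* Commit the move for every task, then drop the read side. */
static void demo_attach(struct demo_task *tasks, size_t nr, int to)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		tasks[i].cgroup = to;
		tasks[i].moving_from = 0;
	}
	demo_readers--;
}
```

Because the reader count stays non-zero across the whole prepare/commit
window, a writer (the enable or disable path in this model) can never observe
a task that has been prepared but not yet moved.
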
This patch also adds scx_example_pair which implements a variant of core
scheduling where a hyperthread pair only runs tasks from the same cgroup. The
BPF scheduler achieves this by putting tasks into per-cgroup queues,
time-slicing the cgroup to run for each pair first, and then scheduling
within the cgroup. See the header comment in scx_example_pair.bpf.c for more
details.
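
For a feel of how CPUs might be grouped into pairs, here is a minimal sketch
that pairs CPU i with CPU (i + stride) % nr_cpus, following the "stride"
wording of the scheduler's header comment. demo_pair_cpus() and its error
convention are assumptions made for illustration, not the code in
scx_example_pair.c, and whether a given stride lines up with real hyperthread
siblings depends on how the platform numbers its CPUs.

```c
/*
 * Pair each CPU i with CPU (i + stride) % nr_cpus. pair_cpu[] must have
 * nr_cpus entries. Returns 0 on success, or -1 if the arguments are invalid
 * or the stride does not split the CPUs into disjoint pairs.
 */
static int demo_pair_cpus(int nr_cpus, int stride, int *pair_cpu)
{
	int i, j;

	if (nr_cpus <= 0 || stride <= 0)
		return -1;

	for (i = 0; i < nr_cpus; i++)
		pair_cpu[i] = -1;	/* -1 = not paired yet */

	for (i = 0; i < nr_cpus; i++) {
		if (pair_cpu[i] >= 0)
			continue;	/* already claimed by an earlier CPU */

		j = (i + stride) % nr_cpus;
		if (j == i || pair_cpu[j] >= 0)
			return -1;	/* self-pairing or partner taken */

		pair_cpu[i] = j;
		pair_cpu[j] = i;
	}
	return 0;
}
```

With 8 CPUs and a stride of 4 this pairs 0 with 4, 1 with 5, and so on; a
stride that leaves some CPU without a free partner is rejected rather than
silently producing an unpaired CPU.
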
Note that scx_example_pair's cgroup-boundary guarantee breaks down for tasks
running in higher priority scheduler classes. This will be addressed by a
followup patch which implements a mechanism to track CPU preemption.
Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
include/linux/sched/ext.h | 96 ++++-
init/Kconfig | 5 +
kernel/sched/core.c | 69 ++-
kernel/sched/ext.c | 326 ++++++++++++++-
kernel/sched/ext.h | 23 +
kernel/sched/sched.h | 12 +-
tools/sched_ext/.gitignore | 1 +
tools/sched_ext/Makefile | 9 +-
tools/sched_ext/scx_example_pair.bpf.c | 554 +++++++++++++++++++++++++
tools/sched_ext/scx_example_pair.c | 143 +++++++
tools/sched_ext/scx_example_pair.h | 10 +
11 files changed, 1226 insertions(+), 22 deletions(-)
create mode 100644 tools/sched_ext/scx_example_pair.bpf.c
create mode 100644 tools/sched_ext/scx_example_pair.c
create mode 100644 tools/sched_ext/scx_example_pair.h
diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index b1c95fb11c8d..dc51304b6599 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -12,6 +12,8 @@
#include <linux/rhashtable.h>
#include <linux/llist.h>
+struct cgroup;
+
enum scx_consts {
SCX_OPS_NAME_LEN = 128,
SCX_EXIT_REASON_LEN = 128,
@@ -109,14 +111,27 @@ enum scx_ops_flags {
*/
SCX_OPS_ENQ_EXITING = 1LLU << 2,
+ /*
+ * CPU cgroup knob enable flags
+ */
+ SCX_OPS_CGROUP_KNOB_WEIGHT = 1LLU << 16, /* cpu.weight */
+
SCX_OPS_ALL_FLAGS = SCX_OPS_KEEP_BUILTIN_IDLE |
SCX_OPS_ENQ_LAST |
- SCX_OPS_ENQ_EXITING,
+ SCX_OPS_ENQ_EXITING |
+ SCX_OPS_CGROUP_KNOB_WEIGHT,
};
/* argument container for ops.enable() and friends */
struct scx_enable_args {
- /* empty for now */
+ /* the cgroup the task is joining */
+ struct cgroup *cgroup;
+};
+
+/* argument container for ops.cgroup_init() */
+struct scx_cgroup_init_args {
+ /* the weight of the cgroup [1..10000] */
+ u32 weight;
};
/**
@@ -341,7 +356,8 @@ struct sched_ext_ops {
* @p: task to enable BPF scheduling for
* @args: enable arguments, see the struct definition
*
- * Enable @p for BPF scheduling. @p will start running soon.
+ * Enable @p for BPF scheduling. @p is now in the cgroup specified for
+ * the preceding prep_enable() and will start running soon.
*/
void (*enable)(struct task_struct *p, struct scx_enable_args *args);
@@ -365,6 +381,77 @@ struct sched_ext_ops {
*/
void (*disable)(struct task_struct *p);
+ /**
+ * cgroup_init - Initialize a cgroup
+ * @cgrp: cgroup being initialized
+ * @args: init arguments, see the struct definition
+ *
+ * Either the BPF scheduler is being loaded or @cgrp created, initialize
+ * @cgrp for sched_ext. This operation may block.
+ *
+ * Return 0 for success, -errno for failure. An error return while
+ * loading will abort loading of the BPF scheduler. During cgroup
+ * creation, it will abort the specific cgroup creation.
+ */
+ s32 (*cgroup_init)(struct cgroup *cgrp,
+ struct scx_cgroup_init_args *args);
+
+ /**
+ * cgroup_exit - Exit a cgroup
+ * @cgrp: cgroup being exited
+ *
+ * Either the BPF scheduler is being unloaded or @cgrp destroyed, exit
+ * @cgrp for sched_ext. This operation may block.
+ */
+ void (*cgroup_exit)(struct cgroup *cgrp);
+
+ /**
+ * cgroup_prep_move - Prepare a task to be moved to a different cgroup
+ * @p: task being moved
+ * @from: cgroup @p is being moved from
+ * @to: cgroup @p is being moved to
+ *
+ * Prepare @p for move from cgroup @from to @to. This operation may
+ * block and can be used for allocations.
+ *
+ * Return 0 for success, -errno for failure. An error return aborts the
+ * migration.
+ */
+ s32 (*cgroup_prep_move)(struct task_struct *p,
+ struct cgroup *from, struct cgroup *to);
+
+ /**
+ * cgroup_move - Commit cgroup move
+ * @p: task being moved
+ * @from: cgroup @p is being moved from
+ * @to: cgroup @p is being moved to
+ *
+ * Commit the move. @p is dequeued during this operation.
+ */
+ void (*cgroup_move)(struct task_struct *p,
+ struct cgroup *from, struct cgroup *to);
+
+ /**
+ * cgroup_cancel_move - Cancel cgroup move
+ * @p: task whose cgroup move is being canceled
+ * @from: cgroup @p was being moved from
+ * @to: cgroup @p was being moved to
+ *
+ * @p was cgroup_prep_move()'d but failed before reaching cgroup_move().
+ * Undo the preparation.
+ */
+ void (*cgroup_cancel_move)(struct task_struct *p,
+ struct cgroup *from, struct cgroup *to);
+
+ /**
+ * cgroup_set_weight - A cgroup's weight is being changed
+ * @cgrp: cgroup whose weight is being updated
+ * @weight: new weight [1..10000]
+ *
+ * Update @cgrp's weight to @weight.
+ */
+ void (*cgroup_set_weight)(struct cgroup *cgrp, u32 weight);
+
/*
* All online ops must come before ops.init().
*/
@@ -481,6 +568,9 @@ struct sched_ext_entity {
/* cold fields */
struct list_head tasks_node;
+#ifdef CONFIG_EXT_GROUP_SCHED
+ struct cgroup *cgrp_moving_from;
+#endif
};
void sched_ext_free(struct task_struct *p);
diff --git a/init/Kconfig b/init/Kconfig
index abf65098f1b6..826624ec8925 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1037,6 +1037,11 @@ config RT_GROUP_SCHED
realtime bandwidth for them.
See Documentation/scheduler/sched-rt-group.rst for more information.
+config EXT_GROUP_SCHED
+ bool
+ depends on SCHED_CLASS_EXT && CGROUP_SCHED
+ default y
+
endif #CGROUP_SCHED
config UCLAMP_TASK_GROUP
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 89d2421809da..79560641a61f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9720,6 +9720,9 @@ void __init sched_init(void)
root_task_group.shares = ROOT_TASK_GROUP_LOAD;
init_cfs_bandwidth(&root_task_group.cfs_bandwidth);
#endif /* CONFIG_FAIR_GROUP_SCHED */
+#ifdef CONFIG_EXT_GROUP_SCHED
+ root_task_group.scx_weight = CGROUP_WEIGHT_DFL;
+#endif /* CONFIG_EXT_GROUP_SCHED */
#ifdef CONFIG_RT_GROUP_SCHED
root_task_group.rt_se = (struct sched_rt_entity **)ptr;
ptr += nr_cpu_ids * sizeof(void **);
@@ -10175,6 +10178,7 @@ struct task_group *sched_create_group(struct task_group *parent)
if (!alloc_rt_sched_group(tg, parent))
goto err;
+ scx_group_set_weight(tg, CGROUP_WEIGHT_DFL);
alloc_uclamp_sched_group(tg, parent);
return tg;
@@ -10286,6 +10290,8 @@ void sched_move_task(struct task_struct *tsk, enum sched_move_task_reason reason
put_prev_task(rq, tsk);
sched_change_group(tsk);
+ if (reason == SCHED_MOVE_TASK_CGROUP)
+ scx_cgroup_move_task(tsk);
if (queued)
enqueue_task(rq, tsk, queue_flags);
@@ -10325,6 +10331,11 @@ static int cpu_cgroup_css_online(struct cgroup_subsys_state *css)
{
struct task_group *tg = css_tg(css);
struct task_group *parent = css_tg(css->parent);
+ int ret;
+
+ ret = scx_tg_online(tg);
+ if (ret)
+ return ret;
if (parent)
sched_online_group(tg, parent);
@@ -10341,6 +10352,13 @@ static int cpu_cgroup_css_online(struct cgroup_subsys_state *css)
return 0;
}
+static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
+{
+ struct task_group *tg = css_tg(css);
+
+ scx_tg_offline(tg);
+}
+
static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
{
struct task_group *tg = css_tg(css);
@@ -10358,9 +10376,10 @@ static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
sched_unregister_group(tg);
}
-#ifdef CONFIG_RT_GROUP_SCHED
+#if defined(CONFIG_RT_GROUP_SCHED) || defined(CONFIG_EXT_GROUP_SCHED)
static int cpu_cgroup_can_attach(struct cgroup_taskset *tset)
{
+#ifdef CONFIG_RT_GROUP_SCHED
struct task_struct *task;
struct cgroup_subsys_state *css;
@@ -10368,7 +10387,8 @@ static int cpu_cgroup_can_attach(struct cgroup_taskset *tset)
if (!sched_rt_can_attach(css_tg(css), task))
return -EINVAL;
}
- return 0;
+#endif
+ return scx_cgroup_can_attach(tset);
}
#endif
@@ -10379,7 +10399,16 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
cgroup_taskset_for_each(task, css, tset)
sched_move_task(task, SCHED_MOVE_TASK_CGROUP);
+
+ scx_cgroup_finish_attach();
+}
+
+#ifdef CONFIG_EXT_GROUP_SCHED
+static void cpu_cgroup_cancel_attach(struct cgroup_taskset *tset)
+{
+ scx_cgroup_cancel_attach(tset);
}
+#endif
#ifdef CONFIG_UCLAMP_TASK_GROUP
static void cpu_util_update_eff(struct cgroup_subsys_state *css)
@@ -10562,9 +10591,15 @@ static int cpu_uclamp_max_show(struct seq_file *sf, void *v)
static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
struct cftype *cftype, u64 shareval)
{
+ int ret;
+
if (shareval > scale_load_down(ULONG_MAX))
shareval = MAX_SHARES;
- return sched_group_set_shares(css_tg(css), scale_load(shareval));
+ ret = sched_group_set_shares(css_tg(css), scale_load(shareval));
+ if (!ret)
+ scx_group_set_weight(css_tg(css),
+ sched_weight_to_cgroup(shareval));
+ return ret;
}
static u64 cpu_shares_read_u64(struct cgroup_subsys_state *css,
@@ -11028,11 +11063,15 @@ static int cpu_extra_stat_show(struct seq_file *sf,
return 0;
}
-#ifdef CONFIG_FAIR_GROUP_SCHED
+#if defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_EXT_GROUP_SCHED)
static unsigned long tg_weight(struct task_group *tg)
{
+#ifdef CONFIG_FAIR_GROUP_SCHED
return scale_load_down(tg->shares);
+#else
+ return sched_weight_from_cgroup(tg->scx_weight);
+#endif
}
static u64 cpu_weight_read_u64(struct cgroup_subsys_state *css,
@@ -11045,13 +11084,17 @@ static int cpu_weight_write_u64(struct cgroup_subsys_state *css,
struct cftype *cft, u64 cgrp_weight)
{
unsigned long weight;
+ int ret;
if (cgrp_weight < CGROUP_WEIGHT_MIN || cgrp_weight > CGROUP_WEIGHT_MAX)
return -ERANGE;
weight = sched_weight_from_cgroup(cgrp_weight);
- return sched_group_set_shares(css_tg(css), scale_load(weight));
+ ret = sched_group_set_shares(css_tg(css), scale_load(weight));
+ if (!ret)
+ scx_group_set_weight(css_tg(css), cgrp_weight);
+ return ret;
}
static s64 cpu_weight_nice_read_s64(struct cgroup_subsys_state *css,
@@ -11076,7 +11119,7 @@ static int cpu_weight_nice_write_s64(struct cgroup_subsys_state *css,
struct cftype *cft, s64 nice)
{
unsigned long weight;
- int idx;
+ int idx, ret;
if (nice < MIN_NICE || nice > MAX_NICE)
return -ERANGE;
@@ -11085,7 +11128,11 @@ static int cpu_weight_nice_write_s64(struct cgroup_subsys_state *css,
idx = array_index_nospec(idx, 40);
weight = sched_prio_to_weight[idx];
- return sched_group_set_shares(css_tg(css), scale_load(weight));
+ ret = sched_group_set_shares(css_tg(css), scale_load(weight));
+ if (!ret)
+ scx_group_set_weight(css_tg(css),
+ sched_weight_to_cgroup(weight));
+ return ret;
}
#endif
@@ -11147,7 +11194,7 @@ static ssize_t cpu_max_write(struct kernfs_open_file *of,
#endif
struct cftype cpu_cftypes[CPU_CFTYPE_CNT + 1] = {
-#ifdef CONFIG_FAIR_GROUP_SCHED
+#if defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_EXT_GROUP_SCHED)
[CPU_CFTYPE_WEIGHT] = {
.name = "weight",
.flags = CFTYPE_NOT_ON_ROOT,
@@ -11201,13 +11248,17 @@ struct cftype cpu_cftypes[CPU_CFTYPE_CNT + 1] = {
struct cgroup_subsys cpu_cgrp_subsys = {
.css_alloc = cpu_cgroup_css_alloc,
.css_online = cpu_cgroup_css_online,
+ .css_offline = cpu_cgroup_css_offline,
.css_released = cpu_cgroup_css_released,
.css_free = cpu_cgroup_css_free,
.css_extra_stat_show = cpu_extra_stat_show,
-#ifdef CONFIG_RT_GROUP_SCHED
+#if defined(CONFIG_RT_GROUP_SCHED) || defined(CONFIG_EXT_GROUP_SCHED)
.can_attach = cpu_cgroup_can_attach,
#endif
.attach = cpu_cgroup_attach,
+#ifdef CONFIG_EXT_GROUP_SCHED
+ .cancel_attach = cpu_cgroup_cancel_attach,
+#endif
.legacy_cftypes = cpu_legacy_cftypes,
.dfl_cftypes = cpu_cftypes,
.early_init = true,
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index cf6493f684f3..bd03b55fbcf5 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1561,6 +1561,19 @@ static void task_tick_scx(struct rq *rq, struct task_struct *curr, int queued)
resched_curr(rq);
}
+static struct cgroup *tg_cgrp(struct task_group *tg)
+{
+ /*
+ * If CGROUP_SCHED is disabled, @tg is NULL. If @tg is an autogroup,
+ * @tg->css.cgroup is NULL. In both cases, @tg can be treated as the
+ * root cgroup.
+ */
+ if (tg && tg->css.cgroup)
+ return tg->css.cgroup;
+ else
+ return &cgrp_dfl_root.cgrp;
+}
+
static int scx_ops_prepare_task(struct task_struct *p, struct task_group *tg)
{
int ret;
@@ -1570,7 +1583,7 @@ static int scx_ops_prepare_task(struct task_struct *p, struct task_group *tg)
p->scx.disallow = false;
if (SCX_HAS_OP(prep_enable)) {
- struct scx_enable_args args = { };
+ struct scx_enable_args args = { .cgroup = tg_cgrp(tg) };
ret = scx_ops.prep_enable(p, &args);
if (unlikely(ret)) {
@@ -1610,7 +1623,8 @@ static void scx_ops_enable_task(struct task_struct *p)
WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_OPS_PREPPED));
if (SCX_HAS_OP(enable)) {
- struct scx_enable_args args = { };
+ struct scx_enable_args args =
+ { .cgroup = tg_cgrp(p->sched_task_group) };
scx_ops.enable(p, &args);
}
p->scx.flags &= ~SCX_TASK_OPS_PREPPED;
@@ -1623,7 +1637,8 @@ static void scx_ops_disable_task(struct task_struct *p)
if (p->scx.flags & SCX_TASK_OPS_PREPPED) {
if (SCX_HAS_OP(cancel_enable)) {
- struct scx_enable_args args = { };
+ struct scx_enable_args args =
+ { .cgroup = tg_cgrp(task_group(p)) };
scx_ops.cancel_enable(p, &args);
}
p->scx.flags &= ~SCX_TASK_OPS_PREPPED;
@@ -1777,6 +1792,156 @@ bool scx_can_stop_tick(struct rq *rq)
}
#endif
+#ifdef CONFIG_EXT_GROUP_SCHED
+
+DEFINE_STATIC_PERCPU_RWSEM(scx_cgroup_rwsem);
+
+int scx_tg_online(struct task_group *tg)
+{
+ int ret = 0;
+
+ WARN_ON_ONCE(tg->scx_flags & (SCX_TG_ONLINE | SCX_TG_INITED));
+
+ percpu_down_read(&scx_cgroup_rwsem);
+
+ if (SCX_HAS_OP(cgroup_init)) {
+ struct scx_cgroup_init_args args = { .weight = tg->scx_weight };
+
+ ret = scx_ops.cgroup_init(tg->css.cgroup, &args);
+ if (!ret)
+ tg->scx_flags |= SCX_TG_ONLINE | SCX_TG_INITED;
+ else
+ ret = ops_sanitize_err("cgroup_init", ret);
+ } else {
+ tg->scx_flags |= SCX_TG_ONLINE;
+ }
+
+ percpu_up_read(&scx_cgroup_rwsem);
+ return ret;
+}
+
+void scx_tg_offline(struct task_group *tg)
+{
+ WARN_ON_ONCE(!(tg->scx_flags & SCX_TG_ONLINE));
+
+ percpu_down_read(&scx_cgroup_rwsem);
+
+ if (SCX_HAS_OP(cgroup_exit) && (tg->scx_flags & SCX_TG_INITED))
+ scx_ops.cgroup_exit(tg->css.cgroup);
+ tg->scx_flags &= ~(SCX_TG_ONLINE | SCX_TG_INITED);
+
+ percpu_up_read(&scx_cgroup_rwsem);
+}
+
+int scx_cgroup_can_attach(struct cgroup_taskset *tset)
+{
+ struct cgroup_subsys_state *css;
+ struct task_struct *p;
+ int ret;
+
+ /* released in scx_cgroup_finish/cancel_attach() */
+ percpu_down_read(&scx_cgroup_rwsem);
+
+ if (!scx_enabled())
+ return 0;
+
+ cgroup_taskset_for_each(p, css, tset) {
+ struct cgroup *from = tg_cgrp(p->sched_task_group);
+
+ if (SCX_HAS_OP(cgroup_prep_move)) {
+ ret = scx_ops.cgroup_prep_move(p, from, css->cgroup);
+ if (ret)
+ goto err;
+ }
+
+ WARN_ON_ONCE(p->scx.cgrp_moving_from);
+ p->scx.cgrp_moving_from = from;
+ }
+
+ return 0;
+
+err:
+ cgroup_taskset_for_each(p, css, tset) {
+ if (!p->scx.cgrp_moving_from)
+ break;
+ if (SCX_HAS_OP(cgroup_cancel_move))
+ scx_ops.cgroup_cancel_move(p, p->scx.cgrp_moving_from,
+ css->cgroup);
+ p->scx.cgrp_moving_from = NULL;
+ }
+
+ percpu_up_read(&scx_cgroup_rwsem);
+ return ops_sanitize_err("cgroup_prep_move", ret);
+}
+
+void scx_cgroup_move_task(struct task_struct *p)
+{
+ if (!scx_enabled())
+ return;
+
+ if (SCX_HAS_OP(cgroup_move)) {
+ WARN_ON_ONCE(!p->scx.cgrp_moving_from);
+ scx_ops.cgroup_move(p, p->scx.cgrp_moving_from,
+ tg_cgrp(p->sched_task_group));
+ }
+ p->scx.cgrp_moving_from = NULL;
+}
+
+void scx_cgroup_finish_attach(void)
+{
+ percpu_up_read(&scx_cgroup_rwsem);
+}
+
+void scx_cgroup_cancel_attach(struct cgroup_taskset *tset)
+{
+ struct cgroup_subsys_state *css;
+ struct task_struct *p;
+
+ if (!scx_enabled())
+ goto out_unlock;
+
+ cgroup_taskset_for_each(p, css, tset) {
+ if (SCX_HAS_OP(cgroup_cancel_move)) {
+ WARN_ON_ONCE(!p->scx.cgrp_moving_from);
+ scx_ops.cgroup_cancel_move(p, p->scx.cgrp_moving_from,
+ css->cgroup);
+ }
+ p->scx.cgrp_moving_from = NULL;
+ }
+out_unlock:
+ percpu_up_read(&scx_cgroup_rwsem);
+}
+
+void scx_group_set_weight(struct task_group *tg, unsigned long weight)
+{
+ percpu_down_read(&scx_cgroup_rwsem);
+
+ if (tg->scx_weight != weight) {
+ if (SCX_HAS_OP(cgroup_set_weight))
+ scx_ops.cgroup_set_weight(tg_cgrp(tg), weight);
+ tg->scx_weight = weight;
+ }
+
+ percpu_up_read(&scx_cgroup_rwsem);
+}
+
+static void scx_cgroup_lock(void)
+{
+ percpu_down_write(&scx_cgroup_rwsem);
+}
+
+static void scx_cgroup_unlock(void)
+{
+ percpu_up_write(&scx_cgroup_rwsem);
+}
+
+#else /* CONFIG_EXT_GROUP_SCHED */
+
+static inline void scx_cgroup_lock(void) {}
+static inline void scx_cgroup_unlock(void) {}
+
+#endif /* CONFIG_EXT_GROUP_SCHED */
+
/*
* Omitted operations:
*
@@ -1916,6 +2081,130 @@ static void destroy_dsq(u64 dsq_id)
rcu_read_unlock();
}
+#ifdef CONFIG_EXT_GROUP_SCHED
+static void scx_cgroup_exit(void)
+{
+ struct cgroup_subsys_state *css;
+
+ percpu_rwsem_assert_held(&scx_cgroup_rwsem);
+
+ /*
+ * scx_tg_on/offline() are excluded through scx_cgroup_rwsem. If we walk
+ * cgroups and exit all the inited ones, all online cgroups are exited.
+ */
+ rcu_read_lock();
+ css_for_each_descendant_post(css, &root_task_group.css) {
+ struct task_group *tg = css_tg(css);
+
+ if (!(tg->scx_flags & SCX_TG_INITED))
+ continue;
+ tg->scx_flags &= ~SCX_TG_INITED;
+
+ if (!scx_ops.cgroup_exit)
+ continue;
+
+ if (WARN_ON_ONCE(!css_tryget(css)))
+ continue;
+ rcu_read_unlock();
+
+ scx_ops.cgroup_exit(css->cgroup);
+
+ rcu_read_lock();
+ css_put(css);
+ }
+ rcu_read_unlock();
+}
+
+static int scx_cgroup_init(void)
+{
+ struct cgroup_subsys_state *css;
+ int ret;
+
+ percpu_rwsem_assert_held(&scx_cgroup_rwsem);
+
+ /*
+ * scx_tg_on/offline() are excluded through scx_cgroup_rwsem. If we walk
+ * cgroups and init, all online cgroups are initialized.
+ */
+ rcu_read_lock();
+ css_for_each_descendant_pre(css, &root_task_group.css) {
+ struct task_group *tg = css_tg(css);
+ struct scx_cgroup_init_args args = { .weight = tg->scx_weight };
+
+ if ((tg->scx_flags &
+ (SCX_TG_ONLINE | SCX_TG_INITED)) != SCX_TG_ONLINE)
+ continue;
+
+ if (!scx_ops.cgroup_init) {
+ tg->scx_flags |= SCX_TG_INITED;
+ continue;
+ }
+
+ if (WARN_ON_ONCE(!css_tryget(css)))
+ continue;
+ rcu_read_unlock();
+
+ ret = scx_ops.cgroup_init(css->cgroup, &args);
+ if (ret) {
+ css_put(css);
+ return ret;
+ }
+ tg->scx_flags |= SCX_TG_INITED;
+
+ rcu_read_lock();
+ css_put(css);
+ }
+ rcu_read_unlock();
+
+ return 0;
+}
+
+static void scx_cgroup_config_knobs(void)
+{
+ static DEFINE_MUTEX(cgintf_mutex);
+ DECLARE_BITMAP(mask, CPU_CFTYPE_CNT) = { };
+ u64 knob_flags;
+ int i;
+
+ /*
+ * Called from both class switch and ops enable/disable paths,
+ * synchronize internally.
+ */
+ mutex_lock(&cgintf_mutex);
+
+ /* if fair is in use, all knobs should be shown */
+ if (!scx_switched_all()) {
+ bitmap_fill(mask, CPU_CFTYPE_CNT);
+ goto apply;
+ }
+
+ /*
+ * On ext, only show the supported knobs. Otherwise, show all possible
+ * knobs so that configuration attempts succeed and the states are
+ * remembered while ops is not loaded.
+ */
+ if (scx_enabled())
+ knob_flags = scx_ops.flags;
+ else
+ knob_flags = SCX_OPS_ALL_FLAGS;
+
+ if (knob_flags & SCX_OPS_CGROUP_KNOB_WEIGHT) {
+ __set_bit(CPU_CFTYPE_WEIGHT, mask);
+ __set_bit(CPU_CFTYPE_WEIGHT_NICE, mask);
+ }
+apply:
+ for (i = 0; i < CPU_CFTYPE_CNT; i++)
+ cgroup_show_cftype(&cpu_cftypes[i], test_bit(i, mask));
+
+ mutex_unlock(&cgintf_mutex);
+}
+
+#else
+static void scx_cgroup_exit(void) {}
+static int scx_cgroup_init(void) { return 0; }
+static void scx_cgroup_config_knobs(void) {}
+#endif
+
/*
* Used by sched_fork() and __setscheduler_prio() to pick the matching
* sched_class. dl/rt are already handled.
@@ -2071,9 +2360,10 @@ static void scx_ops_disable_workfn(struct kthread_work *work)
static_branch_disable(&__scx_switched_all);
WRITE_ONCE(scx_switching_all, false);
- /* avoid racing against fork */
+ /* avoid racing against fork and cgroup changes */
cpus_read_lock();
percpu_down_write(&scx_fork_rwsem);
+ scx_cgroup_lock();
spin_lock_irq(&scx_tasks_lock);
scx_task_iter_init(&sti);
@@ -2108,6 +2398,9 @@ static void scx_ops_disable_workfn(struct kthread_work *work)
static_branch_disable_cpuslocked(&scx_builtin_idle_enabled);
synchronize_rcu();
+ scx_cgroup_exit();
+
+ scx_cgroup_unlock();
percpu_up_write(&scx_fork_rwsem);
cpus_read_unlock();
@@ -2137,6 +2430,8 @@ static void scx_ops_disable_workfn(struct kthread_work *work)
WARN_ON_ONCE(scx_ops_set_enable_state(SCX_OPS_DISABLED) !=
SCX_OPS_DISABLING);
+
+ scx_cgroup_config_knobs();
}
static DEFINE_KTHREAD_WORK(scx_ops_disable_work, scx_ops_disable_workfn);
@@ -2293,10 +2588,11 @@ static int scx_ops_enable(struct sched_ext_ops *ops)
task_runnable_timeout_ms / 2);
/*
- * Lock out forks before opening the floodgate so that they don't wander
- * into the operations prematurely.
+ * Lock out forks, cgroup on/offlining and moves before opening the
+ * floodgate so that they don't wander into the operations prematurely.
*/
percpu_down_write(&scx_fork_rwsem);
+ scx_cgroup_lock();
for (i = 0; i < SCX_NR_ONLINE_OPS; i++)
if (((void (**)(void))ops)[i])
@@ -2315,6 +2611,14 @@ static int scx_ops_enable(struct sched_ext_ops *ops)
static_branch_disable_cpuslocked(&scx_builtin_idle_enabled);
}
+ /*
+ * All cgroups should be initialized before letting in tasks. cgroup
+ * on/offlining and task migrations are already locked out.
+ */
+ ret = scx_cgroup_init();
+ if (ret)
+ goto err_disable_unlock;
+
static_branch_enable_cpuslocked(&__scx_ops_enabled);
/*
@@ -2397,6 +2701,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops)
spin_unlock_irq(&scx_tasks_lock);
preempt_enable();
+ scx_cgroup_unlock();
percpu_up_write(&scx_fork_rwsem);
if (!scx_ops_tryset_enable_state(SCX_OPS_ENABLED, SCX_OPS_ENABLING)) {
@@ -2410,6 +2715,8 @@ static int scx_ops_enable(struct sched_ext_ops *ops)
cpus_read_unlock();
mutex_unlock(&scx_ops_enable_mutex);
+ scx_cgroup_config_knobs();
+
return 0;
err_unlock:
@@ -2417,6 +2724,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops)
return ret;
err_disable_unlock:
+ scx_cgroup_unlock();
percpu_up_write(&scx_fork_rwsem);
err_disable:
cpus_read_unlock();
@@ -2578,6 +2886,9 @@ static int bpf_scx_check_member(const struct btf_type *t,
switch (moff) {
case offsetof(struct sched_ext_ops, prep_enable):
+ case offsetof(struct sched_ext_ops, cgroup_init):
+ case offsetof(struct sched_ext_ops, cgroup_exit):
+ case offsetof(struct sched_ext_ops, cgroup_prep_move):
case offsetof(struct sched_ext_ops, init):
case offsetof(struct sched_ext_ops, exit):
/*
@@ -2704,6 +3015,7 @@ void __init init_sched_ext_class(void)
register_sysrq_key('S', &sysrq_sched_ext_reset_op);
INIT_DELAYED_WORK(&check_timeout_work, scx_check_timeout_workfn);
+ scx_cgroup_config_knobs();
}
@@ -2745,7 +3057,7 @@ static const struct btf_kfunc_id_set scx_kfunc_set_init = {
* @node: NUMA node to allocate from
*
* Create a dsq identified by @dsq_id. Can be called from sleepable operations
- * including ops.init() and .prep_enable().
+ * including ops.init(), .prep_enable() and .cgroup_prep_move().
*/
s32 scx_bpf_create_dsq(u64 dsq_id, s32 node)
{
diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h
index e9ec267f13d5..470b2224cdfa 100644
--- a/kernel/sched/ext.h
+++ b/kernel/sched/ext.h
@@ -59,6 +59,11 @@ enum scx_deq_flags {
SCX_DEQ_SLEEP = DEQUEUE_SLEEP,
};
+enum scx_tg_flags {
+ SCX_TG_ONLINE = 1U << 0,
+ SCX_TG_INITED = 1U << 1,
+};
+
enum scx_kick_flags {
SCX_KICK_PREEMPT = 1LLU << 0, /* force scheduling on the CPU */
};
@@ -185,3 +190,21 @@ static inline void scx_update_idle(struct rq *rq, bool idle)
#else
static inline void scx_update_idle(struct rq *rq, bool idle) {}
#endif
+
+#ifdef CONFIG_EXT_GROUP_SCHED
+int scx_tg_online(struct task_group *tg);
+void scx_tg_offline(struct task_group *tg);
+int scx_cgroup_can_attach(struct cgroup_taskset *tset);
+void scx_cgroup_move_task(struct task_struct *p);
+void scx_cgroup_finish_attach(void);
+void scx_cgroup_cancel_attach(struct cgroup_taskset *tset);
+void scx_group_set_weight(struct task_group *tg, unsigned long cgrp_weight);
+#else /* CONFIG_EXT_GROUP_SCHED */
+static inline int scx_tg_online(struct task_group *tg) { return 0; }
+static inline void scx_tg_offline(struct task_group *tg) {}
+static inline int scx_cgroup_can_attach(struct cgroup_taskset *tset) { return 0; }
+static inline void scx_cgroup_move_task(struct task_struct *p) {}
+static inline void scx_cgroup_finish_attach(void) {}
+static inline void scx_cgroup_cancel_attach(struct cgroup_taskset *tset) {}
+static inline void scx_group_set_weight(struct task_group *tg, unsigned long cgrp_weight) {}
+#endif /* CONFIG_EXT_GROUP_SCHED */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a95aae3bc69a..a2ffa94ede02 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -406,6 +406,11 @@ struct task_group {
struct rt_bandwidth rt_bandwidth;
#endif
+#ifdef CONFIG_EXT_GROUP_SCHED
+ u32 scx_flags; /* SCX_TG_* */
+ u32 scx_weight;
+#endif
+
struct rcu_head rcu;
struct list_head list;
@@ -535,6 +540,11 @@ extern void set_task_rq_fair(struct sched_entity *se,
static inline void set_task_rq_fair(struct sched_entity *se,
struct cfs_rq *prev, struct cfs_rq *next) { }
#endif /* CONFIG_SMP */
+#else /* CONFIG_FAIR_GROUP_SCHED */
+static inline int sched_group_set_shares(struct task_group *tg, unsigned long shares)
+{
+ return 0;
+}
#endif /* CONFIG_FAIR_GROUP_SCHED */
#else /* CONFIG_CGROUP_SCHED */
@@ -3260,7 +3270,7 @@ static inline void update_current_exec_runtime(struct task_struct *curr,
#ifdef CONFIG_CGROUP_SCHED
enum cpu_cftype_id {
-#ifdef CONFIG_FAIR_GROUP_SCHED
+#if defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_EXT_GROUP_SCHED)
CPU_CFTYPE_WEIGHT,
CPU_CFTYPE_WEIGHT_NICE,
CPU_CFTYPE_IDLE,
diff --git a/tools/sched_ext/.gitignore b/tools/sched_ext/.gitignore
index 389f0e5b0970..ebc34dcf925b 100644
--- a/tools/sched_ext/.gitignore
+++ b/tools/sched_ext/.gitignore
@@ -1,6 +1,7 @@
scx_example_dummy
scx_example_qmap
scx_example_central
+scx_example_pair
*.skel.h
*.subskel.h
/tools/
diff --git a/tools/sched_ext/Makefile b/tools/sched_ext/Makefile
index d406b7586e08..45ab39139afc 100644
--- a/tools/sched_ext/Makefile
+++ b/tools/sched_ext/Makefile
@@ -114,7 +114,7 @@ BPF_CFLAGS = -g -D__TARGET_ARCH_$(SRCARCH) \
-Wno-compare-distinct-pointer-types \
-O2 -mcpu=v3
-all: scx_example_dummy scx_example_qmap scx_example_central
+all: scx_example_dummy scx_example_qmap scx_example_central scx_example_pair
# sort removes libbpf duplicates when not cross-building
MAKE_DIRS := $(sort $(BUILD_DIR)/libbpf $(HOST_BUILD_DIR)/libbpf \
@@ -177,10 +177,15 @@ scx_example_central: scx_example_central.c scx_example_central.skel.h user_exit_
$(CC) $(CFLAGS) -c $< -o $@.o
$(CC) -o $@ $@.o $(HOST_BPFOBJ) $(LDFLAGS)
+scx_example_pair: scx_example_pair.c scx_example_pair.skel.h user_exit_info.h
+ $(CC) $(CFLAGS) -c $< -o $@.o
+ $(CC) -o $@ $@.o $(HOST_BPFOBJ) $(LDFLAGS)
+
clean:
rm -rf $(SCRATCH_DIR) $(HOST_SCRATCH_DIR)
rm -f *.o *.bpf.o *.skel.h *.subskel.h
- rm -f scx_example_dummy scx_example_qmap scx_example_central
+ rm -f scx_example_dummy scx_example_qmap scx_example_central \
+ scx_example_pair
.PHONY: all clean
diff --git a/tools/sched_ext/scx_example_pair.bpf.c b/tools/sched_ext/scx_example_pair.bpf.c
new file mode 100644
index 000000000000..7694d2169383
--- /dev/null
+++ b/tools/sched_ext/scx_example_pair.bpf.c
@@ -0,0 +1,554 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * A demo sched_ext core-scheduler which always makes every sibling CPU pair
+ * execute from the same CPU cgroup.
+ *
+ * Each CPU in the system is paired with exactly one other CPU, according to a
+ * "stride" value that can be specified when the BPF scheduler program is first
+ * loaded. Throughout the runtime of the scheduler, these CPU pairs guarantee
+ * that they will only ever schedule tasks that belong to the same CPU cgroup.
+ *
+ * Scheduler Initialization
+ * ------------------------
+ *
+ * The scheduler BPF program is first initialized from user space, before it is
+ * enabled. During this initialization process, each CPU on the system is
+ * assigned several values that are constant throughout its runtime:
+ *
+ * 1. *Pair CPU*: The CPU that it synchronizes with when making scheduling
+ * decisions. Paired CPUs always schedule tasks from the same
+ * CPU cgroup, and synchronize with each other to guarantee
+ * that this constraint is not violated.
+ * 2. *Pair ID*: Each CPU pair is assigned a Pair ID, which is used to access
+ * a struct pair_ctx object that is shared between the pair.
+ * 3. *In-pair-index*: An index, 0 or 1, that is assigned to each core in the
+ * pair. Each struct pair_ctx has an active_mask field,
+ * which is a bitmap used to indicate whether each core
+ * in the pair currently has an actively running task.
+ * This index specifies which entry in the bitmap corresponds
+ * to each CPU in the pair.
+ *
+ * During this initialization, the CPUs are paired according to a "stride" that
+ * may be specified when invoking the user space program that initializes and
+ * loads the scheduler. By default, the stride is 1/2 the total number of CPUs.
+ *
+ * Tasks and cgroups
+ * -----------------
+ *
+ * Every cgroup in the system is registered with the scheduler using the
+ * pair_cgroup_init() callback, and every task in the system is associated with
+ * exactly one cgroup. At a high level, the idea with the pair scheduler is to
+ * always schedule tasks from the same cgroup within a given CPU pair. When a
+ * task is enqueued (i.e. passed to the pair_enqueue() callback function), its
+ * cgroup ID is read from its task struct, and then a corresponding queue map
+ * is used to FIFO-enqueue the task for that cgroup.
+ *
+ * If you look through the implementation of the scheduler, you'll notice that
+ * there is quite a bit of complexity involved with looking up the per-cgroup
+ * FIFO queue that we enqueue tasks in. For example, there is a cgrp_q_idx_hash
+ * BPF hash map that is used to map a cgroup ID to a globally unique ID that's
+ * allocated in the BPF program. This is done because we use separate maps to
+ * store the FIFO queue of tasks, and the length of that map, per cgroup. This
+ * complexity is only present because of current deficiencies in BPF that will
+ * soon be addressed. The main point to keep in mind is that newly enqueued
+ * tasks are added to their cgroup's FIFO queue.
+ *
+ * Dispatching tasks
+ * -----------------
+ *
+ * This section will describe how enqueued tasks are dispatched and scheduled.
+ * Tasks are dispatched in pair_dispatch(), and at a high level the workflow is
+ * as follows:
+ *
+ * 1. Fetch the struct pair_ctx for the current CPU. As mentioned above, this is
+ * the structure that's used to synchronize between the two paired CPUs in their
+ * scheduling decisions. After any of the following events have occurred:
+ *
+ * - The cgroup's slice run has expired, or
+ * - The cgroup becomes empty, or
+ * - Either CPU in the pair is preempted by a higher priority scheduling class
+ *
+ * The cgroup transitions to the draining state and stops executing new tasks
+ * from the cgroup.
+ *
+ * 2. If the pair CPU is still executing a task, mark the pair_ctx as draining,
+ * and wait for the pair CPU to be preempted.
+ *
+ * 3. Otherwise, if the pair CPU is not running a task, we can move onto
+ * scheduling new tasks. Pop the next cgroup id from the top_q queue.
+ *
+ * 4. Pop a task from that cgroup's FIFO task queue, and begin executing it.
+ *
+ * Note again that this scheduling behavior is simple, but the implementation
+ * is complex mostly because it hits several BPF shortcomings and has to work
+ * around them in often awkward ways. Most of the shortcomings are expected to
+ * be resolved in the near future, which should allow greatly simplifying this
+ * scheduler.
+ *
+ * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2022 Tejun Heo <[email protected]>
+ * Copyright (c) 2022 David Vernet <[email protected]>
+ */
+#include "scx_common.bpf.h"
+#include "scx_example_pair.h"
+
+char _license[] SEC("license") = "GPL";
+
+const volatile bool switch_all;
+
+const volatile u32 nr_cpu_ids;
+
+/* a pair of CPUs stay on a cgroup for this duration */
+const volatile u32 pair_batch_dur_ns = SCX_SLICE_DFL;
+
+/* cpu ID -> pair cpu ID */
+const volatile s32 pair_cpu[MAX_CPUS] = { [0 ... MAX_CPUS - 1] = -1 };
+
+/* cpu ID -> pair_id */
+const volatile u32 pair_id[MAX_CPUS];
+
+/* CPU ID -> CPU # in the pair (0 or 1) */
+const volatile u32 in_pair_idx[MAX_CPUS];
+
+struct pair_ctx {
+ struct bpf_spin_lock lock;
+
+ /* the cgroup the pair is currently executing */
+ u64 cgid;
+
+ /* the pair started executing the current cgroup at */
+ u64 started_at;
+
+ /* whether the current cgroup is draining */
+ bool draining;
+
+ /* the CPUs that are currently active on the cgroup */
+ u32 active_mask;
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __uint(max_entries, MAX_CPUS / 2);
+ __type(key, u32);
+ __type(value, struct pair_ctx);
+} pair_ctx SEC(".maps");
+
+/* queue of cgrp_q's possibly with tasks on them */
+struct {
+ __uint(type, BPF_MAP_TYPE_QUEUE);
+ /*
+ * Because it's difficult to build strong synchronization encompassing
+ * multiple non-trivial operations in BPF, this queue is managed in an
+ * opportunistic way so that we guarantee that a cgroup w/ active tasks
+ * is always on it but possibly multiple times. Once we have more robust
+ * synchronization constructs and e.g. linked list, we should be able to
+ * do this in a prettier way but for now just size it big enough.
+ */
+ __uint(max_entries, 4 * MAX_CGRPS);
+ __type(value, u64);
+} top_q SEC(".maps");
+
+/* per-cgroup q which FIFOs the tasks from the cgroup */
+struct cgrp_q {
+ __uint(type, BPF_MAP_TYPE_QUEUE);
+ __uint(max_entries, MAX_QUEUED);
+ __type(value, u32);
+};
+
+/*
+ * Ideally, we want to allocate cgrp_q and cgrp_q_len in the cgroup local
+ * storage; however, a cgroup local storage can only be accessed from the BPF
+ * progs attached to the cgroup. For now, work around by allocating an array of
+ * cgrp_q's and then allocating per-cgroup indices.
+ *
+ * Another caveat: It's difficult to populate a large array of maps statically
+ * or from BPF. Initialize it from userland.
+ */
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
+ __uint(max_entries, MAX_CGRPS);
+ __type(key, s32);
+ __array(values, struct cgrp_q);
+} cgrp_q_arr SEC(".maps");
+
+static u64 cgrp_q_len[MAX_CGRPS];
+
+/*
+ * This and cgrp_q_idx_hash combine into a poor man's IDR. This likely would be
+ * useful to have as a map type.
+ */
+static u32 cgrp_q_idx_cursor;
+static u64 cgrp_q_idx_busy[MAX_CGRPS];
+
+/*
+ * All added up, the following is what we do:
+ *
+ * 1. When a cgroup is enabled, round-robin through cgrp_q_idx_busy with cmpxchg
+ * looking for a free ID. If not found, fail cgroup creation with -EBUSY.
+ *
+ * 2. Hash the cgroup ID to the allocated cgrp_q_idx in the following
+ * cgrp_q_idx_hash.
+ *
+ * 3. Whenever a cgrp_q needs to be accessed, first look up the cgrp_q_idx from
+ * cgrp_q_idx_hash and then access the corresponding entry in cgrp_q_arr.
+ *
+ * This is sadly complicated for something pretty simple. Hopefully, we should
+ * be able to simplify in the future.
+ */
+struct {
+ __uint(type, BPF_MAP_TYPE_HASH);
+ __uint(max_entries, MAX_CGRPS);
+ __uint(key_size, sizeof(u64)); /* cgrp ID */
+ __uint(value_size, sizeof(s32)); /* cgrp_q idx */
+} cgrp_q_idx_hash SEC(".maps");
+
+/* statistics */
+u64 nr_total, nr_dispatched, nr_missing, nr_kicks, nr_preemptions;
+u64 nr_exps, nr_exp_waits, nr_exp_empty;
+u64 nr_cgrp_next, nr_cgrp_coll, nr_cgrp_empty;
+
+struct user_exit_info uei;
+
+static bool time_before(u64 a, u64 b)
+{
+ return (s64)(a - b) < 0;
+}
+
+void BPF_STRUCT_OPS(pair_enqueue, struct task_struct *p, u64 enq_flags)
+{
+ s32 pid = p->pid;
+ u64 cgid = p->sched_task_group->css.cgroup->kn->id;
+ u32 *q_idx;
+ struct cgrp_q *cgq;
+ u64 *cgq_len;
+
+ __sync_fetch_and_add(&nr_total, 1);
+
+ /* find the cgroup's q and push @p into it */
+ q_idx = bpf_map_lookup_elem(&cgrp_q_idx_hash, &cgid);
+ if (!q_idx) {
+ scx_bpf_error("failed to lookup q_idx for cgroup[%llu]", cgid);
+ return;
+ }
+
+ cgq = bpf_map_lookup_elem(&cgrp_q_arr, q_idx);
+ if (!cgq) {
+ scx_bpf_error("failed to lookup q_arr for cgroup[%llu] q_idx[%u]",
+ cgid, *q_idx);
+ return;
+ }
+
+ if (bpf_map_push_elem(cgq, &pid, 0)) {
+ scx_bpf_error("cgroup[%llu] queue overflow", cgid);
+ return;
+ }
+
+ /* bump q len, if going 0 -> 1, queue cgroup into the top_q */
+ cgq_len = MEMBER_VPTR(cgrp_q_len, [*q_idx]);
+ if (!cgq_len) {
+ scx_bpf_error("MEMBER_VPTR malfunction");
+ return;
+ }
+
+ if (!__sync_fetch_and_add(cgq_len, 1) &&
+ bpf_map_push_elem(&top_q, &cgid, 0)) {
+ scx_bpf_error("top_q overflow");
+ return;
+ }
+}
+
+/* find the next cgroup to execute and return it in *data */
+static int next_cgid_loopfn(u32 idx, void *data)
+{
+ u64 cgid;
+ u32 *q_idx;
+ u64 *cgq_len;
+
+ if (bpf_map_pop_elem(&top_q, &cgid))
+ return 1;
+
+ q_idx = bpf_map_lookup_elem(&cgrp_q_idx_hash, &cgid);
+ if (!q_idx)
+ return 0;
+
+ /* this is the only place where empty cgroups are taken off the top_q */
+ cgq_len = MEMBER_VPTR(cgrp_q_len, [*q_idx]);
+ if (!cgq_len || !*cgq_len)
+ return 0;
+
+ /* if it has any tasks, requeue as we may race and not execute it */
+ bpf_map_push_elem(&top_q, &cgid, 0);
+ *(u64 *)data = cgid;
+ return 1;
+}
+
+struct claim_task_loopctx {
+ u32 q_idx;
+ bool claimed;
+};
+
+/* claim one task from the specified cgq */
+static int claim_task_loopfn(u32 idx, void *data)
+{
+ struct claim_task_loopctx *claimc = data;
+ u64 *cgq_len;
+ u64 len;
+
+ cgq_len = MEMBER_VPTR(cgrp_q_len, [claimc->q_idx]);
+ if (!cgq_len)
+ return 1;
+
+ len = *cgq_len;
+ if (!len)
+ return 1;
+
+ if (__sync_val_compare_and_swap(cgq_len, len, len - 1) != len)
+ return 0;
+
+ claimc->claimed = true;
+ return 1;
+}
+
+static int lookup_pairc_and_mask(s32 cpu, struct pair_ctx **pairc, u32 *mask)
+{
+ u32 *vptr, in_pair_mask;
+ int err;
+
+ vptr = (u32 *)MEMBER_VPTR(pair_id, [cpu]);
+ if (!vptr)
+ return -EINVAL;
+
+ *pairc = bpf_map_lookup_elem(&pair_ctx, vptr);
+ if (!(*pairc))
+ return -EINVAL;
+
+ vptr = (u32 *)MEMBER_VPTR(in_pair_idx, [cpu]);
+ if (!vptr)
+ return -EINVAL;
+
+ *mask = 1U << *vptr;
+
+ return 0;
+}
+
+static int dispatch_loopfn(u32 idx, void *data)
+{
+ s32 cpu = *(s32 *)data;
+ struct pair_ctx *pairc;
+ struct bpf_map *cgq_map;
+ struct claim_task_loopctx claimc;
+ struct task_struct *p;
+ u64 now = bpf_ktime_get_ns();
+ bool kick_pair = false;
+ bool expired;
+ u32 *vptr, in_pair_mask;
+ s32 pid;
+ u64 cgid;
+ int ret;
+
+ ret = lookup_pairc_and_mask(cpu, &pairc, &in_pair_mask);
+ if (ret) {
+ scx_bpf_error("failed to lookup pairc and in_pair_mask for cpu[%d]",
+ cpu);
+ return 1;
+ }
+
+ bpf_spin_lock(&pairc->lock);
+ pairc->active_mask &= ~in_pair_mask;
+
+ expired = time_before(pairc->started_at + pair_batch_dur_ns, now);
+ if (expired || pairc->draining) {
+ u64 new_cgid = 0;
+
+ __sync_fetch_and_add(&nr_exps, 1);
+
+ /*
+ * We're done with the current cgid. An obvious optimization
+ * would be not draining if the next cgroup is the current one.
+ * For now, be dumb and always expire.
+ */
+ pairc->draining = true;
+
+ if (pairc->active_mask) {
+ /*
+ * The other CPU is still active. We want to wait until
+ * this cgroup expires.
+ *
+ * If the pair controls its CPU, and the time already
+ * expired, kick. When the other CPU arrives at
+ * dispatch and clears its active mask, it'll push the
+ * pair to the next cgroup and kick this CPU.
+ */
+ __sync_fetch_and_add(&nr_exp_waits, 1);
+ bpf_spin_unlock(&pairc->lock);
+ if (expired)
+ kick_pair = true;
+ goto out_maybe_kick;
+ }
+
+ bpf_spin_unlock(&pairc->lock);
+
+ /*
+ * Pick the next cgroup. It'd be easier / cleaner to not drop
+ * pairc->lock and use stronger synchronization here especially
+ * given that we'll be switching cgroups significantly less
+ * frequently than tasks. Unfortunately, bpf_spin_lock can't
+ * really protect anything non-trivial. Let's do opportunistic
+ * operations instead.
+ */
+ bpf_loop(1 << 23, next_cgid_loopfn, &new_cgid, 0);
+ /* no active cgroup, go idle */
+ if (!new_cgid) {
+ __sync_fetch_and_add(&nr_exp_empty, 1);
+ return 1;
+ }
+
+ bpf_spin_lock(&pairc->lock);
+
+ /*
+ * The other CPU may already have started on a new cgroup while
+ * we dropped the lock. Make sure that we're still draining and
+ * start on the new cgroup.
+ */
+ if (pairc->draining && !pairc->active_mask) {
+ __sync_fetch_and_add(&nr_cgrp_next, 1);
+ pairc->cgid = new_cgid;
+ pairc->started_at = now;
+ pairc->draining = false;
+ kick_pair = true;
+ } else {
+ __sync_fetch_and_add(&nr_cgrp_coll, 1);
+ }
+ }
+
+ cgid = pairc->cgid;
+ pairc->active_mask |= in_pair_mask;
+ bpf_spin_unlock(&pairc->lock);
+
+ /* again, it'd be better to do all these with the lock held, oh well */
+ vptr = bpf_map_lookup_elem(&cgrp_q_idx_hash, &cgid);
+ if (!vptr) {
+ scx_bpf_error("failed to lookup q_idx for cgroup[%llu]", cgid);
+ return 1;
+ }
+
+ claimc = (struct claim_task_loopctx){ .q_idx = *vptr };
+ bpf_loop(1 << 23, claim_task_loopfn, &claimc, 0);
+ if (!claimc.claimed) {
+ /* the cgroup must be empty, expire and repeat */
+ __sync_fetch_and_add(&nr_cgrp_empty, 1);
+ bpf_spin_lock(&pairc->lock);
+ pairc->draining = true;
+ pairc->active_mask &= ~in_pair_mask;
+ bpf_spin_unlock(&pairc->lock);
+ return 0;
+ }
+
+ cgq_map = bpf_map_lookup_elem(&cgrp_q_arr, &claimc.q_idx);
+ if (!cgq_map) {
+ scx_bpf_error("failed to lookup cgq_map for cgroup[%llu] q_idx[%d]",
+ cgid, claimc.q_idx);
+ return 1;
+ }
+
+ if (bpf_map_pop_elem(cgq_map, &pid)) {
+ scx_bpf_error("cgq_map is empty for cgroup[%llu] q_idx[%d]",
+ cgid, claimc.q_idx);
+ return 1;
+ }
+
+ p = scx_bpf_find_task_by_pid(pid);
+ if (p) {
+ __sync_fetch_and_add(&nr_dispatched, 1);
+ scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, 0);
+ } else {
+ /* we don't handle dequeues, retry on lost tasks */
+ __sync_fetch_and_add(&nr_missing, 1);
+ return 0;
+ }
+
+out_maybe_kick:
+ if (kick_pair) {
+ s32 *pair = (s32 *)MEMBER_VPTR(pair_cpu, [cpu]);
+ if (pair) {
+ __sync_fetch_and_add(&nr_kicks, 1);
+ scx_bpf_kick_cpu(*pair, SCX_KICK_PREEMPT);
+ }
+ }
+ return 1;
+}
+
+void BPF_STRUCT_OPS(pair_dispatch, s32 cpu, struct task_struct *prev)
+{
+ s32 cpu_on_stack = cpu;
+
+ bpf_loop(1 << 23, dispatch_loopfn, &cpu_on_stack, 0);
+}
+
+static int alloc_cgrp_q_idx_loopfn(u32 idx, void *data)
+{
+ u32 q_idx;
+
+ q_idx = __sync_fetch_and_add(&cgrp_q_idx_cursor, 1) % MAX_CGRPS;
+ if (!__sync_val_compare_and_swap(&cgrp_q_idx_busy[q_idx], 0, 1)) {
+ *(s32 *)data = q_idx;
+ return 1;
+ }
+ return 0;
+}
+
+s32 BPF_STRUCT_OPS(pair_cgroup_init, struct cgroup *cgrp)
+{
+ u64 cgid = cgrp->kn->id;
+ s32 q_idx = -1;
+
+ bpf_loop(MAX_CGRPS, alloc_cgrp_q_idx_loopfn, &q_idx, 0);
+ if (q_idx < 0)
+ return -EBUSY;
+
+ if (bpf_map_update_elem(&cgrp_q_idx_hash, &cgid, &q_idx, BPF_ANY)) {
+ u64 *busy = MEMBER_VPTR(cgrp_q_idx_busy, [q_idx]);
+ if (busy)
+ *busy = 0;
+ return -EBUSY;
+ }
+
+ return 0;
+}
+
+void BPF_STRUCT_OPS(pair_cgroup_exit, struct cgroup *cgrp)
+{
+ u64 cgid = cgrp->kn->id;
+ s32 *q_idx;
+
+ q_idx = bpf_map_lookup_elem(&cgrp_q_idx_hash, &cgid);
+ if (q_idx) {
+ u64 *busy = MEMBER_VPTR(cgrp_q_idx_busy, [*q_idx]);
+ if (busy)
+ *busy = 0;
+ bpf_map_delete_elem(&cgrp_q_idx_hash, &cgid);
+ }
+}
+
+s32 BPF_STRUCT_OPS(pair_init)
+{
+ if (switch_all)
+ scx_bpf_switch_all();
+ return 0;
+}
+
+void BPF_STRUCT_OPS(pair_exit, struct scx_exit_info *ei)
+{
+ uei_record(&uei, ei);
+}
+
+SEC(".struct_ops")
+struct sched_ext_ops pair_ops = {
+ .enqueue = (void *)pair_enqueue,
+ .dispatch = (void *)pair_dispatch,
+ .cgroup_init = (void *)pair_cgroup_init,
+ .cgroup_exit = (void *)pair_cgroup_exit,
+ .init = (void *)pair_init,
+ .exit = (void *)pair_exit,
+ .name = "pair",
+};
diff --git a/tools/sched_ext/scx_example_pair.c b/tools/sched_ext/scx_example_pair.c
new file mode 100644
index 000000000000..255ea7b1235d
--- /dev/null
+++ b/tools/sched_ext/scx_example_pair.c
@@ -0,0 +1,143 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2022 Tejun Heo <[email protected]>
+ * Copyright (c) 2022 David Vernet <[email protected]>
+ */
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <unistd.h>
+#include <signal.h>
+#include <assert.h>
+#include <libgen.h>
+#include <bpf/bpf.h>
+#include "user_exit_info.h"
+#include "scx_example_pair.h"
+#include "scx_example_pair.skel.h"
+
+const char help_fmt[] =
+"A demo sched_ext core-scheduler which always makes every sibling CPU pair\n"
+"execute from the same CPU cgroup.\n"
+"\n"
+"See the top-level comment in .bpf.c for more details.\n"
+"\n"
+"Usage: %s [-a] [-S STRIDE]\n"
+"\n"
+" -a Switch all tasks\n"
+" -S STRIDE Override CPU pair stride (default: nr_cpu_ids / 2)\n"
+" -h Display this help and exit\n";
+
+static volatile int exit_req;
+
+static void sigint_handler(int dummy)
+{
+ exit_req = 1;
+}
+
+int main(int argc, char **argv)
+{
+ struct scx_example_pair *skel;
+ struct bpf_link *link;
+ u64 seq = 0;
+ s32 stride, i, opt, outer_fd;
+
+ signal(SIGINT, sigint_handler);
+ signal(SIGTERM, sigint_handler);
+
+ libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
+
+ skel = scx_example_pair__open();
+ assert(skel);
+
+ skel->rodata->nr_cpu_ids = libbpf_num_possible_cpus();
+
+ /* pair up the earlier half to the latter by default, override with -S */
+ stride = skel->rodata->nr_cpu_ids / 2;
+
+ while ((opt = getopt(argc, argv, "ahS:")) != -1) {
+ switch (opt) {
+ case 'a':
+ skel->rodata->switch_all = true;
+ break;
+ case 'S':
+ stride = strtoul(optarg, NULL, 0);
+ break;
+ default:
+ fprintf(stderr, help_fmt, basename(argv[0]));
+ return opt != 'h';
+ }
+ }
+
+ for (i = 0; i < skel->rodata->nr_cpu_ids; i++) {
+ if (skel->rodata->pair_cpu[i] < 0) {
+ skel->rodata->pair_cpu[i] = i + stride;
+ skel->rodata->pair_cpu[i + stride] = i;
+ skel->rodata->pair_id[i] = i;
+ skel->rodata->pair_id[i + stride] = i;
+ skel->rodata->in_pair_idx[i] = 0;
+ skel->rodata->in_pair_idx[i + stride] = 1;
+ }
+ }
+
+ assert(!scx_example_pair__load(skel));
+
+ /*
+ * Populate the cgrp_q_arr map which is an array containing per-cgroup
+ * queues. It'd probably be better to do this from BPF but there are too
+ * many to initialize statically and there's no way to dynamically
+ * populate from BPF.
+ */
+ outer_fd = bpf_map__fd(skel->maps.cgrp_q_arr);
+ assert(outer_fd >= 0);
+
+ printf("Initializing");
+ for (i = 0; i < MAX_CGRPS; i++) {
+ s32 inner_fd;
+
+ if (exit_req)
+ break;
+
+ inner_fd = bpf_map_create(BPF_MAP_TYPE_QUEUE, NULL, 0,
+ sizeof(u32), MAX_QUEUED, NULL);
+ assert(inner_fd >= 0);
+ assert(!bpf_map_update_elem(outer_fd, &i, &inner_fd, BPF_ANY));
+ close(inner_fd);
+
+ if (!(i % 10))
+ printf(".");
+ fflush(stdout);
+ }
+ printf("\n");
+
+ /*
+ * Fully initialized, attach and run.
+ */
+ link = bpf_map__attach_struct_ops(skel->maps.pair_ops);
+ assert(link);
+
+ while (!exit_req && !uei_exited(&skel->bss->uei)) {
+ printf("[SEQ %lu]\n", seq++);
+ printf(" total:%10lu dispatch:%10lu missing:%10lu\n",
+ skel->bss->nr_total,
+ skel->bss->nr_dispatched,
+ skel->bss->nr_missing);
+ printf(" kicks:%10lu preemptions:%7lu\n",
+ skel->bss->nr_kicks,
+ skel->bss->nr_preemptions);
+ printf(" exp:%10lu exp_wait:%10lu exp_empty:%10lu\n",
+ skel->bss->nr_exps,
+ skel->bss->nr_exp_waits,
+ skel->bss->nr_exp_empty);
+ printf("cgnext:%10lu cgcoll:%10lu cgempty:%10lu\n",
+ skel->bss->nr_cgrp_next,
+ skel->bss->nr_cgrp_coll,
+ skel->bss->nr_cgrp_empty);
+ fflush(stdout);
+ sleep(1);
+ }
+
+ bpf_link__destroy(link);
+ uei_print(&skel->bss->uei);
+ scx_example_pair__destroy(skel);
+ return 0;
+}
diff --git a/tools/sched_ext/scx_example_pair.h b/tools/sched_ext/scx_example_pair.h
new file mode 100644
index 000000000000..f60b824272f7
--- /dev/null
+++ b/tools/sched_ext/scx_example_pair.h
@@ -0,0 +1,10 @@
+#ifndef __SCX_EXAMPLE_PAIR_H
+#define __SCX_EXAMPLE_PAIR_H
+
+enum {
+ MAX_CPUS = 4096,
+ MAX_QUEUED = 4096,
+ MAX_CGRPS = 4096,
+};
+
+#endif /* __SCX_EXAMPLE_PAIR_H */
--
2.38.1
From: David Vernet <[email protected]>
If SCX_KICK_WAIT is set when calling scx_bpf_kick_cpu(), the invoking CPU will
busy-wait for the kicked CPU to enter the scheduler. This will be used to
improve the exclusion guarantees in scx_example_pair.
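For illustration, the wait handshake can be modeled in userspace with C11
atomics. This is a minimal sketch under assumptions, not the kernel code:
struct toy_rq and the toy_* helpers are hypothetical names, the counter stands
in for rq->scx.pnt_seq, and the spin is bounded by max_spins so that the example
terminates, whereas the loop in kick_cpus_irq_workfn() spins with cpu_relax()
until the counter changes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-rq pick-next-task sequence counter. */
struct toy_rq {
	_Atomic uint64_t pnt_seq;
};

/* Target side: models the counter bump done when the next task is picked. */
static void toy_notify_pick_next_task(struct toy_rq *rq)
{
	uint64_t seq = atomic_load_explicit(&rq->pnt_seq, memory_order_relaxed);

	/* Release store: pairs with the acquire load in toy_kick_done(). */
	atomic_store_explicit(&rq->pnt_seq, seq + 1, memory_order_release);
}

/* Kicker side, step 1: snapshot the target's counter before kicking it. */
static uint64_t toy_kick_snapshot(struct toy_rq *rq)
{
	return atomic_load_explicit(&rq->pnt_seq, memory_order_relaxed);
}

/* Kicker side, step 2: has the target picked a task since the snapshot? */
static bool toy_kick_done(struct toy_rq *rq, uint64_t snap)
{
	return atomic_load_explicit(&rq->pnt_seq, memory_order_acquire) != snap;
}

/*
 * Bounded busy-wait (assumption: the bound exists only to keep the sketch
 * terminating). Returns true once the counter has moved past the snapshot.
 */
static bool toy_kick_wait(struct toy_rq *rq, uint64_t snap,
			  unsigned long max_spins)
{
	unsigned long i;

	assert(rq);
	for (i = 0; i < max_spins; i++) {
		if (toy_kick_done(rq, snap))
			return true;
	}
	return false;
}
```

In this model a kicker calls toy_kick_snapshot() before requesting the
reschedule and toy_kick_wait() afterwards, while the target calls
toy_notify_pick_next_task() each time it picks its next task.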
Signed-off-by: David Vernet <[email protected]>
Reviewed-by: Tejun Heo <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
kernel/sched/core.c | 4 +++-
kernel/sched/ext.c | 36 ++++++++++++++++++++++++++++++++++--
kernel/sched/ext.h | 20 ++++++++++++++++++++
kernel/sched/sched.h | 2 ++
4 files changed, 59 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 79560641a61f..ea4f6edfcf32 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5886,8 +5886,10 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
for_each_active_class(class) {
p = class->pick_next_task(rq);
- if (p)
+ if (p) {
+ scx_notify_pick_next_task(rq, p, class);
return p;
+ }
}
BUG(); /* The idle class should always have a runnable task. */
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index bd03b55fbcf5..aeaad3d8b05a 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -109,8 +109,9 @@ unsigned long last_timeout_check = INITIAL_JIFFIES;
static struct delayed_work check_timeout_work;
-/* idle tracking */
#ifdef CONFIG_SMP
+
+/* idle tracking */
#ifdef CONFIG_CPUMASK_OFFSTACK
#define CL_ALIGNED_IF_ONSTACK
#else
@@ -123,7 +124,11 @@ static struct {
} idle_masks CL_ALIGNED_IF_ONSTACK;
static bool __cacheline_aligned_in_smp has_idle_cpus;
-#endif
+
+/* for %SCX_KICK_WAIT */
+static u64 __percpu *kick_cpus_pnt_seqs;
+
+#endif /* CONFIG_SMP */
/*
* Direct dispatch marker.
@@ -2959,6 +2964,7 @@ static const struct sysrq_key_op sysrq_sched_ext_reset_op = {
static void kick_cpus_irq_workfn(struct irq_work *irq_work)
{
struct rq *this_rq = this_rq();
+ u64 *pseqs = this_cpu_ptr(kick_cpus_pnt_seqs);
int this_cpu = cpu_of(this_rq);
int cpu;
@@ -2972,14 +2978,32 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work)
if (cpumask_test_cpu(cpu, this_rq->scx.cpus_to_preempt) &&
rq->curr->sched_class == &ext_sched_class)
rq->curr->scx.slice = 0;
+ pseqs[cpu] = rq->scx.pnt_seq;
resched_curr(rq);
+ } else {
+ cpumask_clear_cpu(cpu, this_rq->scx.cpus_to_wait);
}
raw_spin_rq_unlock_irqrestore(rq, flags);
}
+ for_each_cpu_andnot(cpu, this_rq->scx.cpus_to_wait,
+ cpumask_of(this_cpu)) {
+ /*
+ * Pairs with smp_store_release() issued by the target CPU in
+ * scx_notify_pick_next_task() on the resched path.
+ *
+ * We busy-wait here to guarantee that no other task can be
+ * scheduled on our core before the target CPU has entered the
+ * resched path.
+ */
+ while (smp_load_acquire(&cpu_rq(cpu)->scx.pnt_seq) == pseqs[cpu])
+ cpu_relax();
+ }
+
cpumask_clear(this_rq->scx.cpus_to_kick);
cpumask_clear(this_rq->scx.cpus_to_preempt);
+ cpumask_clear(this_rq->scx.cpus_to_wait);
}
#endif
@@ -2999,6 +3023,11 @@ void __init init_sched_ext_class(void)
#ifdef CONFIG_SMP
BUG_ON(!alloc_cpumask_var(&idle_masks.cpu, GFP_KERNEL));
BUG_ON(!alloc_cpumask_var(&idle_masks.smt, GFP_KERNEL));
+
+ kick_cpus_pnt_seqs = __alloc_percpu(sizeof(kick_cpus_pnt_seqs[0]) *
+ num_possible_cpus(),
+ __alignof__(kick_cpus_pnt_seqs[0]));
+ BUG_ON(!kick_cpus_pnt_seqs);
#endif
for_each_possible_cpu(cpu) {
struct rq *rq = cpu_rq(cpu);
@@ -3009,6 +3038,7 @@ void __init init_sched_ext_class(void)
#ifdef CONFIG_SMP
BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_kick, GFP_KERNEL));
BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_preempt, GFP_KERNEL));
+ BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_wait, GFP_KERNEL));
init_irq_work(&rq->scx.kick_cpus_irq_work, kick_cpus_irq_workfn);
#endif
}
@@ -3228,6 +3258,8 @@ void scx_bpf_kick_cpu(s32 cpu, u64 flags)
cpumask_set_cpu(cpu, rq->scx.cpus_to_kick);
if (flags & SCX_KICK_PREEMPT)
cpumask_set_cpu(cpu, rq->scx.cpus_to_preempt);
+ if (flags & SCX_KICK_WAIT)
+ cpumask_set_cpu(cpu, rq->scx.cpus_to_wait);
irq_work_queue(&rq->scx.kick_cpus_irq_work);
preempt_enable();
diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h
index 470b2224cdfa..8ae717c5e850 100644
--- a/kernel/sched/ext.h
+++ b/kernel/sched/ext.h
@@ -66,6 +66,7 @@ enum scx_tg_flags {
enum scx_kick_flags {
SCX_KICK_PREEMPT = 1LLU << 0, /* force scheduling on the CPU */
+ SCX_KICK_WAIT = 1LLU << 1, /* wait for the CPU to be rescheduled */
};
#ifdef CONFIG_SCHED_CLASS_EXT
@@ -107,6 +108,22 @@ __printf(2, 3) void scx_ops_error_type(enum scx_exit_type type,
#define scx_ops_error(fmt, args...) \
scx_ops_error_type(SCX_EXIT_ERROR, fmt, ##args)
+static inline void scx_notify_pick_next_task(struct rq *rq,
+ const struct task_struct *p,
+ const struct sched_class *active)
+{
+#ifdef CONFIG_SMP
+ if (!scx_enabled())
+ return;
+ /*
+ * Pairs with the smp_load_acquire() issued by a CPU in
+ * kick_cpus_irq_workfn() that is waiting for this CPU to perform a
+ * resched.
+ */
+ smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
+#endif
+}
+
static inline void scx_notify_sched_tick(void)
{
unsigned long last_check, timeout;
@@ -164,6 +181,9 @@ static inline int scx_check_setscheduler(struct task_struct *p,
int policy) { return 0; }
static inline bool scx_can_stop_tick(struct rq *rq) { return true; }
static inline void init_sched_ext_class(void) {}
+static inline void scx_notify_pick_next_task(struct rq *rq,
+ const struct task_struct *p,
+ const struct sched_class *active) {}
static inline void scx_notify_sched_tick(void) {}
#define for_each_active_class for_each_class
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a2ffa94ede02..5af758cc1e38 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -714,6 +714,8 @@ struct scx_rq {
#ifdef CONFIG_SMP
cpumask_var_t cpus_to_kick;
cpumask_var_t cpus_to_preempt;
+ cpumask_var_t cpus_to_wait;
+ u64 pnt_seq;
struct irq_work kick_cpus_irq_work;
#endif
};
--
2.38.1
Implement a new scheduler class sched_ext (SCX), which allows scheduling
policies to be implemented as BPF programs to achieve the following:
1. Ease of experimentation and exploration: Enabling rapid iteration of new
scheduling policies.
2. Customization: Building application-specific schedulers which implement
policies that are not applicable to general-purpose schedulers.
3. Rapid scheduler deployments: Non-disruptive swap outs of scheduling
policies in production environments.
sched_ext leverages BPF’s struct_ops feature to define a structure which
exports function callbacks and flags to BPF programs that wish to implement
scheduling policies. The struct_ops structure exported by sched_ext is
struct sched_ext_ops, and is conceptually similar to struct sched_class. The
role of sched_ext is to map the complex sched_class callbacks to the simpler
and more ergonomic struct sched_ext_ops callbacks.
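As a rough illustration of an operations table in which only the name is
mandatory and every callback falls back to a default, consider the following
userspace sketch. All identifiers here (struct toy_ops, toy_ops_register(),
TOY_DSQ_GLOBAL and so on) are hypothetical and heavily simplified; the real
struct sched_ext_ops has different members and signatures, and the kernel may
handle missing callbacks differently from the fill-in approach shown here.

```c
#include <assert.h>
#include <stddef.h>

enum { TOY_DSQ_GLOBAL = 0 };

/* Hypothetical, heavily reduced analogue of an ops table. */
struct toy_ops {
	int (*init)(void);		/* optional */
	int (*enqueue)(int pid);	/* optional, returns the chosen queue id */
	const char *name;		/* the only mandatory member */
};

/* Default used when the scheduler does not provide init(). */
static int toy_default_init(void)
{
	return 0;
}

/* Default used when enqueue() is omitted: everything goes to one queue. */
static int toy_default_enqueue(int pid)
{
	(void)pid;
	return TOY_DSQ_GLOBAL;
}

/* Reject a nameless table, then fill every omitted callback with a default. */
static int toy_ops_register(struct toy_ops *ops)
{
	if (!ops->name || !ops->name[0])
		return -1;
	if (!ops->init)
		ops->init = toy_default_init;
	if (!ops->enqueue)
		ops->enqueue = toy_default_enqueue;
	return 0;
}

/* Call through the table; valid only after a successful registration. */
static int toy_ops_enqueue(const struct toy_ops *ops, int pid)
{
	assert(ops->enqueue);
	return ops->enqueue(pid);
}
```

A table initialized as { .name = "minimal" } registers successfully in this
model and enqueues every task to TOY_DSQ_GLOBAL, while a table without a name
is rejected.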
For more detailed discussion on the motivations and overview, please refer
to the cover letter.
Later patches will also add several example schedulers and documentation.
This patch implements the minimum core framework to enable implementation of
BPF schedulers. Subsequent patches will gradually add functionalities
including safety guarantee mechanisms, nohz and cgroup support.
include/linux/sched/ext.h defines struct sched_ext_ops. With the comment on
top, each operation should be self-explanatory. The following are worth
noting:
* Both "sched_ext" and its shorthand "scx" are used. If the identifier
already has "sched" in it, "ext" is used; otherwise, "scx".
* In sched_ext_ops, only .name is mandatory. Every operation is optional and,
if omitted, a simple but functional default behavior is provided.
* A new policy constant SCHED_EXT is added and a task can select sched_ext
by invoking sched_setscheduler(2) with the new policy constant. However,
if the BPF scheduler is not loaded, SCHED_EXT is the same as SCHED_NORMAL
and the task is scheduled by CFS. When the BPF scheduler is loaded, all
tasks which have the SCHED_EXT policy are switched to sched_ext.
* To bridge the workflow imbalance between the scheduler core and
sched_ext_ops callbacks, sched_ext uses simple FIFOs called dispatch
queues (dsq's). By default, there is one global dsq (SCX_DSQ_GLOBAL), and
one local per-CPU dsq (SCX_DSQ_LOCAL). SCX_DSQ_GLOBAL is provided for
convenience and need not be used by a scheduler that doesn't require it.
SCX_DSQ_LOCAL is the per-CPU FIFO that sched_ext pulls from when putting
the next task on the CPU. The BPF scheduler can manage an arbitrary number
of dsq's using scx_bpf_create_dsq() and scx_bpf_destroy_dsq().
* sched_ext guarantees system integrity no matter what the BPF scheduler
does. To enable this, each task's ownership is tracked through
p->scx.ops_state and all tasks are put on scx_tasks list. The disable path
can always recover and revert all tasks back to CFS. See p->scx.ops_state
and scx_tasks.
* A task is not tied to its rq while enqueued. This decouples CPU selection
from queueing and allows sharing a scheduling queue across an arbitrary
subset of CPUs. This adds some complexities as a task may need to be
bounced between rq's right before it starts executing. See
dispatch_to_local_dsq() and move_task_to_local_dsq().
* One complication that arises from the above weak association between task
and rq is that synchronizing with dequeue() gets complicated as dequeue()
may happen anytime while the task is enqueued and the dispatch path might
need to release the rq lock to transfer the task. Solving this requires a
bit of complexity. See the logic around p->scx.sticky_cpu and
p->scx.ops_qseq.
* Both enable and disable paths are a bit complicated. The enable path
switches all tasks without blocking to avoid issues which can arise from
partially switched states (e.g. the switching task itself being starved).
The disable path can't trust the BPF scheduler at all, so it also has to
guarantee forward progress without blocking. See scx_ops_enable() and
scx_ops_disable_workfn().
* When sched_ext is disabled, static_branches are used to shut down the
entry points from hot paths.
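
As an illustration of the dsq bullet above, the built-in dsq IDs follow the
64-bit layout documented in include/linux/sched/ext.h in this patch. The
following is a minimal userspace sketch, not part of the patch: the constants
are copied from enum scx_dsq_id_flags, while the dsq_*() helper names are
hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values copied from enum scx_dsq_id_flags in include/linux/sched/ext.h. */
#define SCX_DSQ_FLAG_BUILTIN	(1ULL << 63)
#define SCX_DSQ_FLAG_LOCAL_ON	(1ULL << 62)
#define SCX_DSQ_INVALID		(SCX_DSQ_FLAG_BUILTIN | 0)
#define SCX_DSQ_GLOBAL		(SCX_DSQ_FLAG_BUILTIN | 1)
#define SCX_DSQ_LOCAL		(SCX_DSQ_FLAG_BUILTIN | 2)
#define SCX_DSQ_LOCAL_ON	(SCX_DSQ_FLAG_BUILTIN | SCX_DSQ_FLAG_LOCAL_ON)
#define SCX_DSQ_LOCAL_CPU_MASK	0xffffffffULL

/* Hypothetical helper: bit 63 distinguishes built-in from ops-created dsq's. */
static bool dsq_is_builtin(uint64_t id)
{
	return (id & SCX_DSQ_FLAG_BUILTIN) != 0;
}

/* Hypothetical helper: LOCAL_ON IDs have both bit 63 and bit 62 set. */
static bool dsq_is_local_on(uint64_t id)
{
	return (id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON;
}

/* Hypothetical helper: build the ID of the local dsq of a specific CPU. */
static uint64_t dsq_local_on(uint32_t cpu)
{
	return SCX_DSQ_LOCAL_ON | cpu;
}

/* Hypothetical helper: recover the CPU number from a LOCAL_ON ID. */
static uint32_t dsq_local_on_cpu(uint64_t id)
{
	assert(dsq_is_local_on(id));
	return (uint32_t)(id & SCX_DSQ_LOCAL_CPU_MASK);
}
```

Any ID with bit 63 clear is an ops-created dsq, so a BPF scheduler can pick
IDs for scx_bpf_create_dsq() from that range without colliding with the
built-in ones.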
Signed-off-by: Tejun Heo <[email protected]>
Co-authored-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
include/asm-generic/vmlinux.lds.h | 1 +
include/linux/sched.h | 5 +
include/linux/sched/ext.h | 386 +++-
include/uapi/linux/sched.h | 1 +
init/init_task.c | 10 +
kernel/Kconfig.preempt | 4 +-
kernel/bpf/bpf_struct_ops_types.h | 4 +
kernel/sched/build_policy.c | 4 +
kernel/sched/core.c | 65 +-
kernel/sched/debug.c | 6 +
kernel/sched/ext.c | 2780 +++++++++++++++++++++++++++++
kernel/sched/ext.h | 109 +-
kernel/sched/sched.h | 19 +
13 files changed, 3388 insertions(+), 6 deletions(-)
create mode 100644 kernel/sched/ext.c
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index d06ada2341cb..cfbfc47692eb 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -131,6 +131,7 @@
*(__dl_sched_class) \
*(__rt_sched_class) \
*(__fair_sched_class) \
+ *(__ext_sched_class) \
*(__idle_sched_class) \
__sched_class_lowest = .;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ffb6eb55cd13..a9ecf6de0e86 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -70,6 +70,8 @@ struct signal_struct;
struct task_delay_info;
struct task_group;
+#include <linux/sched/ext.h>
+
/*
* Task state bitmask. NOTE! These bits are also
* encoded in fs/proc/array.c: get_task_state().
@@ -788,6 +790,9 @@ struct task_struct {
struct sched_entity se;
struct sched_rt_entity rt;
struct sched_dl_entity dl;
+#ifdef CONFIG_SCHED_CLASS_EXT
+ struct sched_ext_entity scx;
+#endif
const struct sched_class *sched_class;
#ifdef CONFIG_SCHED_CORE
diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index a05dfcf533b0..e2e743ccd00d 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -1,9 +1,393 @@
/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2022 Tejun Heo <[email protected]>
+ * Copyright (c) 2022 David Vernet <[email protected]>
+ */
#ifndef _LINUX_SCHED_EXT_H
#define _LINUX_SCHED_EXT_H
#ifdef CONFIG_SCHED_CLASS_EXT
-#error "NOT IMPLEMENTED YET"
+
+#include <linux/rhashtable.h>
+#include <linux/llist.h>
+
+enum scx_consts {
+ SCX_OPS_NAME_LEN = 128,
+ SCX_EXIT_REASON_LEN = 128,
+ SCX_EXIT_BT_LEN = 64,
+ SCX_EXIT_MSG_LEN = 1024,
+
+ SCX_SLICE_DFL = 20 * NSEC_PER_MSEC,
+};
+
+/*
+ * DSQ (dispatch queue) IDs are 64bit of the format:
+ *
+ * Bits: [63] [62 .. 0]
+ * [ B] [ ID ]
+ *
+ * B: 1 for IDs for built-in DSQs, 0 for ops-created user DSQs
+ * ID: 63 bit ID
+ *
+ * Built-in IDs:
+ *
+ * Bits: [63] [62] [61..32] [31 .. 0]
+ * [ 1] [ L] [ R ] [ V ]
+ *
+ * 1: 1 for built-in DSQs.
+ * L: 1 for LOCAL_ON DSQ IDs, 0 for others
+ * V: For LOCAL_ON DSQ IDs, a CPU number. For others, a pre-defined value.
+ */
+enum scx_dsq_id_flags {
+ SCX_DSQ_FLAG_BUILTIN = 1LLU << 63,
+ SCX_DSQ_FLAG_LOCAL_ON = 1LLU << 62,
+
+ SCX_DSQ_INVALID = SCX_DSQ_FLAG_BUILTIN | 0,
+ SCX_DSQ_GLOBAL = SCX_DSQ_FLAG_BUILTIN | 1,
+ SCX_DSQ_LOCAL = SCX_DSQ_FLAG_BUILTIN | 2,
+ SCX_DSQ_LOCAL_ON = SCX_DSQ_FLAG_BUILTIN | SCX_DSQ_FLAG_LOCAL_ON,
+ SCX_DSQ_LOCAL_CPU_MASK = 0xffffffffLLU,
+};
+
+enum scx_exit_type {
+ SCX_EXIT_NONE,
+ SCX_EXIT_DONE,
+
+ SCX_EXIT_UNREG = 64, /* BPF unregistration */
+
+ SCX_EXIT_ERROR = 1024, /* runtime error, error msg contains details */
+ SCX_EXIT_ERROR_BPF, /* ERROR but triggered through scx_bpf_error() */
+};
+
+/*
+ * scx_exit_info is passed to ops.exit() to describe why the BPF scheduler is
+ * being disabled.
+ */
+struct scx_exit_info {
+ /* %SCX_EXIT_* - broad category of the exit reason */
+ enum scx_exit_type type;
+ /* textual representation of the above */
+ char reason[SCX_EXIT_REASON_LEN];
+ /* number of entries in the backtrace */
+ u32 bt_len;
+ /* backtrace if exiting due to an error */
+ unsigned long bt[SCX_EXIT_BT_LEN];
+ /* extra message */
+ char msg[SCX_EXIT_MSG_LEN];
+};
+
+/* sched_ext_ops.flags */
+enum scx_ops_flags {
+ /*
+ * Keep built-in idle tracking even if ops.update_idle() is implemented.
+ */
+ SCX_OPS_KEEP_BUILTIN_IDLE = 1LLU << 0,
+
+ /*
+ * By default, if there is no other task to run on the CPU, ext core
+ * keeps running the current task even after its slice expires. If this
+ * flag is specified, such tasks are passed to ops.enqueue() with
+ * %SCX_ENQ_LAST. See the comment above %SCX_ENQ_LAST for more info.
+ */
+ SCX_OPS_ENQ_LAST = 1LLU << 1,
+
+ /*
+ * An exiting task may schedule after PF_EXITING is set. In such cases,
+ * scx_bpf_find_task_by_pid() may not be able to find the task and if
+ * the BPF scheduler depends on pid lookup for dispatching, the task
+ * will be lost leading to various issues including RCU grace period
+ * stalls.
+ *
+ * To mask this problem, by default, unhashed tasks are automatically
+ * dispatched to the local dsq on enqueue. If the BPF scheduler doesn't
+ * depend on pid lookups and wants to handle these tasks directly, the
+ * following flag can be used.
+ */
+ SCX_OPS_ENQ_EXITING = 1LLU << 2,
+
+ SCX_OPS_ALL_FLAGS = SCX_OPS_KEEP_BUILTIN_IDLE |
+ SCX_OPS_ENQ_LAST |
+ SCX_OPS_ENQ_EXITING,
+};
+
+/* argument container for ops.enable() and friends */
+struct scx_enable_args {
+ /* empty for now */
+};
+
+/**
+ * struct sched_ext_ops - Operation table for BPF scheduler implementation
+ *
+ * Userland can implement an arbitrary scheduling policy by implementing and
+ * loading operations in this table.
+ */
+struct sched_ext_ops {
+ /**
+ * select_cpu - Pick the target CPU for a task which is being woken up
+ * @p: task being woken up
+ * @prev_cpu: the cpu @p was on before sleeping
+ * @wake_flags: SCX_WAKE_*
+ *
+ * Decision made here isn't final. @p may be moved to any CPU while it
+ * is getting dispatched for execution later. However, as @p is not on
+ * the rq at this point, getting the eventual execution CPU right here
+ * saves a small bit of overhead down the line.
+ *
+ * If an idle CPU is returned, the CPU is kicked and will try to
+ * dispatch. While an explicit custom mechanism can be added,
+ * select_cpu() serves as the default way to wake up idle CPUs.
+ */
+ s32 (*select_cpu)(struct task_struct *p, s32 prev_cpu, u64 wake_flags);
+
+ /**
+ * enqueue - Enqueue a task on the BPF scheduler
+ * @p: task being enqueued
+ * @enq_flags: %SCX_ENQ_*
+ *
+ * @p is ready to run. Dispatch directly by calling scx_bpf_dispatch()
+ * or enqueue on the BPF scheduler. If not directly dispatched, the BPF
+ * scheduler owns @p and if it fails to dispatch @p, the task will
+ * stall.
+ */
+ void (*enqueue)(struct task_struct *p, u64 enq_flags);
+
+ /**
+ * dequeue - Remove a task from the BPF scheduler
+ * @p: task being dequeued
+ * @deq_flags: %SCX_DEQ_*
+ *
+ * Remove @p from the BPF scheduler. This is usually called to isolate
+ * the task while updating its scheduling properties (e.g. priority).
+ *
+ * The ext core keeps track of whether the BPF side owns a given task or
+ * not and can gracefully ignore spurious dispatches from BPF side,
+ * which makes it safe to not implement this method. However, depending
+ * on the scheduling logic, this can lead to confusing behaviors - e.g.
+ * scheduling position not being updated across a priority change.
+ */
+ void (*dequeue)(struct task_struct *p, u64 deq_flags);
+
+ /**
+ * dispatch - Dispatch tasks from the BPF scheduler into dsq's
+ * @cpu: CPU to dispatch tasks for
+ * @prev: previous task being switched out
+ *
+ * Called when a CPU can't find a task to execute after ops.consume().
+ * The operation should dispatch one or more tasks from the BPF
+ * scheduler to the dsq's using scx_bpf_dispatch(). The maximum number
+ * of tasks which can be dispatched in a single call is specified by the
+ * @dispatch_max_batch field of this struct.
+ */
+ void (*dispatch)(s32 cpu, struct task_struct *prev);
+
+ /**
+ * consume - Consume tasks from the dsq's to the local dsq for execution
+ * @cpu: CPU to consume tasks for
+ *
+ * Called when a CPU's local dsq is empty. The operation should transfer
+ * one or more tasks from the dsq's to the CPU's local dsq using
+ * scx_bpf_consume(). If this function fails to fill the local dsq,
+ * ops.dispatch() will be called.
+ *
+ * This operation is unnecessary if the BPF scheduler always dispatches
+ * either to one of the local dsq's or the global dsq. If implemented,
+ * this operation is also responsible for consuming the global dsq.
+ */
+ void (*consume)(s32 cpu);
+
+ /**
+ * consume_final - Final consume call before going idle
+ * @cpu: CPU to consume tasks for
+ *
+ * After ops.consume() and ops.dispatch(), @cpu still doesn't have a task
+ * to execute and is about to go idle. This operation can be used to
+ * implement more aggressive consumption strategies. Otherwise
+ * equivalent to ops.consume().
+ */
+ void (*consume_final)(s32 cpu);
+
+ /**
+ * yield - Yield CPU
+ * @from: yielding task
+ * @to: optional yield target task
+ *
+ * If @to is NULL, @from is yielding the CPU to other runnable tasks.
+ * The BPF scheduler should ensure that other available tasks are
+ * dispatched before the yielding task. Return value is ignored in this
+ * case.
+ *
+ * If @to is not-NULL, @from wants to yield the CPU to @to. If the BPF
+ * scheduler can implement the request, return %true; otherwise, %false.
+ */
+ bool (*yield)(struct task_struct *from, struct task_struct *to);
+
+ /**
+ * set_cpumask - Set CPU affinity
+ * @p: task to set CPU affinity for
+ * @cpumask: cpumask of cpus that @p can run on
+ *
+ * Update @p's CPU affinity to @cpumask.
+ */
+ void (*set_cpumask)(struct task_struct *p, struct cpumask *cpumask);
+
+ /**
+ * update_idle - Update the idle state of a CPU
+ * @cpu: CPU to update the idle state for
+ * @idle: whether entering or exiting the idle state
+ *
+ * This operation is called when @cpu enters or leaves the idle
+ * state. By default, implementing this operation disables the built-in
+ * idle CPU tracking and the following helpers become unavailable:
+ *
+ * - scx_bpf_select_cpu_dfl()
+ * - scx_bpf_test_and_clear_cpu_idle()
+ * - scx_bpf_pick_idle_cpu()
+ * - scx_bpf_any_idle_cpu()
+ *
+ * The user also must implement ops.select_cpu() as the default
+ * implementation relies on scx_bpf_select_cpu_dfl().
+ *
+ * To keep the built-in idle tracking, specify the
+ * %SCX_OPS_KEEP_BUILTIN_IDLE flag.
+ */
+ void (*update_idle)(s32 cpu, bool idle);
+
+ /**
+ * prep_enable - Prepare to enable BPF scheduling for a task
+ * @p: task to prepare BPF scheduling for
+ * @args: enable arguments, see the struct definition
+ *
+ * Either we're loading a BPF scheduler or a new task is being forked.
+ * Prepare BPF scheduling for @p. This operation may block and can be
+ * used for allocations.
+ *
+ * Return 0 for success, -errno for failure. An error return while
+ * loading will abort loading of the BPF scheduler. An error return during
+ * a fork will abort the specific fork.
+ */
+ s32 (*prep_enable)(struct task_struct *p, struct scx_enable_args *args);
+
+ /**
+ * enable - Enable BPF scheduling for a task
+ * @p: task to enable BPF scheduling for
+ * @args: enable arguments, see the struct definition
+ *
+ * Enable @p for BPF scheduling. @p will start running soon.
+ */
+ void (*enable)(struct task_struct *p, struct scx_enable_args *args);
+
+ /**
+ * cancel_enable - Cancel prep_enable()
+ * @p: task being canceled
+ * @args: enable arguments, see the struct definition
+ *
+ * @p was prep_enable()'d but failed before reaching enable(). Undo the
+ * preparation.
+ */
+ void (*cancel_enable)(struct task_struct *p,
+ struct scx_enable_args *args);
+
+ /**
+ * disable - Disable BPF scheduling for a task
+ * @p: task to disable BPF scheduling for
+ *
+ * @p is exiting, leaving SCX or the BPF scheduler is being unloaded.
+ * Disable BPF scheduling for @p.
+ */
+ void (*disable)(struct task_struct *p);
+
+ /*
+ * All online ops must come before ops.init().
+ */
+
+ /**
+ * init - Initialize the BPF scheduler
+ */
+ s32 (*init)(void);
+
+ /**
+ * exit - Clean up after the BPF scheduler
+ * @info: Exit info
+ */
+ void (*exit)(struct scx_exit_info *info);
+
+ /**
+ * dispatch_max_batch - Max nr of tasks that dispatch() can dispatch
+ */
+ u32 dispatch_max_batch;
+
+ /**
+ * flags - %SCX_OPS_* flags
+ */
+ u64 flags;
+
+ /**
+ * name - BPF scheduler's name
+ *
+ * Must be a non-zero valid BPF object name including only isalnum(),
+ * '_' and '.' chars. Shows up in kernel.sched_ext_ops sysctl while the
+ * BPF scheduler is enabled.
+ */
+ char name[SCX_OPS_NAME_LEN];
+};
+
+/*
+ * Dispatch queue (dsq) is a simple FIFO which is used to buffer between the
+ * scheduler core and the BPF scheduler. See the documentation for more details.
+ */
+struct scx_dispatch_q {
+ raw_spinlock_t lock;
+ struct list_head fifo;
+ u64 id;
+ u32 nr;
+ struct rhash_head hash_node;
+ struct list_head all_node;
+ struct llist_node free_node;
+ struct rcu_head rcu;
+};
+
+/* sched_ext_entity.flags */
+enum scx_ent_flags {
+ SCX_TASK_QUEUED = 1 << 0, /* on ext runqueue */
+ SCX_TASK_BAL_KEEP = 1 << 1, /* balance decided to keep current */
+ SCX_TASK_ENQ_LOCAL = 1 << 2, /* used by scx_select_cpu_dfl() to set SCX_ENQ_LOCAL */
+
+ SCX_TASK_OPS_PREPPED = 1 << 3, /* prepared for BPF scheduler enable */
+ SCX_TASK_OPS_ENABLED = 1 << 4, /* task has BPF scheduler enabled */
+
+ SCX_TASK_CURSOR = 1 << 6, /* iteration cursor, not a task */
+};
+
+/*
+ * The following is embedded in task_struct and contains all fields necessary
+ * for a task to be scheduled by SCX.
+ */
+struct sched_ext_entity {
+ struct scx_dispatch_q *dsq;
+ struct list_head dsq_node;
+ u32 flags; /* protected by rq lock */
+ u32 weight;
+ s32 sticky_cpu;
+ s32 holding_cpu;
+ atomic64_t ops_state;
+
+ /* BPF scheduler modifiable fields */
+
+ /*
+ * Runtime budget in nsecs. This is usually set through
+ * scx_bpf_dispatch() but can also be modified directly by the BPF
+ * scheduler. Automatically decreased by SCX as the task executes. On
+ * depletion, a scheduling event is triggered.
+ */
+ u64 slice;
+
+ /* cold fields */
+ struct list_head tasks_node;
+};
+
+void sched_ext_free(struct task_struct *p);
+
#else /* !CONFIG_SCHED_CLASS_EXT */
static inline void sched_ext_free(struct task_struct *p) {}
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 3bac0a8ceab2..359a14cc76a4 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -118,6 +118,7 @@ struct clone_args {
/* SCHED_ISO: reserved but not implemented yet */
#define SCHED_IDLE 5
#define SCHED_DEADLINE 6
+#define SCHED_EXT 7
/* Can be ORed in to make sure the process is reverted back to SCHED_NORMAL on fork */
#define SCHED_RESET_ON_FORK 0x40000000
diff --git a/init/init_task.c b/init/init_task.c
index ff6c4b9bfe6b..bdbc663107bf 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -6,6 +6,7 @@
#include <linux/sched/sysctl.h>
#include <linux/sched/rt.h>
#include <linux/sched/task.h>
+#include <linux/sched/ext.h>
#include <linux/init.h>
#include <linux/fs.h>
#include <linux/mm.h>
@@ -101,6 +102,15 @@ struct task_struct init_task
#endif
#ifdef CONFIG_CGROUP_SCHED
.sched_task_group = &root_task_group,
+#endif
+#ifdef CONFIG_SCHED_CLASS_EXT
+ .scx = {
+ .dsq_node = LIST_HEAD_INIT(init_task.scx.dsq_node),
+ .sticky_cpu = -1,
+ .holding_cpu = -1,
+ .ops_state = ATOMIC_INIT(0),
+ .slice = SCX_SLICE_DFL,
+ },
#endif
.ptraced = LIST_HEAD_INIT(init_task.ptraced),
.ptrace_entry = LIST_HEAD_INIT(init_task.ptrace_entry),
diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index c2f1fd95a821..50eb26da4f84 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -133,4 +133,6 @@ config SCHED_CORE
which is the likely usage by Linux distributions, there should
be no measurable impact on performance.
-
+config SCHED_CLASS_EXT
+ bool "Extensible Scheduling Class"
+ depends on BPF_SYSCALL && BPF_JIT
diff --git a/kernel/bpf/bpf_struct_ops_types.h b/kernel/bpf/bpf_struct_ops_types.h
index 5678a9ddf817..3618769d853d 100644
--- a/kernel/bpf/bpf_struct_ops_types.h
+++ b/kernel/bpf/bpf_struct_ops_types.h
@@ -9,4 +9,8 @@ BPF_STRUCT_OPS_TYPE(bpf_dummy_ops)
#include <net/tcp.h>
BPF_STRUCT_OPS_TYPE(tcp_congestion_ops)
#endif
+#ifdef CONFIG_SCHED_CLASS_EXT
+#include <linux/sched/ext.h>
+BPF_STRUCT_OPS_TYPE(sched_ext_ops)
+#endif
#endif
diff --git a/kernel/sched/build_policy.c b/kernel/sched/build_policy.c
index d9dc9ab3773f..4c658b21f603 100644
--- a/kernel/sched/build_policy.c
+++ b/kernel/sched/build_policy.c
@@ -28,6 +28,7 @@
#include <linux/suspend.h>
#include <linux/tsacct_kern.h>
#include <linux/vtime.h>
+#include <linux/percpu-rwsem.h>
#include <uapi/linux/sched/types.h>
@@ -52,3 +53,6 @@
#include "cputime.c"
#include "deadline.c"
+#ifdef CONFIG_SCHED_CLASS_EXT
+# include "ext.c"
+#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b9e69e009343..dc499f18573a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4363,6 +4363,17 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->rt.on_rq = 0;
p->rt.on_list = 0;
+#ifdef CONFIG_SCHED_CLASS_EXT
+ p->scx.dsq = NULL;
+ INIT_LIST_HEAD(&p->scx.dsq_node);
+ p->scx.flags = 0;
+ p->scx.weight = 0;
+ p->scx.sticky_cpu = -1;
+ p->scx.holding_cpu = -1;
+ atomic64_set(&p->scx.ops_state, 0);
+ p->scx.slice = SCX_SLICE_DFL;
+#endif
+
#ifdef CONFIG_PREEMPT_NOTIFIERS
INIT_HLIST_HEAD(&p->preempt_notifiers);
#endif
@@ -4599,6 +4610,10 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
goto out_cancel;
} else if (rt_prio(p->prio)) {
p->sched_class = &rt_sched_class;
+#ifdef CONFIG_SCHED_CLASS_EXT
+ } else if (task_on_scx(p)) {
+ p->sched_class = &ext_sched_class;
+#endif
} else {
p->sched_class = &fair_sched_class;
}
@@ -6869,6 +6884,10 @@ void __setscheduler_prio(struct task_struct *p, int prio)
p->sched_class = &dl_sched_class;
else if (rt_prio(prio))
p->sched_class = &rt_sched_class;
+#ifdef CONFIG_SCHED_CLASS_EXT
+ else if (task_on_scx(p))
+ p->sched_class = &ext_sched_class;
+#endif
else
p->sched_class = &fair_sched_class;
@@ -8784,6 +8803,7 @@ SYSCALL_DEFINE1(sched_get_priority_max, int, policy)
case SCHED_NORMAL:
case SCHED_BATCH:
case SCHED_IDLE:
+ case SCHED_EXT:
ret = 0;
break;
}
@@ -8811,6 +8831,7 @@ SYSCALL_DEFINE1(sched_get_priority_min, int, policy)
case SCHED_NORMAL:
case SCHED_BATCH:
case SCHED_IDLE:
+ case SCHED_EXT:
ret = 0;
}
return ret;
@@ -9654,8 +9675,13 @@ void __init sched_init(void)
int i;
/* Make sure the linker didn't screw up */
- BUG_ON(&idle_sched_class != &fair_sched_class + 1 ||
- &fair_sched_class != &rt_sched_class + 1 ||
+#ifdef CONFIG_SCHED_CLASS_EXT
+ BUG_ON(&idle_sched_class != &ext_sched_class + 1 ||
+ &ext_sched_class != &fair_sched_class + 1);
+#else
+ BUG_ON(&idle_sched_class != &fair_sched_class + 1);
+#endif
+ BUG_ON(&fair_sched_class != &rt_sched_class + 1 ||
&rt_sched_class != &dl_sched_class + 1);
#ifdef CONFIG_SMP
BUG_ON(&dl_sched_class != &stop_sched_class + 1);
@@ -11242,3 +11268,38 @@ void call_trace_sched_update_nr_running(struct rq *rq, int count)
{
trace_sched_update_nr_running_tp(rq, count);
}
+
+#ifdef CONFIG_SCHED_CLASS_EXT
+void sched_deq_and_put_task(struct task_struct *p, int queue_flags,
+ struct sched_enq_and_set_ctx *ctx)
+{
+ struct rq *rq = task_rq(p);
+
+ lockdep_assert_rq_held(rq);
+
+ *ctx = (struct sched_enq_and_set_ctx){
+ .p = p,
+ .queue_flags = queue_flags | DEQUEUE_NOCLOCK,
+ .queued = task_on_rq_queued(p),
+ .running = task_current(rq, p),
+ };
+
+ update_rq_clock(rq);
+ if (ctx->queued)
+ dequeue_task(rq, p, queue_flags);
+ if (ctx->running)
+ put_prev_task(rq, p);
+}
+
+void sched_enq_and_set_task(struct sched_enq_and_set_ctx *ctx)
+{
+ struct rq *rq = task_rq(ctx->p);
+
+ lockdep_assert_rq_held(rq);
+
+ if (ctx->queued)
+ enqueue_task(rq, ctx->p, ctx->queue_flags);
+ if (ctx->running)
+ set_next_task(rq, ctx->p);
+}
+#endif /* CONFIG_SCHED_CLASS_EXT */
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 1637b65ba07a..814ed80b8ff6 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -338,6 +338,9 @@ static __init int sched_init_debug(void)
debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
+#ifdef CONFIG_SCHED_CLASS_EXT
+ debugfs_create_file("ext", 0444, debugfs_sched, NULL, &sched_ext_fops);
+#endif
return 0;
}
late_initcall(sched_init_debug);
@@ -1047,6 +1050,9 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
P(dl.runtime);
P(dl.deadline);
}
+#ifdef CONFIG_SCHED_CLASS_EXT
+ __PS("ext.enabled", p->sched_class == &ext_sched_class);
+#endif
#undef PN_SCHEDSTAT
#undef P_SCHEDSTAT
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
new file mode 100644
index 000000000000..f42464d66de4
--- /dev/null
+++ b/kernel/sched/ext.c
@@ -0,0 +1,2780 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2022 Tejun Heo <[email protected]>
+ * Copyright (c) 2022 David Vernet <[email protected]>
+ */
+#define SCX_OP_IDX(op) (offsetof(struct sched_ext_ops, op) / sizeof(void (*)(void)))
+
+enum scx_internal_consts {
+ SCX_NR_ONLINE_OPS = SCX_OP_IDX(init),
+ SCX_DSP_DFL_MAX_BATCH = 32,
+};
+
+enum scx_ops_enable_state {
+ SCX_OPS_PREPPING,
+ SCX_OPS_ENABLING,
+ SCX_OPS_ENABLED,
+ SCX_OPS_DISABLING,
+ SCX_OPS_DISABLED,
+};
+
+/*
+ * sched_ext_entity->ops_state
+ *
+ * Used to track the task ownership between the SCX core and the BPF scheduler.
+ * State transitions look as follows:
+ *
+ * NONE -> QUEUEING -> QUEUED -> DISPATCHING -> NONE
+ * ^ | |
+ * | v v
+ * \-------------------------------/
+ *
+ * QUEUEING and DISPATCHING states can be waited upon. See wait_ops_state() call
+ * sites for explanations on the conditions being waited upon and why they are
+ * safe. Transitions out of them into NONE or QUEUED must store_release and the
+ * waiters should load_acquire.
+ *
+ * Tracking scx_ops_state enables sched_ext core to reliably determine whether
+ * any given task can be dispatched by the BPF scheduler at all times and thus
+ * relaxes the requirements on the BPF scheduler. This allows the BPF scheduler
+ * to try to dispatch any task anytime regardless of its state as the SCX core
+ * can safely reject invalid dispatches.
+ */
+enum scx_ops_state {
+ SCX_OPSS_NONE, /* owned by the SCX core */
+ SCX_OPSS_QUEUEING, /* in transit to the BPF scheduler */
+ SCX_OPSS_QUEUED, /* owned by the BPF scheduler */
+ SCX_OPSS_DISPATCHING, /* in transit back to the SCX core */
+
+ /*
+ * QSEQ brands each QUEUED instance so that, when dispatch races
+ * dequeue/requeue, the dispatcher can tell whether it still has a claim
+ * on the task being dispatched.
+ */
+ SCX_OPSS_QSEQ_SHIFT = 2,
+ SCX_OPSS_STATE_MASK = (1LLU << SCX_OPSS_QSEQ_SHIFT) - 1,
+ SCX_OPSS_QSEQ_MASK = ~SCX_OPSS_STATE_MASK,
+};
+
+/*
+ * During exit, a task may schedule after losing its PIDs. When disabling the
+ * BPF scheduler, we need to be able to iterate tasks in every state to
+ * guarantee system safety. Maintain a dedicated task list which contains every
+ * task between its fork and eventual free.
+ */
+static DEFINE_SPINLOCK(scx_tasks_lock);
+static LIST_HEAD(scx_tasks);
+
+/* ops enable/disable */
+static struct kthread_worker *scx_ops_helper;
+static DEFINE_MUTEX(scx_ops_enable_mutex);
+DEFINE_STATIC_KEY_FALSE(__scx_ops_enabled);
+DEFINE_STATIC_PERCPU_RWSEM(scx_fork_rwsem);
+static atomic_t scx_ops_enable_state_var = ATOMIC_INIT(SCX_OPS_DISABLED);
+static struct sched_ext_ops scx_ops;
+static bool warned_zero_slice;
+
+static DEFINE_STATIC_KEY_FALSE(scx_ops_enq_last);
+static DEFINE_STATIC_KEY_FALSE(scx_ops_enq_exiting);
+static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_enabled);
+
+struct static_key_false scx_has_op[SCX_NR_ONLINE_OPS] =
+ { [0 ... SCX_NR_ONLINE_OPS-1] = STATIC_KEY_FALSE_INIT };
+
+static atomic_t scx_exit_type = ATOMIC_INIT(SCX_EXIT_DONE);
+static struct scx_exit_info scx_exit_info;
+
+static atomic64_t scx_nr_rejected = ATOMIC64_INIT(0);
+
+/* idle tracking */
+#ifdef CONFIG_SMP
+#ifdef CONFIG_CPUMASK_OFFSTACK
+#define CL_ALIGNED_IF_ONSTACK
+#else
+#define CL_ALIGNED_IF_ONSTACK __cacheline_aligned_in_smp
+#endif
+
+static struct {
+ cpumask_var_t cpu;
+ cpumask_var_t smt;
+} idle_masks CL_ALIGNED_IF_ONSTACK;
+
+static bool __cacheline_aligned_in_smp has_idle_cpus;
+#endif
+
+/*
+ * Direct dispatch marker.
+ *
+ * Non-NULL values are used for direct dispatch from enqueue path. A valid
+ * pointer points to the task currently being enqueued. An ERR_PTR value is used
+ * to indicate that direct dispatch has already happened.
+ */
+static DEFINE_PER_CPU(struct task_struct *, direct_dispatch_task);
+
+/* dispatch queues */
+static struct scx_dispatch_q __cacheline_aligned_in_smp scx_dsq_global;
+
+static const struct rhashtable_params dsq_hash_params = {
+ .key_len = 8,
+ .key_offset = offsetof(struct scx_dispatch_q, id),
+ .head_offset = offsetof(struct scx_dispatch_q, hash_node),
+};
+
+static struct rhashtable dsq_hash;
+static DEFINE_RAW_SPINLOCK(all_dsqs_lock);
+static LIST_HEAD(all_dsqs);
+static LLIST_HEAD(dsqs_to_free);
+
+/* dispatch buf */
+struct dispatch_buf_ent {
+ struct task_struct *task;
+ u64 qseq;
+ u64 dsq_id;
+ u64 enq_flags;
+};
+
+static u32 dispatch_max_batch;
+static struct dispatch_buf_ent __percpu *dispatch_buf;
+static DEFINE_PER_CPU(u32, dispatch_buf_cursor);
+
+/* consume context */
+struct consume_ctx {
+ struct rq *rq;
+ struct rq_flags *rf;
+};
+
+static DEFINE_PER_CPU(struct consume_ctx, consume_ctx);
+
+void scx_bpf_dispatch(struct task_struct *p, u64 dsq_id, u64 slice,
+ u64 enq_flags);
+__printf(2, 3) static void scx_ops_error_type(enum scx_exit_type type,
+ const char *fmt, ...);
+#define scx_ops_error(fmt, args...) \
+ scx_ops_error_type(SCX_EXIT_ERROR, fmt, ##args)
+
+struct scx_task_iter {
+ struct sched_ext_entity cursor;
+ struct task_struct *locked;
+ struct rq *rq;
+ struct rq_flags rf;
+};
+
+/**
+ * scx_task_iter_init - Initialize a task iterator
+ * @iter: iterator to init
+ *
+ * Initialize @iter. Must be called with scx_tasks_lock held. Once initialized,
+ * @iter must eventually be exited with scx_task_iter_exit().
+ *
+ * scx_tasks_lock may be released between this and the first next() call or
+ * between any two next() calls. If scx_tasks_lock is released between two
+ * next() calls, the caller is responsible for ensuring that the task being
+ * iterated remains accessible either through RCU read lock or obtaining a
+ * reference count.
+ *
+ * All tasks which existed when the iteration started are guaranteed to be
+ * visited as long as they still exist.
+ */
+static void scx_task_iter_init(struct scx_task_iter *iter)
+{
+ lockdep_assert_held(&scx_tasks_lock);
+
+ iter->cursor = (struct sched_ext_entity){ .flags = SCX_TASK_CURSOR };
+ list_add(&iter->cursor.tasks_node, &scx_tasks);
+ iter->locked = NULL;
+}
+
+/**
+ * scx_task_iter_exit - Exit a task iterator
+ * @iter: iterator to exit
+ *
+ * Exit a previously initialized @iter. Must be called with scx_tasks_lock held.
+ * If the iterator holds a task's rq lock, that rq lock is released. See
+ * scx_task_iter_init() for details.
+ */
+static void scx_task_iter_exit(struct scx_task_iter *iter)
+{
+ struct list_head *cursor = &iter->cursor.tasks_node;
+
+ lockdep_assert_held(&scx_tasks_lock);
+
+ if (iter->locked) {
+ task_rq_unlock(iter->rq, iter->locked, &iter->rf);
+ iter->locked = NULL;
+ }
+
+ if (list_empty(cursor))
+ return;
+
+ list_del_init(cursor);
+}
+
+/**
+ * scx_task_iter_next - Next task
+ * @iter: iterator to walk
+ *
+ * Visit the next task. See scx_task_iter_init() for details.
+ */
+static struct task_struct *scx_task_iter_next(struct scx_task_iter *iter)
+{
+ struct list_head *cursor = &iter->cursor.tasks_node;
+ struct sched_ext_entity *pos;
+
+ lockdep_assert_held(&scx_tasks_lock);
+
+ list_for_each_entry(pos, cursor, tasks_node) {
+ if (&pos->tasks_node == &scx_tasks)
+ return NULL;
+ if (!(pos->flags & SCX_TASK_CURSOR)) {
+ list_move(cursor, &pos->tasks_node);
+ return container_of(pos, struct task_struct, scx);
+ }
+ }
+
+ /* can't happen, should always terminate at scx_tasks above */
+ BUG();
+}
+
+/**
+ * scx_task_iter_next_filtered - Next non-idle task
+ * @iter: iterator to walk
+ *
+ * Visit the next non-idle task. See scx_task_iter_init() for details.
+ */
+static struct task_struct *
+scx_task_iter_next_filtered(struct scx_task_iter *iter)
+{
+ struct task_struct *p;
+
+ while ((p = scx_task_iter_next(iter))) {
+ if (!is_idle_task(p))
+ return p;
+ }
+ return NULL;
+}
+
+/**
+ * scx_task_iter_next_filtered_locked - Next non-idle task with its rq locked
+ * @iter: iterator to walk
+ *
+ * Visit the next non-idle task with its rq lock held. See scx_task_iter_init()
+ * for details.
+ */
+static struct task_struct *
+scx_task_iter_next_filtered_locked(struct scx_task_iter *iter)
+{
+ struct task_struct *p;
+
+ if (iter->locked) {
+ task_rq_unlock(iter->rq, iter->locked, &iter->rf);
+ iter->locked = NULL;
+ }
+
+ p = scx_task_iter_next_filtered(iter);
+ if (!p)
+ return NULL;
+
+ iter->rq = task_rq_lock(p, &iter->rf);
+ iter->locked = p;
+ return p;
+}
+
+static enum scx_ops_enable_state scx_ops_enable_state(void)
+{
+ return atomic_read(&scx_ops_enable_state_var);
+}
+
+static enum scx_ops_enable_state
+scx_ops_set_enable_state(enum scx_ops_enable_state to)
+{
+ return atomic_xchg(&scx_ops_enable_state_var, to);
+}
+
+static bool scx_ops_tryset_enable_state(enum scx_ops_enable_state to,
+ enum scx_ops_enable_state from)
+{
+ int from_v = from;
+
+ return atomic_try_cmpxchg(&scx_ops_enable_state_var, &from_v, to);
+}
+
+static bool scx_ops_disabling(void)
+{
+ return unlikely(scx_ops_enable_state() == SCX_OPS_DISABLING);
+}
+
+#define SCX_HAS_OP(op) static_branch_likely(&scx_has_op[SCX_OP_IDX(op)])
+
+static void wait_ops_state(struct task_struct *p, u64 opss)
+{
+ /*
+ * We're waiting for @p to transition out of QUEUEING or DISPATCHING.
+ * load_acquire ensures that we see the updates.
+ */
+ do {
+ cpu_relax();
+ } while (atomic64_read_acquire(&p->scx.ops_state) == opss);
+}
+
+/**
+ * ops_cpu_valid - Verify a cpu number
+ * @cpu: cpu number which came from a BPF ops
+ *
+ * @cpu is a cpu number which came from the BPF scheduler and can be any value.
+ * Verify that it is in range and one of the possible cpus.
+ */
+static bool ops_cpu_valid(s32 cpu)
+{
+ return likely(cpu >= 0 && cpu < nr_cpu_ids && cpu_possible(cpu));
+}
+
+/**
+ * ops_sanitize_err - Sanitize a -errno value
+ * @ops_name: operation to blame on failure
+ * @err: -errno value to sanitize
+ *
+ * Verify @err is a valid -errno. If not, trigger scx_ops_error() and return
+ * -%EPROTO. This is necessary because returning a rogue -errno up the chain can
+ * cause misbehaviors. For example, a large negative return from
+ * ops.prep_enable() triggers an oops when passed up the call chain because the
+ * value fails IS_ERR() test after being encoded with ERR_PTR() and then is
+ * handled as a pointer.
+ */
+static int ops_sanitize_err(const char *ops_name, s32 err)
+{
+ if (err < 0 && err >= -MAX_ERRNO)
+ return err;
+
+ scx_ops_error("ops.%s() returned an invalid errno %d", ops_name, err);
+ return -EPROTO;
+}
+
+static void update_curr_scx(struct rq *rq)
+{
+ struct task_struct *curr = rq->curr;
+ u64 now = rq_clock_task(rq);
+ u64 delta_exec;
+
+ if (time_before_eq64(now, curr->se.exec_start))
+ return;
+
+ delta_exec = now - curr->se.exec_start;
+ curr->se.exec_start = now;
+ curr->se.sum_exec_runtime += delta_exec;
+ account_group_exec_runtime(curr, delta_exec);
+ cgroup_account_cputime(curr, delta_exec);
+
+ curr->scx.slice -= min(curr->scx.slice, delta_exec);
+}
+
+static void dispatch_enqueue(struct scx_dispatch_q *dsq, struct task_struct *p,
+ u64 enq_flags)
+{
+ bool is_local = dsq->id == SCX_DSQ_LOCAL;
+
+ WARN_ON_ONCE(p->scx.dsq || !list_empty(&p->scx.dsq_node));
+
+ if (!is_local) {
+ raw_spin_lock(&dsq->lock);
+ if (unlikely(dsq->id == SCX_DSQ_INVALID)) {
+ scx_ops_error("attempting to dispatch to a destroyed dsq");
+ /* fall back to the global dsq */
+ raw_spin_unlock(&dsq->lock);
+ dsq = &scx_dsq_global;
+ raw_spin_lock(&dsq->lock);
+ }
+ }
+
+ if (enq_flags & SCX_ENQ_HEAD)
+ list_add(&p->scx.dsq_node, &dsq->fifo);
+ else
+ list_add_tail(&p->scx.dsq_node, &dsq->fifo);
+ dsq->nr++;
+ p->scx.dsq = dsq;
+
+ /*
+ * We're transitioning out of QUEUEING or DISPATCHING. store_release to
+ * match waiters' load_acquire.
+ */
+ if (enq_flags & SCX_ENQ_CLEAR_OPSS)
+ atomic64_set_release(&p->scx.ops_state, SCX_OPSS_NONE);
+
+ if (is_local) {
+ struct rq *rq = container_of(dsq, struct rq, scx.local_dsq);
+
+ if (sched_class_above(&ext_sched_class, rq->curr->sched_class))
+ resched_curr(rq);
+ } else {
+ raw_spin_unlock(&dsq->lock);
+ }
+}
+
+static void dispatch_dequeue(struct scx_rq *scx_rq, struct task_struct *p)
+{
+ struct scx_dispatch_q *dsq = p->scx.dsq;
+ bool is_local = dsq == &scx_rq->local_dsq;
+
+ if (!dsq) {
+ WARN_ON_ONCE(!list_empty(&p->scx.dsq_node));
+ /*
+ * When dispatching directly from the BPF scheduler to a local
+ * dsq, the task isn't associated with any dsq but
+ * @p->scx.holding_cpu may be set under the protection of
+ * %SCX_OPSS_DISPATCHING.
+ */
+ if (p->scx.holding_cpu >= 0)
+ p->scx.holding_cpu = -1;
+ return;
+ }
+
+ if (!is_local)
+ raw_spin_lock(&dsq->lock);
+
+ /*
+ * Now that we hold @dsq->lock, @p->scx.holding_cpu and @p->scx.dsq_node
+ * can't change underneath us.
+ */
+ if (p->scx.holding_cpu < 0) {
+ /* @p must still be on @dsq, dequeue */
+ WARN_ON_ONCE(list_empty(&p->scx.dsq_node));
+ list_del_init(&p->scx.dsq_node);
+ dsq->nr--;
+ } else {
+ /*
+ * We're racing against dispatch_to_local_dsq() which already
+ * removed @p from @dsq and set @p->scx.holding_cpu. Clear the
+ * holding_cpu which tells dispatch_to_local_dsq() that it lost
+ * the race.
+ */
+ WARN_ON_ONCE(!list_empty(&p->scx.dsq_node));
+ p->scx.holding_cpu = -1;
+ }
+ p->scx.dsq = NULL;
+
+ if (!is_local)
+ raw_spin_unlock(&dsq->lock);
+}
+
+static struct scx_dispatch_q *find_non_local_dsq(u64 dsq_id)
+{
+ lockdep_assert(rcu_read_lock_any_held());
+
+ if (dsq_id == SCX_DSQ_GLOBAL)
+ return &scx_dsq_global;
+ else
+ return rhashtable_lookup_fast(&dsq_hash, &dsq_id,
+ dsq_hash_params);
+}
+
+static struct scx_dispatch_q *find_dsq_for_dispatch(struct rq *rq, u64 dsq_id,
+ struct task_struct *p)
+{
+ struct scx_dispatch_q *dsq;
+
+ if (dsq_id == SCX_DSQ_LOCAL)
+ return &rq->scx.local_dsq;
+
+ dsq = find_non_local_dsq(dsq_id);
+ if (unlikely(!dsq)) {
+ scx_ops_error("non-existent dsq 0x%llx for %s[%d]",
+ dsq_id, p->comm, p->pid);
+ return &scx_dsq_global;
+ }
+
+ return dsq;
+}
+
+static void direct_dispatch(struct task_struct *ddsp_task, struct task_struct *p,
+ u64 dsq_id, u64 enq_flags)
+{
+ struct scx_dispatch_q *dsq;
+
+ /* @p must match the task which is being enqueued */
+ if (unlikely(p != ddsp_task)) {
+ if (IS_ERR(ddsp_task))
+ scx_ops_error("%s[%d] already direct-dispatched",
+ p->comm, p->pid);
+ else
+ scx_ops_error("enqueueing %s[%d] but trying to direct-dispatch %s[%d]",
+ ddsp_task->comm, ddsp_task->pid,
+ p->comm, p->pid);
+ return;
+ }
+
+ /*
+ * %SCX_DSQ_LOCAL_ON is not supported during direct dispatch because
+ * dispatching to the local dsq of a different CPU requires unlocking
+ * the current rq which isn't allowed in the enqueue path. Use
+ * ops.select_cpu() to be on the target CPU and then %SCX_DSQ_LOCAL.
+ */
+ if (unlikely((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON)) {
+ scx_ops_error("SCX_DSQ_LOCAL_ON can't be used for direct-dispatch");
+ return;
+ }
+
+ dsq = find_dsq_for_dispatch(task_rq(p), dsq_id, p);
+ dispatch_enqueue(dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
+
+ /*
+ * Mark that dispatch already happened by spoiling direct_dispatch_task
+ * with a non-NULL value which can never match a valid task pointer.
+ */
+ __this_cpu_write(direct_dispatch_task, ERR_PTR(-ESRCH));
+}
+
+static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
+ int sticky_cpu)
+{
+ struct task_struct **ddsp_taskp;
+ u64 qseq;
+
+ WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_QUEUED));
+
+ if (p->scx.flags & SCX_TASK_ENQ_LOCAL) {
+ enq_flags |= SCX_ENQ_LOCAL;
+ p->scx.flags &= ~SCX_TASK_ENQ_LOCAL;
+ }
+
+ /* rq migration */
+ if (sticky_cpu == cpu_of(rq))
+ goto local_norefill;
+
+ /*
+ * If !rq->online, we already told the BPF scheduler that the CPU is
+ * offline. We're just trying to on/offline the CPU. Don't bother the
+ * BPF scheduler.
+ */
+ if (unlikely(!rq->online))
+ goto local;
+
+ /* see %SCX_OPS_ENQ_EXITING */
+ if (!static_branch_unlikely(&scx_ops_enq_exiting) &&
+ unlikely(p->flags & PF_EXITING))
+ goto local;
+
+ /* see %SCX_OPS_ENQ_LAST */
+ if (!static_branch_unlikely(&scx_ops_enq_last) &&
+ (enq_flags & SCX_ENQ_LAST))
+ goto local;
+
+ if (!SCX_HAS_OP(enqueue)) {
+ if (enq_flags & SCX_ENQ_LOCAL)
+ goto local;
+ else
+ goto global;
+ }
+
+ /* dsq bypass didn't trigger, enqueue on the BPF scheduler */
+ qseq = rq->scx.ops_qseq++ << SCX_OPSS_QSEQ_SHIFT;
+
+ WARN_ON_ONCE(atomic64_read(&p->scx.ops_state) != SCX_OPSS_NONE);
+ atomic64_set(&p->scx.ops_state, SCX_OPSS_QUEUEING | qseq);
+
+ ddsp_taskp = this_cpu_ptr(&direct_dispatch_task);
+ WARN_ON_ONCE(*ddsp_taskp);
+ *ddsp_taskp = p;
+
+ scx_ops.enqueue(p, enq_flags);
+
+ /*
+ * If not directly dispatched, QUEUEING isn't cleared yet and dispatch or
+ * dequeue may be waiting. The store_release matches their load_acquire.
+ */
+ if (*ddsp_taskp == p)
+ atomic64_set_release(&p->scx.ops_state, SCX_OPSS_QUEUED | qseq);
+ *ddsp_taskp = NULL;
+ return;
+
+local:
+ p->scx.slice = SCX_SLICE_DFL;
+local_norefill:
+ dispatch_enqueue(&rq->scx.local_dsq, p, enq_flags);
+ return;
+
+global:
+ p->scx.slice = SCX_SLICE_DFL;
+ dispatch_enqueue(&scx_dsq_global, p, enq_flags);
+}
+
+static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags)
+{
+ int sticky_cpu = p->scx.sticky_cpu;
+
+ if (sticky_cpu >= 0)
+ p->scx.sticky_cpu = -1;
+
+ /*
+ * Restoring a running task will be immediately followed by
+ * set_next_task_scx() which expects the task to not be on the BPF
+ * scheduler as tasks can only start running through local dsqs. Force
+ * direct-dispatch into the local dsq by setting the sticky_cpu.
+ */
+ if (unlikely(enq_flags & ENQUEUE_RESTORE) && task_current(rq, p))
+ sticky_cpu = cpu_of(rq);
+
+ if (p->scx.flags & SCX_TASK_QUEUED)
+ return;
+
+ p->scx.flags |= SCX_TASK_QUEUED;
+ rq->scx.nr_running++;
+ add_nr_running(rq, 1);
+
+ do_enqueue_task(rq, p, enq_flags, sticky_cpu);
+}
+
+static void dequeue_task_scx(struct rq *rq, struct task_struct *p, int deq_flags)
+{
+ struct scx_rq *scx_rq = &rq->scx;
+ u64 opss;
+
+ if (!(p->scx.flags & SCX_TASK_QUEUED))
+ return;
+
+ /* acquire ensures that we see the preceding updates on QUEUED */
+ opss = atomic64_read_acquire(&p->scx.ops_state);
+
+ switch (opss & SCX_OPSS_STATE_MASK) {
+ case SCX_OPSS_NONE:
+ break;
+ case SCX_OPSS_QUEUEING:
+ BUG();
+ case SCX_OPSS_QUEUED:
+ if (SCX_HAS_OP(dequeue))
+ scx_ops.dequeue(p, deq_flags);
+
+ if (atomic64_try_cmpxchg(&p->scx.ops_state, &opss,
+ SCX_OPSS_NONE))
+ break;
+ fallthrough;
+ case SCX_OPSS_DISPATCHING:
+ /*
+ * If @p is being dispatched from the BPF scheduler to a dsq,
+ * wait for the transfer to complete so that @p doesn't get
+ * added to its dsq after dequeueing is complete.
+ *
+ * As we're waiting on DISPATCHING with @rq locked, the
+ * dispatching side shouldn't try to lock @rq while DISPATCHING
+ * is set. See dispatch_to_local_dsq().
+ */
+ wait_ops_state(p, SCX_OPSS_DISPATCHING);
+ BUG_ON(atomic64_read(&p->scx.ops_state) != SCX_OPSS_NONE);
+ break;
+ }
+
+ p->scx.flags &= ~SCX_TASK_QUEUED;
+ scx_rq->nr_running--;
+ sub_nr_running(rq, 1);
+
+ dispatch_dequeue(scx_rq, p);
+}
+
+static void yield_task_scx(struct rq *rq)
+{
+ struct task_struct *p = rq->curr;
+
+ if (SCX_HAS_OP(yield))
+ scx_ops.yield(p, NULL);
+ else
+ p->scx.slice = 0;
+}
+
+static bool yield_to_task_scx(struct rq *rq, struct task_struct *to)
+{
+ struct task_struct *from = rq->curr;
+
+ if (SCX_HAS_OP(yield))
+ return scx_ops.yield(from, to);
+ else
+ return false;
+}
+
+#ifdef CONFIG_SMP
+/**
+ * move_task_to_local_dsq - Move a task from a different rq to a local dsq
+ * @rq: rq to move the task into, currently locked
+ * @p: task to move
+ *
+ * Move @p which is currently on a different rq to @rq's local dsq. The caller
+ * must:
+ *
+ * 1. Start with exclusive access to @p either through its dsq lock or
+ * %SCX_OPSS_DISPATCHING flag.
+ *
+ * 2. Set @p->scx.holding_cpu to raw_smp_processor_id().
+ *
+ * 3. Remember task_rq(@p). Release the exclusive access so that we don't
+ * deadlock with dequeue.
+ *
+ * 4. Lock @rq and the task_rq from #3.
+ *
+ * 5. Call this function.
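+ *
+ * As a rough sketch of steps 2-5, mirroring what consume_dispatch_q() does
+ * when it starts with @dsq->lock held and task_rq(@p) already remembered
+ * (list removal, lock pinning and retry are omitted here):
+ *
+ *   p->scx.holding_cpu = raw_smp_processor_id();   (step 2)
+ *   raw_spin_unlock(&dsq->lock);                    (step 3)
+ *   double_lock_balance(rq, task_rq);               (step 4)
+ *   moved = move_task_to_local_dsq(rq, p);          (step 5)
+ *   double_unlock_balance(rq, task_rq);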
+ *
+ * Returns %true if @p was successfully moved. %false after racing dequeue and
+ * losing.
+ */
+static bool move_task_to_local_dsq(struct rq *rq, struct task_struct *p)
+{
+ struct rq *task_rq;
+
+ lockdep_assert_rq_held(rq);
+
+ /*
+ * If dequeue got to @p while we were trying to lock both rq's, it'd
+ * have cleared @p->scx.holding_cpu to -1. While other cpus may have
+ * updated it to different values afterwards, as this operation can't be
+ * preempted or recurse, @p->scx.holding_cpu can never become
+ * raw_smp_processor_id() again before we're done. Thus, we can tell
+ * whether we lost to dequeue by testing whether @p->scx.holding_cpu is
+ * still raw_smp_processor_id().
+ *
+ * See dispatch_dequeue() for the counterpart.
+ */
+ if (unlikely(p->scx.holding_cpu != raw_smp_processor_id()))
+ return false;
+
+ /* @p->rq couldn't have changed if we're still the holding cpu */
+ task_rq = task_rq(p);
+ lockdep_assert_rq_held(task_rq);
+
+ WARN_ON_ONCE(!cpumask_test_cpu(cpu_of(rq), p->cpus_ptr));
+ deactivate_task(task_rq, p, 0);
+ set_task_cpu(p, cpu_of(rq));
+ p->scx.sticky_cpu = cpu_of(rq);
+ activate_task(rq, p, 0);
+
+ return true;
+}
+
+/**
+ * dispatch_to_local_dsq_lock - Ensure source and destination rq's are locked
+ * @rq: current rq which is locked
+ * @rf: rq_flags to use when unlocking @rq
+ * @src_rq: rq to move task from
+ * @dst_rq: rq to move task to
+ *
+ * We're holding @rq lock and trying to dispatch a task from @src_rq to
+ * @dst_rq's local dsq and thus need to lock both @src_rq and @dst_rq. Whether
+ * @rq stays locked isn't important as long as the state is restored after
+ * dispatch_to_local_dsq_unlock().
+ */
+static void dispatch_to_local_dsq_lock(struct rq *rq, struct rq_flags *rf,
+ struct rq *src_rq, struct rq *dst_rq)
+{
+ rq_unpin_lock(rq, rf);
+
+ if (src_rq == dst_rq) {
+ raw_spin_rq_unlock(rq);
+ raw_spin_rq_lock(dst_rq);
+ } else if (rq == src_rq) {
+ double_lock_balance(rq, dst_rq);
+ rq_repin_lock(rq, rf);
+ } else if (rq == dst_rq) {
+ double_lock_balance(rq, src_rq);
+ rq_repin_lock(rq, rf);
+ } else {
+ raw_spin_rq_unlock(rq);
+ double_rq_lock(src_rq, dst_rq);
+ }
+}
+
+/**
+ * dispatch_to_local_dsq_unlock - Undo dispatch_to_local_dsq_lock()
+ * @rq: current rq which is locked
+ * @rf: rq_flags to use when unlocking @rq
+ * @src_rq: rq to move task from
+ * @dst_rq: rq to move task to
+ *
+ * Unlock @src_rq and @dst_rq and ensure that @rq is locked on return.
+ */
+static void dispatch_to_local_dsq_unlock(struct rq *rq, struct rq_flags *rf,
+ struct rq *src_rq, struct rq *dst_rq)
+{
+ if (src_rq == dst_rq) {
+ raw_spin_rq_unlock(dst_rq);
+ raw_spin_rq_lock(rq);
+ rq_repin_lock(rq, rf);
+ } else if (rq == src_rq) {
+ double_unlock_balance(rq, dst_rq);
+ } else if (rq == dst_rq) {
+ double_unlock_balance(rq, src_rq);
+ } else {
+ double_rq_unlock(src_rq, dst_rq);
+ raw_spin_rq_lock(rq);
+ rq_repin_lock(rq, rf);
+ }
+}
+#endif
+
+static void set_consume_ctx(struct rq *rq, struct rq_flags *rf)
+{
+ *this_cpu_ptr(&consume_ctx) = (struct consume_ctx){ .rq = rq, .rf = rf };
+}
+
+static bool consume_dispatch_q(struct rq *rq, struct rq_flags *rf,
+ struct scx_dispatch_q *dsq)
+{
+ struct scx_rq *scx_rq = &rq->scx;
+ struct task_struct *p;
+ struct rq *task_rq;
+ bool moved = false;
+retry:
+ if (list_empty(&dsq->fifo))
+ return false;
+
+ raw_spin_lock(&dsq->lock);
+ list_for_each_entry(p, &dsq->fifo, scx.dsq_node) {
+ task_rq = task_rq(p);
+ if (rq == task_rq)
+ goto this_rq;
+ if (likely(rq->online) && !is_migration_disabled(p) &&
+ cpumask_test_cpu(cpu_of(rq), p->cpus_ptr))
+ goto remote_rq;
+ }
+ raw_spin_unlock(&dsq->lock);
+ return false;
+
+this_rq:
+ /* @dsq is locked and @p is on this rq */
+ WARN_ON_ONCE(p->scx.holding_cpu >= 0);
+ list_move_tail(&p->scx.dsq_node, &scx_rq->local_dsq.fifo);
+ dsq->nr--;
+ scx_rq->local_dsq.nr++;
+ p->scx.dsq = &scx_rq->local_dsq;
+ raw_spin_unlock(&dsq->lock);
+ return true;
+
+remote_rq:
+#ifdef CONFIG_SMP
+ /*
+ * @dsq is locked and @p is on a remote rq. @p is currently protected by
+ * @dsq->lock. We want to pull @p to @rq but may deadlock if we grab
+ * @task_rq while holding @dsq and @rq locks. As dequeue can't drop the
+ * rq lock or fail, do a little dancing from our side. See
+ * move_task_to_local_dsq().
+ */
+ WARN_ON_ONCE(p->scx.holding_cpu >= 0);
+ list_del_init(&p->scx.dsq_node);
+ dsq->nr--;
+ p->scx.holding_cpu = raw_smp_processor_id();
+ raw_spin_unlock(&dsq->lock);
+
+ rq_unpin_lock(rq, rf);
+ double_lock_balance(rq, task_rq);
+ rq_repin_lock(rq, rf);
+
+ moved = move_task_to_local_dsq(rq, p);
+
+ double_unlock_balance(rq, task_rq);
+#endif /* CONFIG_SMP */
+ if (likely(moved))
+ return true;
+ goto retry;
+}
+
+enum dispatch_to_local_dsq_ret {
+ DTL_DISPATCHED, /* successfully dispatched */
+ DTL_LOST, /* lost race to dequeue */
+ DTL_NOT_LOCAL, /* destination is not a local dsq */
+ DTL_INVALID, /* invalid local dsq_id */
+};
+
+/**
+ * dispatch_to_local_dsq - Dispatch a task to a local dsq
+ * @rq: current rq which is locked
+ * @rf: rq_flags to use when unlocking @rq
+ * @dsq_id: destination dsq ID
+ * @p: task to dispatch
+ * @enq_flags: %SCX_ENQ_*
+ *
+ * We're holding @rq lock and want to dispatch @p to the local dsq identified by
+ * @dsq_id. This function performs all the synchronization dancing needed
+ * because local dsq's are protected with rq locks.
+ *
+ * The caller must have exclusive ownership of @p (e.g. through
+ * %SCX_OPSS_DISPATCHING).
+ */
+static enum dispatch_to_local_dsq_ret
+dispatch_to_local_dsq(struct rq *rq, struct rq_flags *rf, u64 dsq_id,
+ struct task_struct *p, u64 enq_flags)
+{
+ struct rq *src_rq = task_rq(p);
+ struct rq *dst_rq;
+
+ /*
+ * We're synchronized against dequeue through DISPATCHING. As @p can't
+ * be dequeued, its task_rq and cpus_allowed are stable too.
+ */
+ if (dsq_id == SCX_DSQ_LOCAL) {
+ dst_rq = rq;
+ } else if ((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON) {
+ s32 cpu = dsq_id & SCX_DSQ_LOCAL_CPU_MASK;
+
+ if (!ops_cpu_valid(cpu)) {
+ scx_ops_error("invalid cpu %d in SCX_DSQ_LOCAL_ON verdict for %s[%d]",
+ cpu, p->comm, p->pid);
+ return DTL_INVALID;
+ }
+ dst_rq = cpu_rq(cpu);
+ } else {
+ return DTL_NOT_LOCAL;
+ }
+
+ /* if dispatching to @rq that @p is already on, no lock dancing needed */
+ if (rq == src_rq && rq == dst_rq) {
+ dispatch_enqueue(&dst_rq->scx.local_dsq, p,
+ enq_flags | SCX_ENQ_CLEAR_OPSS);
+ return DTL_DISPATCHED;
+ }
+
+#ifdef CONFIG_SMP
+ if (cpumask_test_cpu(cpu_of(dst_rq), p->cpus_ptr)) {
+ struct rq *locked_dst_rq = dst_rq;
+ bool dsp;
+
+ /*
+ * @p is on a possibly remote @src_rq which we need to lock to
+ * move the task. If dequeue is in progress, it'd be locking
+ * @src_rq and waiting on DISPATCHING, so we can't grab @src_rq
+ * lock while holding DISPATCHING.
+ *
+ * As DISPATCHING guarantees that @p is wholly ours, we can
+ * pretend that we're moving from a dsq and use the same
+ * mechanism - mark the task under transfer with holding_cpu,
+ * release DISPATCHING and then follow the same protocol.
+ */
+ p->scx.holding_cpu = raw_smp_processor_id();
+
+ /* store_release ensures that dequeue sees the above */
+ atomic64_set_release(&p->scx.ops_state, SCX_OPSS_NONE);
+
+ dispatch_to_local_dsq_lock(rq, rf, src_rq, locked_dst_rq);
+
+ /*
+ * We don't require the BPF scheduler to avoid dispatching to
+ * offline CPUs mostly for convenience but also because CPUs can
+ * go offline between scx_bpf_dispatch() calls and here. If @p
+ * is destined to an offline CPU, queue it on its current CPU
+ * instead, which should always be safe. As this is an allowed
+ * behavior, don't trigger an ops error.
+ */
+ if (unlikely(!dst_rq->online))
+ dst_rq = src_rq;
+
+ if (src_rq == dst_rq) {
+ /*
+ * As @p is staying on the same rq, there's no need to
+ * go through the full deactivate/activate cycle.
+ * Optimize by abbreviating the operations in
+ * move_task_to_local_dsq().
+ */
+ dsp = p->scx.holding_cpu == raw_smp_processor_id();
+ if (likely(dsp)) {
+ p->scx.holding_cpu = -1;
+ dispatch_enqueue(&dst_rq->scx.local_dsq, p,
+ enq_flags);
+ }
+ } else {
+ dsp = move_task_to_local_dsq(dst_rq, p);
+ }
+
+ /* if the destination CPU is idle, wake it up */
+ if (dsp && sched_class_above(p->sched_class,
+ dst_rq->curr->sched_class))
+ resched_curr(dst_rq);
+
+ dispatch_to_local_dsq_unlock(rq, rf, src_rq, locked_dst_rq);
+
+ return dsp ? DTL_DISPATCHED : DTL_LOST;
+ }
+#endif /* CONFIG_SMP */
+
+ scx_ops_error("SCX_DSQ_LOCAL[_ON] verdict target cpu %d not allowed for %s[%d]",
+ cpu_of(dst_rq), p->comm, p->pid);
+ return DTL_INVALID;
+}
+
+/**
+ * finish_dispatch - Asynchronously finish dispatching a task
+ * @rq: current rq which is locked
+ * @rf: rq_flags to use when unlocking @rq
+ * @p: task to finish dispatching
+ * @qseq_at_dispatch: qseq when @p started getting dispatched
+ * @dsq_id: destination dsq ID
+ * @enq_flags: %SCX_ENQ_*
+ *
+ * Dispatching to local dsq's may need to wait for queueing to complete or
+ * require rq lock dancing. As we don't wanna do either while inside
+ * ops.dispatch() to avoid locking order inversion, we split dispatching into
+ * two parts. scx_bpf_dispatch() which is called by ops.dispatch() records the
+ * task and its qseq. Once ops.dispatch() returns, this function is called to
+ * finish up.
+ *
+ * There is no guarantee that @p is still valid for dispatching or even that it
+ * was valid in the first place. Make sure that the task is still owned by the
+ * BPF scheduler and claim the ownership before dispatching.
+ */
+static bool finish_dispatch(struct rq *rq, struct rq_flags *rf,
+ struct task_struct *p, u64 qseq_at_dispatch,
+ u64 dsq_id, u64 enq_flags)
+{
+ struct scx_dispatch_q *dsq;
+ u64 opss;
+
+retry:
+ /*
+ * No need for _acquire here. @p is accessed only after a successful
+ * try_cmpxchg to DISPATCHING.
+ */
+ opss = atomic64_read(&p->scx.ops_state);
+
+ switch (opss & SCX_OPSS_STATE_MASK) {
+ case SCX_OPSS_DISPATCHING:
+ case SCX_OPSS_NONE:
+ /* someone else already got to it */
+ return false;
+ case SCX_OPSS_QUEUED:
+ /*
+ * If qseq doesn't match, @p has gone through at least one
+ * dispatch/dequeue and re-enqueue cycle between
+ * scx_bpf_dispatch() and here and we have no claim on it.
+ */
+ if ((opss & SCX_OPSS_QSEQ_MASK) != qseq_at_dispatch)
+ return false;
+
+ /*
+ * While we know @p is accessible, we don't yet have a claim on
+ * it - the BPF scheduler is allowed to dispatch tasks
+ * spuriously and there can be a racing dequeue attempt. Let's
+ * claim @p by atomically transitioning it from QUEUED to
+ * DISPATCHING.
+ */
+ if (likely(atomic64_try_cmpxchg(&p->scx.ops_state, &opss,
+ SCX_OPSS_DISPATCHING)))
+ break;
+ goto retry;
+ case SCX_OPSS_QUEUEING:
+ /*
+ * do_enqueue_task() is in the process of transferring the task
+ * to the BPF scheduler while holding @p's rq lock. As we aren't
+ * holding any kernel or BPF resource that the enqueue path may
+ * depend upon, it's safe to wait.
+ */
+ wait_ops_state(p, opss);
+ goto retry;
+ }
+
+ BUG_ON(!(p->scx.flags & SCX_TASK_QUEUED));
+
+ switch (dispatch_to_local_dsq(rq, rf, dsq_id, p, enq_flags)) {
+ case DTL_DISPATCHED:
+ return true;
+ case DTL_LOST:
+ return false;
+ case DTL_INVALID:
+ dsq_id = SCX_DSQ_GLOBAL;
+ break;
+ case DTL_NOT_LOCAL:
+ break;
+ }
+
+ dsq = find_dsq_for_dispatch(cpu_rq(raw_smp_processor_id()), dsq_id, p);
+ dispatch_enqueue(dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
+ return false;
+}
+
+int balance_scx(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+ struct scx_rq *scx_rq = &rq->scx;
+ bool prev_on_scx = prev->sched_class == &ext_sched_class;
+
+ lockdep_assert_rq_held(rq);
+
+ if (prev_on_scx) {
+ WARN_ON_ONCE(prev->scx.flags & SCX_TASK_BAL_KEEP);
+ update_curr_scx(rq);
+
+ /*
+ * If @prev is runnable & has slice left, it has priority and
+ * fetching more just increases latency for the fetched tasks.
+ * Tell put_prev_task_scx() to put @prev on local_dsq.
+ *
+ * See scx_ops_disable_workfn() for the explanation on the
+ * disabling() test.
+ */
+ if ((prev->scx.flags & SCX_TASK_QUEUED) &&
+ prev->scx.slice > 0 && !scx_ops_disabling()) {
+ prev->scx.flags |= SCX_TASK_BAL_KEEP;
+ return 1;
+ }
+ }
+retry:
+ /* if there already are tasks to run, nothing to do */
+ if (scx_rq->local_dsq.nr)
+ return 1;
+
+ if (SCX_HAS_OP(consume)) {
+ set_consume_ctx(rq, rf);
+ scx_ops.consume(cpu_of(rq));
+ if (scx_rq->local_dsq.nr)
+ return 1;
+ } else {
+ if (consume_dispatch_q(rq, rf, &scx_dsq_global))
+ return 1;
+ }
+
+ if (SCX_HAS_OP(dispatch)) {
+ int i, nr, nr_local = 0;
+
+ *this_cpu_ptr(&dispatch_buf_cursor) = 0;
+
+ if (prev_on_scx)
+ scx_ops.dispatch(cpu_of(rq), prev);
+ else
+ scx_ops.dispatch(cpu_of(rq), NULL);
+
+ nr = this_cpu_read(dispatch_buf_cursor);
+ if (!nr) {
+ if (SCX_HAS_OP(consume_final)) {
+ set_consume_ctx(rq, rf);
+ scx_ops.consume_final(cpu_of(rq));
+ return rq->scx.local_dsq.nr > 0;
+ }
+ return 0;
+ }
+
+ for (i = 0; i < nr; i++) {
+ struct dispatch_buf_ent *ent =
+ &this_cpu_ptr(dispatch_buf)[i];
+
+ if (finish_dispatch(rq, rf, ent->task, ent->qseq,
+ ent->dsq_id, ent->enq_flags))
+ nr_local++;
+ }
+
+ if (nr_local)
+ return 1;
+ else
+ goto retry;
+ }
+
+ return 0;
+}
+
+static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first)
+{
+ if (p->scx.flags & SCX_TASK_QUEUED) {
+ WARN_ON_ONCE(atomic64_read(&p->scx.ops_state) != SCX_OPSS_NONE);
+ dispatch_dequeue(&rq->scx, p);
+ }
+
+ p->se.exec_start = rq_clock_task(rq);
+}
+
+static void put_prev_task_scx(struct rq *rq, struct task_struct *p)
+{
+ update_curr_scx(rq);
+
+ /*
+ * If we're being called from put_prev_task_balance(), balance_scx() may
+ * have decided that @p should keep running.
+ */
+ if (p->scx.flags & SCX_TASK_BAL_KEEP) {
+ p->scx.flags &= ~SCX_TASK_BAL_KEEP;
+ dispatch_enqueue(&rq->scx.local_dsq, p, SCX_ENQ_HEAD);
+ return;
+ }
+
+ if (p->scx.flags & SCX_TASK_QUEUED) {
+ /*
+ * If @p has slice left and balance_scx() didn't tag it for
+ * keeping, @p is getting preempted by a higher priority
+ * scheduler class. Leave it at the head of the local dsq.
+ */
+ if (p->scx.slice > 0 && !scx_ops_disabling()) {
+ dispatch_enqueue(&rq->scx.local_dsq, p, SCX_ENQ_HEAD);
+ return;
+ }
+
+ /*
+ * If we're in the pick_next_task path, balance_scx() should
+ * have already populated the local dsq if there are any other
+ * available tasks. If empty, tell ops.enqueue() that @p is the
+ * only one available for this cpu. ops.enqueue() should put it
+ * on the local dsq so that the subsequent pick_next_task_scx()
+ * can find the task unless it wants to trigger a separate
+ * follow-up scheduling event.
+ */
+ if (list_empty(&rq->scx.local_dsq.fifo))
+ do_enqueue_task(rq, p, SCX_ENQ_LAST | SCX_ENQ_LOCAL, -1);
+ else
+ do_enqueue_task(rq, p, 0, -1);
+ }
+}
+
+static struct task_struct *pick_task_scx(struct rq *rq)
+{
+ return list_first_entry_or_null(&rq->scx.local_dsq.fifo,
+ struct task_struct, scx.dsq_node);
+}
+
+static struct task_struct *pick_next_task_scx(struct rq *rq)
+{
+ struct task_struct *p;
+
+ p = pick_task_scx(rq);
+ if (!p)
+ return NULL;
+
+ if (unlikely(!p->scx.slice)) {
+ if (!scx_ops_disabling() && !warned_zero_slice) {
+ printk_deferred(KERN_WARNING "sched_ext: %s[%d] has zero slice in pick_next_task_scx()\n",
+ p->comm, p->pid);
+ warned_zero_slice = true;
+ }
+ p->scx.slice = SCX_SLICE_DFL;
+ }
+
+ set_next_task_scx(rq, p, true);
+
+ return p;
+}
+
+#ifdef CONFIG_SMP
+
+static bool test_and_clear_cpu_idle(int cpu)
+{
+ if (cpumask_test_and_clear_cpu(cpu, idle_masks.cpu)) {
+ if (cpumask_empty(idle_masks.cpu))
+ has_idle_cpus = false;
+ return true;
+ } else {
+ return false;
+ }
+}
+
+static int scx_pick_idle_cpu(const struct cpumask *cpus_allowed)
+{
+ int cpu;
+
+ do {
+ cpu = cpumask_any_and_distribute(idle_masks.smt, cpus_allowed);
+ if (cpu < nr_cpu_ids) {
+ const struct cpumask *sbm = topology_sibling_cpumask(cpu);
+
+ /*
+ * If offline, @cpu is not its own sibling and we can
+ * get caught in an infinite loop as @cpu is never
+ * cleared from idle_masks.smt. Clear @cpu directly in
+ * such cases.
+ */
+ if (likely(cpumask_test_cpu(cpu, sbm)))
+ cpumask_andnot(idle_masks.smt, idle_masks.smt, sbm);
+ else
+ cpumask_andnot(idle_masks.smt, idle_masks.smt, cpumask_of(cpu));
+ } else {
+ cpu = cpumask_any_and_distribute(idle_masks.cpu, cpus_allowed);
+ if (cpu >= nr_cpu_ids)
+ return -EBUSY;
+ }
+ } while (!test_and_clear_cpu_idle(cpu));
+
+ return cpu;
+}
+
+static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags)
+{
+ s32 cpu;
+
+ if (!static_branch_likely(&scx_builtin_idle_enabled)) {
+ scx_ops_error("built-in idle tracking is disabled");
+ return prev_cpu;
+ }
+
+ /*
+ * If WAKE_SYNC and the machine isn't fully saturated, wake up @p to the
+ * local dsq of the waker.
+ */
+ if ((wake_flags & SCX_WAKE_SYNC) && p->nr_cpus_allowed > 1 &&
+ has_idle_cpus && !(current->flags & PF_EXITING)) {
+ cpu = smp_processor_id();
+ if (cpumask_test_cpu(cpu, p->cpus_ptr)) {
+ p->scx.flags |= SCX_TASK_ENQ_LOCAL;
+ return cpu;
+ }
+ }
+
+ /* if the previous CPU is idle, dispatch directly to it */
+ if (test_and_clear_cpu_idle(prev_cpu)) {
+ p->scx.flags |= SCX_TASK_ENQ_LOCAL;
+ return prev_cpu;
+ }
+
+ if (p->nr_cpus_allowed == 1)
+ return prev_cpu;
+
+ cpu = scx_pick_idle_cpu(p->cpus_ptr);
+ if (cpu >= 0) {
+ p->scx.flags |= SCX_TASK_ENQ_LOCAL;
+ return cpu;
+ }
+
+ return prev_cpu;
+}
+
+static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flags)
+{
+ if (SCX_HAS_OP(select_cpu)) {
+ s32 cpu;
+
+ cpu = scx_ops.select_cpu(p, prev_cpu, wake_flags);
+ if (ops_cpu_valid(cpu)) {
+ return cpu;
+ } else {
+ scx_ops_error("select_cpu returned invalid cpu %d", cpu);
+ return prev_cpu;
+ }
+ } else {
+ return scx_select_cpu_dfl(p, prev_cpu, wake_flags);
+ }
+}
+
+static void set_cpus_allowed_scx(struct task_struct *p,
+ const struct cpumask *new_mask, u32 flags)
+{
+ set_cpus_allowed_common(p, new_mask, flags);
+
+ /*
+ * The effective cpumask is stored in @p->cpus_ptr which may temporarily
+ * differ from the configured one in @p->cpus_mask. Always tell the BPF
+ * scheduler the effective one.
+ *
+ * Fine-grained memory write control is enforced by BPF making the const
+ * designation pointless. Cast it away when calling the operation.
+ */
+ if (SCX_HAS_OP(set_cpumask))
+ scx_ops.set_cpumask(p, (struct cpumask *)p->cpus_ptr);
+}
+
+static void reset_idle_masks(void)
+{
+ /* consider all cpus idle, should converge to the actual state quickly */
+ cpumask_setall(idle_masks.cpu);
+ cpumask_setall(idle_masks.smt);
+ has_idle_cpus = true;
+}
+
+void __scx_update_idle(struct rq *rq, bool idle)
+{
+ int cpu = cpu_of(rq);
+ struct cpumask *sib_mask = topology_sibling_cpumask(cpu);
+
+ if (SCX_HAS_OP(update_idle)) {
+ scx_ops.update_idle(cpu_of(rq), idle);
+ if (!static_branch_unlikely(&scx_builtin_idle_enabled))
+ return;
+ }
+
+ if (idle) {
+ cpumask_set_cpu(cpu, idle_masks.cpu);
+ if (!has_idle_cpus)
+ has_idle_cpus = true;
+
+ /*
+ * idle_masks.smt handling is racy but that's fine as it's only
+ * for optimization and self-correcting.
+ */
+ for_each_cpu(cpu, sib_mask) {
+ if (!cpumask_test_cpu(cpu, idle_masks.cpu))
+ return;
+ }
+ cpumask_or(idle_masks.smt, idle_masks.smt, sib_mask);
+ } else {
+ cpumask_clear_cpu(cpu, idle_masks.cpu);
+ if (has_idle_cpus && cpumask_empty(idle_masks.cpu))
+ has_idle_cpus = false;
+
+ cpumask_andnot(idle_masks.smt, idle_masks.smt, sib_mask);
+ }
+}
+
+#else /* !CONFIG_SMP */
+
+static bool test_and_clear_cpu_idle(int cpu) { return false; }
+static int scx_pick_idle_cpu(const struct cpumask *cpus_allowed) { return -EBUSY; }
+static void reset_idle_masks(void) {}
+
+#endif /* CONFIG_SMP */
+
+static void task_tick_scx(struct rq *rq, struct task_struct *curr, int queued)
+{
+ update_curr_scx(rq);
+
+ /* always resched while disabling as we can't trust the slice */
+ if (!curr->scx.slice || scx_ops_disabling())
+ resched_curr(rq);
+}
+
+static int scx_ops_prepare_task(struct task_struct *p, struct task_group *tg)
+{
+ int ret;
+
+ WARN_ON_ONCE(p->scx.flags & SCX_TASK_OPS_PREPPED);
+
+ if (SCX_HAS_OP(prep_enable)) {
+ struct scx_enable_args args = { };
+
+ ret = scx_ops.prep_enable(p, &args);
+ if (unlikely(ret)) {
+ ret = ops_sanitize_err("prep_enable", ret);
+ return ret;
+ }
+ }
+
+ p->scx.flags |= SCX_TASK_OPS_PREPPED;
+ return 0;
+}
+
+static void scx_ops_enable_task(struct task_struct *p)
+{
+ lockdep_assert_rq_held(task_rq(p));
+ WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_OPS_PREPPED));
+
+ if (SCX_HAS_OP(enable)) {
+ struct scx_enable_args args = { };
+ scx_ops.enable(p, &args);
+ }
+ p->scx.flags &= ~SCX_TASK_OPS_PREPPED;
+ p->scx.flags |= SCX_TASK_OPS_ENABLED;
+}
+
+static void scx_ops_disable_task(struct task_struct *p)
+{
+ lockdep_assert_rq_held(task_rq(p));
+
+ if (p->scx.flags & SCX_TASK_OPS_PREPPED) {
+ if (SCX_HAS_OP(cancel_enable)) {
+ struct scx_enable_args args = { };
+ scx_ops.cancel_enable(p, &args);
+ }
+ p->scx.flags &= ~SCX_TASK_OPS_PREPPED;
+ } else if (p->scx.flags & SCX_TASK_OPS_ENABLED) {
+ if (SCX_HAS_OP(disable))
+ scx_ops.disable(p);
+ p->scx.flags &= ~SCX_TASK_OPS_ENABLED;
+ }
+}
+
+/**
+ * refresh_scx_weight - Refresh a task's ext weight
+ * @p: task to refresh ext weight for
+ *
+ * @p->scx.weight carries the task's static priority in cgroup weight scale to
+ * enable easy access from the BPF scheduler. To keep it synchronized with the
+ * current task priority, this function should be called when a new task is
+ * created, priority is changed for a task on sched_ext, and a task is switched
+ * to sched_ext from other classes.
+ */
+static void refresh_scx_weight(struct task_struct *p)
+{
+ u32 weight = sched_prio_to_weight[p->static_prio - MAX_RT_PRIO];
+
+ p->scx.weight = sched_weight_to_cgroup(weight);
+}
+
+void scx_pre_fork(struct task_struct *p)
+{
+ /*
+ * BPF scheduler enable/disable paths want to be able to iterate and
+ * update all tasks which can become complex when racing forks. As
+ * enable/disable are very cold paths, let's use a percpu_rwsem to
+ * exclude forks.
+ */
+ percpu_down_read(&scx_fork_rwsem);
+}
+
+int scx_fork(struct task_struct *p)
+{
+ percpu_rwsem_assert_held(&scx_fork_rwsem);
+
+ if (scx_enabled())
+ return scx_ops_prepare_task(p, task_group(p));
+ else
+ return 0;
+}
+
+void scx_post_fork(struct task_struct *p)
+{
+ refresh_scx_weight(p);
+
+ if (scx_enabled()) {
+ struct rq_flags rf;
+ struct rq *rq;
+
+ rq = task_rq_lock(p, &rf);
+ scx_ops_enable_task(p);
+ task_rq_unlock(rq, p, &rf);
+ }
+
+ spin_lock_irq(&scx_tasks_lock);
+ list_add_tail(&p->scx.tasks_node, &scx_tasks);
+ spin_unlock_irq(&scx_tasks_lock);
+
+ percpu_up_read(&scx_fork_rwsem);
+}
+
+void scx_cancel_fork(struct task_struct *p)
+{
+ if (scx_enabled())
+ scx_ops_disable_task(p);
+ percpu_up_read(&scx_fork_rwsem);
+}
+
+void sched_ext_free(struct task_struct *p)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&scx_tasks_lock, flags);
+ list_del_init(&p->scx.tasks_node);
+ spin_unlock_irqrestore(&scx_tasks_lock, flags);
+
+ /*
+ * @p is off scx_tasks and wholly ours. scx_ops_enable()'s PREPPED ->
+ * ENABLED transitions can't race us. Disable ops for @p.
+ */
+ if (p->scx.flags & (SCX_TASK_OPS_PREPPED | SCX_TASK_OPS_ENABLED)) {
+ struct rq_flags rf;
+ struct rq *rq;
+
+ rq = task_rq_lock(p, &rf);
+ scx_ops_disable_task(p);
+ task_rq_unlock(rq, p, &rf);
+ }
+}
+
+static void reweight_task_scx(struct rq *rq, struct task_struct *p, int newprio)
+{
+ refresh_scx_weight(p);
+}
+
+static void prio_changed_scx(struct rq *rq, struct task_struct *p, int oldprio)
+{
+}
+
+static void switching_to_scx(struct rq *rq, struct task_struct *p)
+{
+ refresh_scx_weight(p);
+
+ /*
+ * set_cpus_allowed_scx() is not called while @p is associated with a
+ * different scheduler class. Keep the BPF scheduler up-to-date.
+ */
+ if (SCX_HAS_OP(set_cpumask))
+ scx_ops.set_cpumask(p, (struct cpumask *)p->cpus_ptr);
+}
+
+static void check_preempt_curr_scx(struct rq *rq, struct task_struct *p, int wake_flags) {}
+static void switched_to_scx(struct rq *rq, struct task_struct *p) {}
+
+/*
+ * Omitted operations:
+ *
+ * - check_preempt_curr: NOOP as it isn't useful in the wakeup path because the
+ * task isn't tied to the CPU at that point.
+ *
+ * - migrate_task_rq: Unnecessary as task to cpu mapping is transient.
+ *
+ * - task_fork/dead: We need fork/dead notifications for all tasks regardless of
+ * their current sched_class. Call them directly from sched core instead.
+ *
+ * - task_woken, switched_from: Unnecessary.
+ */
+DEFINE_SCHED_CLASS(ext) = {
+ .enqueue_task = enqueue_task_scx,
+ .dequeue_task = dequeue_task_scx,
+ .yield_task = yield_task_scx,
+ .yield_to_task = yield_to_task_scx,
+
+ .check_preempt_curr = check_preempt_curr_scx,
+
+ .pick_next_task = pick_next_task_scx,
+
+ .put_prev_task = put_prev_task_scx,
+ .set_next_task = set_next_task_scx,
+
+#ifdef CONFIG_SMP
+ .balance = balance_scx,
+ .select_task_rq = select_task_rq_scx,
+
+ .pick_task = pick_task_scx,
+
+ .set_cpus_allowed = set_cpus_allowed_scx,
+#endif
+
+ .task_tick = task_tick_scx,
+
+ .switching_to = switching_to_scx,
+ .switched_to = switched_to_scx,
+ .reweight_task = reweight_task_scx,
+ .prio_changed = prio_changed_scx,
+
+ .update_curr = update_curr_scx,
+
+#ifdef CONFIG_UCLAMP_TASK
+ .uclamp_enabled = 0,
+#endif
+};
+
+static void init_dsq(struct scx_dispatch_q *dsq, u64 dsq_id)
+{
+ memset(dsq, 0, sizeof(*dsq));
+
+ raw_spin_lock_init(&dsq->lock);
+ INIT_LIST_HEAD(&dsq->fifo);
+ dsq->id = dsq_id;
+}
+
+static struct scx_dispatch_q *create_dsq(u64 dsq_id, int node)
+{
+ struct scx_dispatch_q *dsq;
+ int ret;
+
+ if (dsq_id & SCX_DSQ_FLAG_BUILTIN)
+ return ERR_PTR(-EINVAL);
+
+ dsq = kmalloc_node(sizeof(*dsq), GFP_KERNEL, node);
+ if (!dsq)
+ return ERR_PTR(-ENOMEM);
+
+ init_dsq(dsq, dsq_id);
+
+ raw_spin_lock_irq(&all_dsqs_lock);
+ ret = rhashtable_insert_fast(&dsq_hash, &dsq->hash_node,
+ dsq_hash_params);
+ if (!ret) {
+ list_add_tail_rcu(&dsq->all_node, &all_dsqs);
+ } else {
+ kfree(dsq);
+ dsq = ERR_PTR(ret);
+ }
+ raw_spin_unlock_irq(&all_dsqs_lock);
+ return dsq;
+}
+
+static void free_dsq_irq_workfn(struct irq_work *irq_work)
+{
+ struct llist_node *to_free = llist_del_all(&dsqs_to_free);
+ struct scx_dispatch_q *dsq, *tmp_dsq;
+
+ llist_for_each_entry_safe(dsq, tmp_dsq, to_free, free_node)
+ kfree_rcu(dsq);
+}
+
+static DEFINE_IRQ_WORK(free_dsq_irq_work, free_dsq_irq_workfn);
+
+static void destroy_dsq(u64 dsq_id)
+{
+ struct scx_dispatch_q *dsq;
+ unsigned long flags;
+
+ rcu_read_lock();
+
+ dsq = rhashtable_lookup_fast(&dsq_hash, &dsq_id, dsq_hash_params);
+ if (!dsq)
+ goto out_unlock_rcu;
+
+ raw_spin_lock_irqsave(&all_dsqs_lock, flags);
+ raw_spin_lock(&dsq->lock);
+
+ if (dsq->nr) {
+ scx_ops_error("attempting to destroy in-use dsq 0x%016llx (nr=%u)",
+ dsq->id, dsq->nr);
+ goto out_unlock_dsq;
+ }
+
+ if (rhashtable_remove_fast(&dsq_hash, &dsq->hash_node, dsq_hash_params))
+ goto out_unlock_dsq;
+
+ /*
+ * Mark dead by invalidating ->id to prevent dispatch_enqueue() from
+ * queueing more tasks. As this function can be called from anywhere,
+ * freeing is bounced through an irq work to avoid nesting RCU
+ * operations inside scheduler locks.
+ */
+ dsq->id = SCX_DSQ_INVALID;
+ list_del_rcu(&dsq->all_node);
+ llist_add(&dsq->free_node, &dsqs_to_free);
+ irq_work_queue(&free_dsq_irq_work);
+
+out_unlock_dsq:
+ raw_spin_unlock(&dsq->lock);
+ raw_spin_unlock_irqrestore(&all_dsqs_lock, flags);
+out_unlock_rcu:
+ rcu_read_unlock();
+}
+
+/*
+ * Used by sched_fork() and __setscheduler_prio() to pick the matching
+ * sched_class. dl/rt are already handled.
+ */
+bool task_on_scx(struct task_struct *p)
+{
+ if (!scx_enabled() || scx_ops_disabling())
+ return false;
+ return p->policy == SCHED_EXT;
+}
+
+static void scx_ops_fallback_enqueue(struct task_struct *p, u64 enq_flags)
+{
+ if (enq_flags & SCX_ENQ_LAST)
+ scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, enq_flags);
+ else
+ scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
+}
+
+static void scx_ops_fallback_consume(s32 cpu)
+{
+ struct consume_ctx *cctx = this_cpu_ptr(&consume_ctx);
+
+ consume_dispatch_q(cctx->rq, cctx->rf, &scx_dsq_global);
+}
+
+static void reset_dispatch_free_dsq_fn(void *ptr, void *arg)
+{
+ struct scx_dispatch_q *dsq = ptr;
+
+ WARN_ON_ONCE(dsq->nr || !list_empty(&dsq->fifo));
+ kfree(dsq);
+}
+
+static void scx_ops_disable_workfn(struct kthread_work *work)
+{
+ struct scx_exit_info *ei = &scx_exit_info;
+ struct scx_task_iter sti;
+ struct task_struct *p;
+ const char *reason;
+ int i, type;
+
+ type = atomic_read(&scx_exit_type);
+ while (true) {
+ /*
+ * NONE indicates that a new scx_ops has been registered since
+ * disable was scheduled - don't kill the new ops. DONE
+ * indicates that the ops has already been disabled.
+ */
+ if (type == SCX_EXIT_NONE || type == SCX_EXIT_DONE)
+ return;
+ if (atomic_try_cmpxchg(&scx_exit_type, &type, SCX_EXIT_DONE))
+ break;
+ }
+
+ switch (type) {
+ case SCX_EXIT_UNREG:
+ reason = "BPF scheduler unregistered";
+ break;
+ case SCX_EXIT_ERROR:
+ reason = "runtime error";
+ break;
+ case SCX_EXIT_ERROR_BPF:
+ reason = "scx_bpf_error";
+ break;
+ default:
+ reason = "<UNKNOWN>";
+ }
+
+ ei->type = type;
+ strlcpy(ei->reason, reason, sizeof(ei->reason));
+
+ switch (scx_ops_set_enable_state(SCX_OPS_DISABLING)) {
+ case SCX_OPS_DISABLED:
+ pr_warn("sched_ext: ops error detected without ops (%s)\n",
+ scx_exit_info.msg);
+ WARN_ON_ONCE(scx_ops_set_enable_state(SCX_OPS_DISABLED) !=
+ SCX_OPS_DISABLING);
+ return;
+ case SCX_OPS_PREPPING:
+ goto forward_progress_guaranteed;
+ case SCX_OPS_DISABLING:
+ /* shouldn't happen but handle it like ENABLING if it does */
+ WARN_ONCE(true, "sched_ext: duplicate disabling instance?");
+ fallthrough;
+ case SCX_OPS_ENABLING:
+ case SCX_OPS_ENABLED:
+ break;
+ }
+
+ /*
+ * DISABLING is set and ops was either ENABLING or ENABLED indicating
+ * that the ops and static branches are set.
+ *
+ * We must guarantee that all runnable tasks make forward progress
+ * without trusting the BPF scheduler. We can't grab any mutexes or
+ * rwsems as they might be held by tasks that the BPF scheduler is
+ * forgetting to run, which unfortunately also excludes toggling the
+ * static branches.
+ *
+ * Let's work around by overriding a couple ops and modifying behaviors
+ * based on the DISABLING state and then cycling the tasks through
+ * dequeue/enqueue to force global FIFO scheduling.
+ *
+ * a. ops.enqueue() and .consume() are overridden for simple global FIFO
+ * scheduling.
+ *
+ * b. balance_scx() never sets %SCX_TASK_BAL_KEEP as the slice value
+ * can't be trusted. Whenever a tick triggers, the running task is
+ * rotated to the tail of the queue.
+ *
+ * c. pick_next_task() suppresses zero slice warning.
+ */
+ scx_ops.enqueue = scx_ops_fallback_enqueue;
+ scx_ops.consume = scx_ops_fallback_consume;
+
+ spin_lock_irq(&scx_tasks_lock);
+ scx_task_iter_init(&sti);
+ while ((p = scx_task_iter_next_filtered_locked(&sti))) {
+ if (READ_ONCE(p->__state) != TASK_DEAD) {
+ struct sched_enq_and_set_ctx ctx;
+
+ sched_deq_and_put_task(p, DEQUEUE_SAVE | DEQUEUE_MOVE,
+ &ctx);
+ sched_enq_and_set_task(&ctx);
+ }
+ }
+ scx_task_iter_exit(&sti);
+ spin_unlock_irq(&scx_tasks_lock);
+
+forward_progress_guaranteed:
+ /*
+ * Here, every runnable task is guaranteed to make forward progress and
+ * we can safely use blocking synchronization constructs. Actually
+ * disable ops.
+ */
+ mutex_lock(&scx_ops_enable_mutex);
+
+ /* avoid racing against fork */
+ cpus_read_lock();
+ percpu_down_write(&scx_fork_rwsem);
+
+ spin_lock_irq(&scx_tasks_lock);
+ scx_task_iter_init(&sti);
+ while ((p = scx_task_iter_next_filtered_locked(&sti))) {
+ const struct sched_class *old_class = p->sched_class;
+ struct sched_enq_and_set_ctx ctx;
+ bool alive = READ_ONCE(p->__state) != TASK_DEAD;
+
+ sched_deq_and_put_task(p, DEQUEUE_SAVE | DEQUEUE_MOVE, &ctx);
+ p->scx.slice = min_t(u64, p->scx.slice, SCX_SLICE_DFL);
+
+ __setscheduler_prio(p, p->prio);
+ if (alive)
+ check_class_changing(task_rq(p), p, old_class);
+
+ sched_enq_and_set_task(&ctx);
+
+ if (alive)
+ check_class_changed(task_rq(p), p, old_class, p->prio);
+
+ scx_ops_disable_task(p);
+ }
+ scx_task_iter_exit(&sti);
+ spin_unlock_irq(&scx_tasks_lock);
+
+ /* no task is on scx, turn off all the switches and flush in-progress calls */
+ static_branch_disable_cpuslocked(&__scx_ops_enabled);
+ for (i = 0; i < SCX_NR_ONLINE_OPS; i++)
+ static_branch_disable_cpuslocked(&scx_has_op[i]);
+ static_branch_disable_cpuslocked(&scx_ops_enq_last);
+ static_branch_disable_cpuslocked(&scx_ops_enq_exiting);
+ static_branch_disable_cpuslocked(&scx_builtin_idle_enabled);
+ synchronize_rcu();
+
+ percpu_up_write(&scx_fork_rwsem);
+ cpus_read_unlock();
+
+ if (ei->type >= SCX_EXIT_ERROR) {
+ printk(KERN_ERR "sched_ext: BPF scheduler \"%s\" errored, disabling\n", scx_ops.name);
+
+ if (ei->msg[0] == '\0')
+ printk(KERN_ERR "sched_ext: %s\n", ei->reason);
+ else
+ printk(KERN_ERR "sched_ext: %s (%s)\n", ei->reason, ei->msg);
+
+ stack_trace_print(ei->bt, ei->bt_len, 2);
+ }
+
+ if (scx_ops.exit)
+ scx_ops.exit(ei);
+
+ memset(&scx_ops, 0, sizeof(scx_ops));
+
+ rhashtable_free_and_destroy(&dsq_hash, reset_dispatch_free_dsq_fn, NULL);
+ INIT_LIST_HEAD(&all_dsqs);
+ free_percpu(dispatch_buf);
+ dispatch_buf = NULL;
+ dispatch_max_batch = 0;
+
+ mutex_unlock(&scx_ops_enable_mutex);
+
+ WARN_ON_ONCE(scx_ops_set_enable_state(SCX_OPS_DISABLED) !=
+ SCX_OPS_DISABLING);
+}
+
+static DEFINE_KTHREAD_WORK(scx_ops_disable_work, scx_ops_disable_workfn);
+
+static void schedule_scx_ops_disable_work(void)
+{
+ struct kthread_worker *helper = READ_ONCE(scx_ops_helper);
+
+ /*
+ * We may be called spuriously before the first bpf_sched_ext_reg(). If
+ * scx_ops_helper isn't set up yet, there's nothing to do.
+ */
+ if (helper)
+ kthread_queue_work(helper, &scx_ops_disable_work);
+}
+
+static void scx_ops_disable(enum scx_exit_type type)
+{
+ int none = SCX_EXIT_NONE;
+
+ if (WARN_ON_ONCE(type == SCX_EXIT_NONE || type == SCX_EXIT_DONE))
+ type = SCX_EXIT_ERROR;
+
+ atomic_try_cmpxchg(&scx_exit_type, &none, type);
+
+ schedule_scx_ops_disable_work();
+}
+
+static void scx_ops_error_irq_workfn(struct irq_work *irq_work)
+{
+ schedule_scx_ops_disable_work();
+}
+
+static DEFINE_IRQ_WORK(scx_ops_error_irq_work, scx_ops_error_irq_workfn);
+
+__printf(2, 3) static void scx_ops_error_type(enum scx_exit_type type,
+ const char *fmt, ...)
+{
+ struct scx_exit_info *ei = &scx_exit_info;
+ int none = SCX_EXIT_NONE;
+ va_list args;
+
+ if (!atomic_try_cmpxchg(&scx_exit_type, &none, type))
+ return;
+
+ ei->bt_len = stack_trace_save(ei->bt, ARRAY_SIZE(ei->bt), 1);
+
+ va_start(args, fmt);
+ vscnprintf(ei->msg, ARRAY_SIZE(ei->msg), fmt, args);
+ va_end(args);
+
+ irq_work_queue(&scx_ops_error_irq_work);
+}
+
+static struct kthread_worker *scx_create_rt_helper(const char *name)
+{
+ struct kthread_worker *helper;
+
+ helper = kthread_create_worker(0, name);
+ if (!IS_ERR(helper))
+ sched_set_fifo(helper->task);
+ return IS_ERR(helper) ? NULL : helper;
+}
+
+static int scx_ops_enable(struct sched_ext_ops *ops)
+{
+ struct scx_task_iter sti;
+ struct task_struct *p;
+ int i, ret;
+
+ mutex_lock(&scx_ops_enable_mutex);
+
+ if (!scx_ops_helper) {
+ WRITE_ONCE(scx_ops_helper,
+ scx_create_rt_helper("sched_ext_ops_helper"));
+ if (!scx_ops_helper) {
+ ret = -ENOMEM;
+ goto err_unlock;
+ }
+ }
+
+ if (scx_ops_enable_state() != SCX_OPS_DISABLED) {
+ ret = -EBUSY;
+ goto err_unlock;
+ }
+
+ ret = rhashtable_init(&dsq_hash, &dsq_hash_params);
+ if (ret)
+ goto err_unlock;
+
+ /*
+ * Set scx_ops, transition to PREPPING and clear exit info to arm the
+ * disable path. Failure triggers full disabling from here on.
+ */
+ scx_ops = *ops;
+
+ WARN_ON_ONCE(scx_ops_set_enable_state(SCX_OPS_PREPPING) !=
+ SCX_OPS_DISABLED);
+
+ memset(&scx_exit_info, 0, sizeof(scx_exit_info));
+ atomic_set(&scx_exit_type, SCX_EXIT_NONE);
+ warned_zero_slice = false;
+
+ atomic64_set(&scx_nr_rejected, 0);
+
+ /*
+ * Keep CPUs stable during enable so that the BPF scheduler can track
+ * online CPUs by watching ->on/offline_cpu() after ->init().
+ */
+ cpus_read_lock();
+
+ if (scx_ops.init) {
+ ret = scx_ops.init();
+
+ if (ret) {
+ ret = ops_sanitize_err("init", ret);
+ goto err_disable;
+ }
+
+ /*
+ * Exit early if ops.init() triggered scx_bpf_error(). Not
+ * strictly necessary as we'll fail transitioning into ENABLING
+ * later but that'd be after calling ops.prep_enable() on all
+ * tasks and with -EBUSY which isn't very intuitive. Let's exit
+ * early with success so that the condition is notified through
+ * ops.exit() like other scx_bpf_error() invocations.
+ */
+ if (atomic_read(&scx_exit_type) != SCX_EXIT_NONE)
+ goto err_disable;
+ }
+
+ WARN_ON_ONCE(dispatch_buf);
+ dispatch_max_batch = ops->dispatch_max_batch ?: SCX_DSP_DFL_MAX_BATCH;
+ dispatch_buf = __alloc_percpu(sizeof(dispatch_buf[0]) * dispatch_max_batch,
+ __alignof__(dispatch_buf[0]));
+ if (!dispatch_buf) {
+ ret = -ENOMEM;
+ goto err_disable;
+ }
+
+ /*
+ * Lock out forks before opening the floodgate so that they don't wander
+ * into the operations prematurely.
+ */
+ percpu_down_write(&scx_fork_rwsem);
+
+ for (i = 0; i < SCX_NR_ONLINE_OPS; i++)
+ if (((void (**)(void))ops)[i])
+ static_branch_enable_cpuslocked(&scx_has_op[i]);
+
+ if (ops->flags & SCX_OPS_ENQ_LAST)
+ static_branch_enable_cpuslocked(&scx_ops_enq_last);
+
+ if (ops->flags & SCX_OPS_ENQ_EXITING)
+ static_branch_enable_cpuslocked(&scx_ops_enq_exiting);
+
+ if (!ops->update_idle || (ops->flags & SCX_OPS_KEEP_BUILTIN_IDLE)) {
+ reset_idle_masks();
+ static_branch_enable_cpuslocked(&scx_builtin_idle_enabled);
+ } else {
+ static_branch_disable_cpuslocked(&scx_builtin_idle_enabled);
+ }
+
+ static_branch_enable_cpuslocked(&__scx_ops_enabled);
+
+ /*
+ * Enable ops for every task. Fork is excluded by scx_fork_rwsem
+ * preventing new tasks from being added. No need to exclude tasks
+ * leaving as sched_ext_free() can handle both prepped and enabled
+ * tasks. Prep all tasks first and then enable them with preemption
+ * disabled.
+ */
+ spin_lock_irq(&scx_tasks_lock);
+
+ scx_task_iter_init(&sti);
+ while ((p = scx_task_iter_next_filtered(&sti))) {
+ get_task_struct(p);
+ spin_unlock_irq(&scx_tasks_lock);
+
+ ret = scx_ops_prepare_task(p, task_group(p));
+ if (ret) {
+ put_task_struct(p);
+ spin_lock_irq(&scx_tasks_lock);
+ scx_task_iter_exit(&sti);
+ spin_unlock_irq(&scx_tasks_lock);
+ pr_err("sched_ext: ops.prep_enable() failed (%d) for %s[%d] while loading\n",
+ ret, p->comm, p->pid);
+ goto err_disable_unlock;
+ }
+
+ put_task_struct(p);
+ spin_lock_irq(&scx_tasks_lock);
+ }
+ scx_task_iter_exit(&sti);
+
+ /*
+ * All tasks are prepped but are still ops-disabled. Ensure that
+ * %current can't be scheduled out and switch everyone.
+ * preempt_disable() is necessary because we can't guarantee that
+ * %current won't be starved if scheduled out while switching.
+ */
+ preempt_disable();
+
+ /*
+ * From here on, the disable path must assume that tasks have ops
+ * enabled and need to be recovered.
+ */
+ if (!scx_ops_tryset_enable_state(SCX_OPS_ENABLING, SCX_OPS_PREPPING)) {
+ preempt_enable();
+ spin_unlock_irq(&scx_tasks_lock);
+ ret = -EBUSY;
+ goto err_disable_unlock;
+ }
+
+ /*
+ * We're fully committed and can't fail. The PREPPED -> ENABLED
+ * transitions here are synchronized against sched_ext_free() through
+ * scx_tasks_lock.
+ */
+ scx_task_iter_init(&sti);
+ while ((p = scx_task_iter_next_filtered_locked(&sti))) {
+ if (READ_ONCE(p->__state) != TASK_DEAD) {
+ const struct sched_class *old_class = p->sched_class;
+ struct sched_enq_and_set_ctx ctx;
+
+ sched_deq_and_put_task(p, DEQUEUE_SAVE | DEQUEUE_MOVE,
+ &ctx);
+ scx_ops_enable_task(p);
+
+ __setscheduler_prio(p, p->prio);
+ check_class_changing(task_rq(p), p, old_class);
+
+ sched_enq_and_set_task(&ctx);
+
+ check_class_changed(task_rq(p), p, old_class, p->prio);
+ } else {
+ scx_ops_disable_task(p);
+ }
+ }
+ scx_task_iter_exit(&sti);
+
+ spin_unlock_irq(&scx_tasks_lock);
+ preempt_enable();
+ percpu_up_write(&scx_fork_rwsem);
+
+ if (!scx_ops_tryset_enable_state(SCX_OPS_ENABLED, SCX_OPS_ENABLING)) {
+ ret = -EBUSY;
+ goto err_disable;
+ }
+
+ cpus_read_unlock();
+ mutex_unlock(&scx_ops_enable_mutex);
+
+ return 0;
+
+err_unlock:
+ mutex_unlock(&scx_ops_enable_mutex);
+ return ret;
+
+err_disable_unlock:
+ percpu_up_write(&scx_fork_rwsem);
+err_disable:
+ cpus_read_unlock();
+ mutex_unlock(&scx_ops_enable_mutex);
+ /* must be fully disabled before returning */
+ scx_ops_disable(SCX_EXIT_ERROR);
+ kthread_flush_work(&scx_ops_disable_work);
+ return ret;
+}
+
+#ifdef CONFIG_SCHED_DEBUG
+static const char *scx_ops_enable_state_str[] = {
+ [SCX_OPS_PREPPING] = "prepping",
+ [SCX_OPS_ENABLING] = "enabling",
+ [SCX_OPS_ENABLED] = "enabled",
+ [SCX_OPS_DISABLING] = "disabling",
+ [SCX_OPS_DISABLED] = "disabled",
+};
+
+static int scx_debug_show(struct seq_file *m, void *v)
+{
+ mutex_lock(&scx_ops_enable_mutex);
+ seq_printf(m, "%-30s: %s\n", "ops", scx_ops.name);
+ seq_printf(m, "%-30s: %ld\n", "enabled", scx_enabled());
+ seq_printf(m, "%-30s: %s\n", "enable_state",
+ scx_ops_enable_state_str[scx_ops_enable_state()]);
+ seq_printf(m, "%-30s: %llu\n", "nr_rejected",
+ atomic64_read(&scx_nr_rejected));
+ mutex_unlock(&scx_ops_enable_mutex);
+ return 0;
+}
+
+static int scx_debug_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, scx_debug_show, NULL);
+}
+
+const struct file_operations sched_ext_fops = {
+ .open = scx_debug_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+#endif
+
+/********************************************************************************
+ * bpf_struct_ops plumbing.
+ */
+#include <linux/bpf_verifier.h>
+#include <linux/bpf.h>
+#include <linux/btf.h>
+
+extern struct btf *btf_vmlinux;
+static const struct btf_type *task_struct_type;
+
+static bool bpf_scx_is_valid_access(int off, int size,
+ enum bpf_access_type type,
+ const struct bpf_prog *prog,
+ struct bpf_insn_access_aux *info)
+{
+ if (off < 0 || off >= sizeof(__u64) * MAX_BPF_FUNC_ARGS)
+ return false;
+ if (type != BPF_READ)
+ return false;
+ if (off % size != 0)
+ return false;
+
+ return btf_ctx_access(off, size, type, prog, info);
+}
+
+static int bpf_scx_btf_struct_access(struct bpf_verifier_log *log,
+ const struct btf *btf,
+ const struct btf_type *t, int off,
+ int size, enum bpf_access_type atype,
+ u32 *next_btf_id, enum bpf_type_flag *flag)
+{
+ if (t == task_struct_type) {
+ if (off >= offsetof(struct task_struct, scx.slice) &&
+ off + size <= offsetofend(struct task_struct, scx.slice))
+ return SCALAR_VALUE;
+ }
+
+ if (atype == BPF_READ)
+ return btf_struct_access(log, btf, t, off, size, atype,
+ next_btf_id, flag);
+
+ bpf_log(log, "only read is supported\n");
+ return -EACCES;
+}
+
+static const struct bpf_func_proto *
+bpf_scx_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+ switch (func_id) {
+ case BPF_FUNC_task_storage_get:
+ return &bpf_task_storage_get_proto;
+ case BPF_FUNC_task_storage_delete:
+ return &bpf_task_storage_delete_proto;
+ default:
+ return bpf_base_func_proto(func_id);
+ }
+}
+
+const struct bpf_verifier_ops bpf_scx_verifier_ops = {
+ .get_func_proto = bpf_scx_get_func_proto,
+ .is_valid_access = bpf_scx_is_valid_access,
+ .btf_struct_access = bpf_scx_btf_struct_access,
+};
+
+static int bpf_scx_init_member(const struct btf_type *t,
+ const struct btf_member *member,
+ void *kdata, const void *udata)
+{
+ const struct sched_ext_ops *uops = udata;
+ struct sched_ext_ops *ops = kdata;
+ u32 moff = __btf_member_bit_offset(t, member) / 8;
+ int ret;
+
+ switch (moff) {
+ case offsetof(struct sched_ext_ops, dispatch_max_batch):
+ if (*(u32 *)(udata + moff) > INT_MAX)
+ return -E2BIG;
+ ops->dispatch_max_batch = *(u32 *)(udata + moff);
+ return 1;
+ case offsetof(struct sched_ext_ops, flags):
+ if (*(u64 *)(udata + moff) & ~SCX_OPS_ALL_FLAGS)
+ return -EINVAL;
+ ops->flags = *(u64 *)(udata + moff);
+ return 1;
+ case offsetof(struct sched_ext_ops, name):
+ ret = bpf_obj_name_cpy(ops->name, uops->name,
+ sizeof(ops->name));
+ if (ret < 0)
+ return ret;
+ if (ret == 0)
+ return -EINVAL;
+ return 1;
+ }
+
+ return 0;
+}
+
+static int bpf_scx_check_member(const struct btf_type *t,
+ const struct btf_member *member,
+ struct bpf_prog *prog)
+{
+ u32 moff = __btf_member_bit_offset(t, member) / 8;
+
+ switch (moff) {
+ case offsetof(struct sched_ext_ops, prep_enable):
+ case offsetof(struct sched_ext_ops, init):
+ case offsetof(struct sched_ext_ops, exit):
+ /*
+ * FIXME - libbpf should be updated to support struct_ops
+ * operations to be marked as sleepable and this function should
+ * verify that the sleepable states match the expectations. For
+ * now, force-set sleepable here.
+ */
+ prog->aux->sleepable = true;
+ break;
+ }
+
+ return 0;
+}
+
+static int bpf_scx_reg(void *kdata)
+{
+ return scx_ops_enable(kdata);
+}
+
+static void bpf_scx_unreg(void *kdata)
+{
+ scx_ops_disable(SCX_EXIT_UNREG);
+ kthread_flush_work(&scx_ops_disable_work);
+}
+
+static int bpf_scx_init(struct btf *btf)
+{
+ s32 type_id;
+
+ type_id = btf_find_by_name_kind(btf, "task_struct", BTF_KIND_STRUCT);
+ if (type_id < 0)
+ return -EINVAL;
+ task_struct_type = btf_type_by_id(btf, type_id);
+
+ return 0;
+}
+
+/* "extern" to avoid sparse warning, only used in this file */
+extern struct bpf_struct_ops bpf_sched_ext_ops;
+
+struct bpf_struct_ops bpf_sched_ext_ops = {
+ .verifier_ops = &bpf_scx_verifier_ops,
+ .reg = bpf_scx_reg,
+ .unreg = bpf_scx_unreg,
+ .check_member = bpf_scx_check_member,
+ .init_member = bpf_scx_init_member,
+ .init = bpf_scx_init,
+ .name = "sched_ext_ops",
+};
+
+void __init init_sched_ext_class(void)
+{
+ int cpu;
+ u32 v;
+
+ /*
+ * The following is to prevent the compiler from optimizing out the enum
+ * definitions so that BPF scheduler implementations can use them
+ * through the generated vmlinux.h.
+ */
+ WRITE_ONCE(v, SCX_DEQ_SLEEP);
+
+ init_dsq(&scx_dsq_global, SCX_DSQ_GLOBAL);
+#ifdef CONFIG_SMP
+ BUG_ON(!alloc_cpumask_var(&idle_masks.cpu, GFP_KERNEL));
+ BUG_ON(!alloc_cpumask_var(&idle_masks.smt, GFP_KERNEL));
+#endif
+ for_each_possible_cpu(cpu) {
+ struct rq *rq = cpu_rq(cpu);
+
+ init_dsq(&rq->scx.local_dsq, SCX_DSQ_LOCAL);
+ rq->scx.nr_running = 0;
+ }
+}
+
+
+/********************************************************************************
+ * Helpers that can be called from the BPF scheduler.
+ */
+#include <linux/btf_ids.h>
+
+/* Disables missing prototype warnings for kfuncs */
+__diag_push();
+__diag_ignore_all("-Wmissing-prototypes",
+ "Global functions as their definitions will be in vmlinux BTF");
+
+/**
+ * scx_bpf_create_dsq - Create a dsq
+ * @dsq_id: dsq to create
+ * @node: NUMA node to allocate from
+ *
+ * Create a dsq identified by @dsq_id. Can be called from sleepable operations
+ * including ops.init() and .prep_enable().
+ */
+s32 scx_bpf_create_dsq(u64 dsq_id, s32 node)
+{
+ if (unlikely(node >= (int)nr_node_ids ||
+ (node < 0 && node != NUMA_NO_NODE)))
+ return -EINVAL;
+ return PTR_ERR_OR_ZERO(create_dsq(dsq_id, node));
+}
+
+BTF_SET8_START(scx_kfunc_ids_sleepable)
+BTF_ID_FLAGS(func, scx_bpf_create_dsq)
+BTF_SET8_END(scx_kfunc_ids_sleepable)
+
+static const struct btf_kfunc_id_set scx_kfunc_set_sleepable = {
+ .owner = THIS_MODULE,
+ .set = &scx_kfunc_ids_sleepable,
+};
+
+/**
+ * scx_bpf_dispatch_nr_slots - Return the number of remaining dispatch slots
+ */
+u32 scx_bpf_dispatch_nr_slots(void)
+{
+ return dispatch_max_batch - __this_cpu_read(dispatch_buf_cursor);
+}
+
+/**
+ * scx_bpf_dispatch - Dispatch a task to a dsq
+ * @p: task_struct to dispatch
+ * @dsq_id: dsq to dispatch to
+ * @slice: duration @p can run for in nsecs
+ * @enq_flags: SCX_ENQ_*
+ *
+ * Dispatch @p to the dsq identified by @dsq_id. It is safe to call this
+ * function spuriously. Can be called from ops.enqueue() and ops.dispatch().
+ *
+ * When called from ops.enqueue(), it's for direct dispatch and @p must match
+ * the task being enqueued. Also, %SCX_DSQ_LOCAL_ON can't be used to target the
+ * local dsq of a CPU other than the enqueueing one. Use ops.select_cpu() to be
+ * on the target CPU in the first place.
+ *
+ * When called from ops.dispatch(), there are no restrictions on @p or @dsq_id
+ * and this function can be called up to ops.dispatch_max_batch times to
+ * dispatch multiple tasks. scx_bpf_dispatch_nr_slots() returns the number of
+ * the remaining slots.
+ *
+ * @p is allowed to run for @slice. The scheduling path is triggered on slice
+ * exhaustion. If zero, the current residual slice is maintained. If
+ * %SCX_SLICE_INF, @p never expires and the BPF scheduler must kick the CPU with
+ * scx_bpf_kick_cpu() to trigger scheduling.
+ */
+void scx_bpf_dispatch(struct task_struct *p, u64 dsq_id, u64 slice,
+ u64 enq_flags)
+{
+ struct task_struct *ddsp_task;
+ int idx;
+
+ lockdep_assert_irqs_disabled();
+
+ if (unlikely(!p)) {
+ scx_ops_error("called with NULL task");
+ return;
+ }
+
+ if (unlikely(enq_flags & __SCX_ENQ_INTERNAL_MASK)) {
+ scx_ops_error("invalid enq_flags 0x%llx", enq_flags);
+ return;
+ }
+
+ if (slice)
+ p->scx.slice = slice;
+ else
+ p->scx.slice = p->scx.slice ?: 1;
+
+ ddsp_task = __this_cpu_read(direct_dispatch_task);
+ if (ddsp_task) {
+ direct_dispatch(ddsp_task, p, dsq_id, enq_flags);
+ return;
+ }
+
+ idx = __this_cpu_read(dispatch_buf_cursor);
+ if (unlikely(idx >= dispatch_max_batch)) {
+ scx_ops_error("dispatch buffer overflow");
+ return;
+ }
+
+ this_cpu_ptr(dispatch_buf)[idx] = (struct dispatch_buf_ent){
+ .task = p,
+ .qseq = atomic64_read(&p->scx.ops_state) & SCX_OPSS_QSEQ_MASK,
+ .dsq_id = dsq_id,
+ .enq_flags = enq_flags,
+ };
+ __this_cpu_inc(dispatch_buf_cursor);
+}
+
+BTF_SET8_START(scx_kfunc_ids_dispatch)
+BTF_ID_FLAGS(func, scx_bpf_dispatch_nr_slots)
+BTF_ID_FLAGS(func, scx_bpf_dispatch)
+BTF_SET8_END(scx_kfunc_ids_dispatch)
+
+static const struct btf_kfunc_id_set scx_kfunc_set_dispatch = {
+ .owner = THIS_MODULE,
+ .set = &scx_kfunc_ids_dispatch,
+};
+
+/**
+ * scx_bpf_consume - Transfer a task from a dsq to the current CPU's local dsq
+ * @dsq_id: dsq to consume
+ *
+ * Consume a task from the dsq identified by @dsq_id and transfer it to the
+ * current CPU's local dsq for execution. Can only be called from
+ * ops.consume[_final]().
+ *
+ * Returns %true if a task has been consumed, %false if there isn't any task to
+ * consume.
+ */
+bool scx_bpf_consume(u64 dsq_id)
+{
+ struct consume_ctx *cctx = this_cpu_ptr(&consume_ctx);
+ struct scx_dispatch_q *dsq;
+
+ dsq = find_non_local_dsq(dsq_id);
+ if (unlikely(!dsq)) {
+ scx_ops_error("invalid dsq_id 0x%016llx", dsq_id);
+ return false;
+ }
+
+ return consume_dispatch_q(cctx->rq, cctx->rf, dsq);
+}
+
+BTF_SET8_START(scx_kfunc_ids_consume)
+BTF_ID_FLAGS(func, scx_bpf_consume)
+BTF_SET8_END(scx_kfunc_ids_consume)
+
+static const struct btf_kfunc_id_set scx_kfunc_set_consume = {
+ .owner = THIS_MODULE,
+ .set = &scx_kfunc_ids_consume,
+};
+
+/**
+ * scx_bpf_dsq_nr_queued - Return the number of queued tasks
+ * @dsq_id: id of the dsq
+ *
+ * Return the number of tasks in the dsq matching @dsq_id. If not found,
+ * -%ENOENT is returned. Can be called from any non-sleepable online scx_ops
+ * operations.
+ */
+s32 scx_bpf_dsq_nr_queued(u64 dsq_id)
+{
+ struct scx_dispatch_q *dsq;
+
+ lockdep_assert(rcu_read_lock_any_held());
+
+ if (dsq_id == SCX_DSQ_LOCAL) {
+ return this_rq()->scx.local_dsq.nr;
+ } else if ((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON) {
+ s32 cpu = dsq_id & SCX_DSQ_LOCAL_CPU_MASK;
+
+ if (ops_cpu_valid(cpu))
+ return cpu_rq(cpu)->scx.local_dsq.nr;
+ } else {
+ dsq = find_non_local_dsq(dsq_id);
+ if (dsq)
+ return dsq->nr;
+ }
+ return -ENOENT;
+}
+
+/**
+ * scx_bpf_test_and_clear_cpu_idle - Test and clear @cpu's idle state
+ * @cpu: cpu to test and clear idle for
+ *
+ * Returns %true if @cpu was idle and its idle state was successfully cleared.
+ * %false otherwise.
+ *
+ * Unavailable if ops.update_idle() is implemented and
+ * %SCX_OPS_KEEP_BUILTIN_IDLE is not set.
+ */
+bool scx_bpf_test_and_clear_cpu_idle(s32 cpu)
+{
+ if (!static_branch_likely(&scx_builtin_idle_enabled)) {
+ scx_ops_error("built-in idle tracking is disabled");
+ return false;
+ }
+
+ if (ops_cpu_valid(cpu))
+ return test_and_clear_cpu_idle(cpu);
+ else
+ return false;
+}
+
+/**
+ * scx_bpf_pick_idle_cpu - Pick and claim an idle cpu
+ * @cpus_allowed: Allowed cpumask
+ *
+ * Pick and claim an idle cpu which is also in @cpus_allowed. Returns the picked
+ * idle cpu number on success. -%EBUSY if no matching cpu was found.
+ *
+ * Unavailable if ops.update_idle() is implemented and
+ * %SCX_OPS_KEEP_BUILTIN_IDLE is not set.
+ */
+s32 scx_bpf_pick_idle_cpu(const struct cpumask *cpus_allowed)
+{
+ if (!static_branch_likely(&scx_builtin_idle_enabled)) {
+ scx_ops_error("built-in idle tracking is disabled");
+ return -EBUSY;
+ }
+
+ return scx_pick_idle_cpu(cpus_allowed);
+}
+
+BTF_SET8_START(scx_kfunc_ids_online)
+BTF_ID_FLAGS(func, scx_bpf_dsq_nr_queued)
+BTF_ID_FLAGS(func, scx_bpf_test_and_clear_cpu_idle)
+BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu)
+BTF_SET8_END(scx_kfunc_ids_online)
+
+static const struct btf_kfunc_id_set scx_kfunc_set_online = {
+ .owner = THIS_MODULE,
+ .set = &scx_kfunc_ids_online,
+};
+
+struct scx_bpf_error_bstr_bufs {
+ u64 data[MAX_BPRINTF_VARARGS];
+ char msg[SCX_EXIT_MSG_LEN];
+};
+
+static DEFINE_PER_CPU(struct scx_bpf_error_bstr_bufs, scx_bpf_error_bstr_bufs);
+
+/**
+ * scx_bpf_error_bstr - Indicate fatal error
+ * @fmt: error message format string
+ * @data: format string parameters packaged using ___bpf_fill() macro
+ * @data_len: @data len
+ *
+ * Indicate that the BPF scheduler encountered a fatal error and initiate ops
+ * disabling.
+ */
+void scx_bpf_error_bstr(char *fmt, unsigned long long *data, u32 data_len)
+{
+ struct scx_bpf_error_bstr_bufs *bufs;
+ unsigned long flags;
+ u32 *bin_args;
+ int ret;
+
+ local_irq_save(flags);
+ bufs = this_cpu_ptr(&scx_bpf_error_bstr_bufs);
+
+ if (data_len % 8 || data_len > MAX_BPRINTF_VARARGS * 8 ||
+ (data_len && !data)) {
+ scx_ops_error("invalid data=%p and data_len=%u",
+ (void *)data, data_len);
+ goto out_restore;
+ }
+
+ ret = copy_from_kernel_nofault(bufs->data, data, data_len);
+ if (ret) {
+ scx_ops_error("failed to read data fields (%d)", ret);
+ goto out_restore;
+ }
+
+ ret = bpf_bprintf_prepare(fmt, UINT_MAX, bufs->data, &bin_args,
+ data_len / 8);
+ if (ret < 0) {
+ scx_ops_error("failed to prepare format string (%d)", ret);
+ goto out_restore;
+ }
+
+ ret = bstr_printf(bufs->msg, sizeof(bufs->msg), fmt, bin_args);
+ bpf_bprintf_cleanup();
+ if (ret < 0) {
+ scx_ops_error("scx_ops_error(\"%s\", %p, %u) failed to format",
+ fmt, data, data_len);
+ goto out_restore;
+ }
+
+ scx_ops_error_type(SCX_EXIT_ERROR_BPF, "%s", bufs->msg);
+out_restore:
+ local_irq_restore(flags);
+}
+
+/**
+ * scx_bpf_destroy_dsq - Destroy a dsq
+ * @dsq_id: dsq to destroy
+ *
+ * Destroy the dsq identified by @dsq_id. Only dsqs created with
+ * scx_bpf_create_dsq() can be destroyed. The caller must ensure that the dsq is
+ * empty and no further tasks are dispatched to it. Ignored if called on a dsq
+ * which doesn't exist. Can be called from any online scx_ops operations.
+ */
+void scx_bpf_destroy_dsq(u64 dsq_id)
+{
+ destroy_dsq(dsq_id);
+}
+
+/**
+ * scx_bpf_task_running - Is task currently running?
+ * @p: task of interest
+ */
+bool scx_bpf_task_running(const struct task_struct *p)
+{
+ return task_rq(p)->curr == p;
+}
+
+/**
+ * scx_bpf_task_cpu - CPU a task is currently associated with
+ * @p: task of interest
+ */
+s32 scx_bpf_task_cpu(const struct task_struct *p)
+{
+ return task_cpu(p);
+}
+
+BTF_SET8_START(scx_kfunc_ids_any)
+BTF_ID_FLAGS(func, scx_bpf_error_bstr)
+BTF_ID_FLAGS(func, scx_bpf_destroy_dsq)
+BTF_ID_FLAGS(func, scx_bpf_task_running)
+BTF_ID_FLAGS(func, scx_bpf_task_cpu)
+BTF_SET8_END(scx_kfunc_ids_any)
+
+static const struct btf_kfunc_id_set scx_kfunc_set_any = {
+ .owner = THIS_MODULE,
+ .set = &scx_kfunc_ids_any,
+};
+
+__diag_pop();
+
+/*
+ * This can't be done from init_sched_ext_class() as register_btf_kfunc_id_set()
+ * needs most of the system to be up.
+ */
+static int __init register_ext_kfuncs(void)
+{
+ int ret;
+
+ /*
+ * FIXME - Many kfunc helpers are context-sensitive and can only be
+ * called from specific scx_ops operations. Unfortunately, we can't
+ * currently tell which operation we're verifying for. For now,
+ * allow all kfuncs for everybody.
+ */
+ if ((ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
+ &scx_kfunc_set_sleepable)) ||
+ (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
+ &scx_kfunc_set_dispatch)) ||
+ (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
+ &scx_kfunc_set_consume)) ||
+ (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
+ &scx_kfunc_set_online)) ||
+ (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
+ &scx_kfunc_set_any))) {
+ pr_err("sched_ext: failed to register kfunc sets (%d)\n", ret);
+ return ret;
+ }
+
+ return 0;
+}
+__initcall(register_ext_kfuncs);
diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h
index f348158ed33a..6d5669481274 100644
--- a/kernel/sched/ext.h
+++ b/kernel/sched/ext.h
@@ -1,7 +1,106 @@
/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2022 Tejun Heo <[email protected]>
+ * Copyright (c) 2022 David Vernet <[email protected]>
+ */
+enum scx_wake_flags {
+ /* expose select WF_* flags as enums */
+ SCX_WAKE_EXEC = WF_EXEC,
+ SCX_WAKE_FORK = WF_FORK,
+ SCX_WAKE_TTWU = WF_TTWU,
+ SCX_WAKE_SYNC = WF_SYNC,
+};
+
+enum scx_enq_flags {
+ /* expose select ENQUEUE_* flags as enums */
+ SCX_ENQ_WAKEUP = ENQUEUE_WAKEUP,
+ SCX_ENQ_HEAD = ENQUEUE_HEAD,
+
+ /* high 32bits are ext specific flags */
+
+ /*
+ * The task being enqueued is the only task available for the cpu. By
+ * default, ext core keeps executing such tasks but when
+ * %SCX_OPS_ENQ_LAST is specified, they're ops.enqueue()'d with
+ * %SCX_ENQ_LAST and %SCX_ENQ_LOCAL flags set.
+ *
+ * If the BPF scheduler wants to continue executing the task,
+ * ops.enqueue() should dispatch the task to %SCX_DSQ_LOCAL immediately.
+ * If the task gets queued on a different dsq or the BPF side, the BPF
+ * scheduler is responsible for triggering a follow-up scheduling event.
+ * Otherwise, execution may stall.
+ */
+ SCX_ENQ_LAST = 1LLU << 41,
+
+ /*
+ * A hint indicating that it's advisable to enqueue the task on the
+ * local dsq of the currently selected CPU. Currently used by
+ * select_cpu_dfl() and together with %SCX_ENQ_LAST.
+ */
+ SCX_ENQ_LOCAL = 1LLU << 42,
+
+ /* high 8 bits are internal */
+ __SCX_ENQ_INTERNAL_MASK = 0xffLLU << 56,
+
+ SCX_ENQ_CLEAR_OPSS = 1LLU << 56,
+};
+
+enum scx_deq_flags {
+ /* expose select DEQUEUE_* flags as enums */
+ SCX_DEQ_SLEEP = DEQUEUE_SLEEP,
+};
#ifdef CONFIG_SCHED_CLASS_EXT
-#error "NOT IMPLEMENTED YET"
+
+struct sched_enq_and_set_ctx {
+ struct task_struct *p;
+ int queue_flags;
+ bool queued;
+ bool running;
+};
+
+void sched_deq_and_put_task(struct task_struct *p, int queue_flags,
+ struct sched_enq_and_set_ctx *ctx);
+void sched_enq_and_set_task(struct sched_enq_and_set_ctx *ctx);
+
+extern const struct sched_class ext_sched_class;
+extern const struct bpf_verifier_ops bpf_sched_ext_verifier_ops;
+extern const struct file_operations sched_ext_fops;
+
+DECLARE_STATIC_KEY_FALSE(__scx_ops_enabled);
+#define scx_enabled() static_branch_unlikely(&__scx_ops_enabled)
+
+bool task_on_scx(struct task_struct *p);
+void scx_pre_fork(struct task_struct *p);
+int scx_fork(struct task_struct *p);
+void scx_post_fork(struct task_struct *p);
+void scx_cancel_fork(struct task_struct *p);
+int balance_scx(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
+void init_sched_ext_class(void);
+
+static inline const struct sched_class *next_active_class(const struct sched_class *class)
+{
+ class++;
+ if (!scx_enabled() && class == &ext_sched_class)
+ class++;
+ return class;
+}
+
+#define for_active_class_range(class, _from, _to) \
+ for (class = (_from); class != (_to); class = next_active_class(class))
+
+#define for_each_active_class(class) \
+ for_active_class_range(class, __sched_class_highest, __sched_class_lowest)
+
+/*
+ * SCX requires a balance() call before every pick_next_task() call including
+ * when waking up from idle.
+ */
+#define for_balance_class_range(class, prev_class, end_class) \
+ for_active_class_range(class, (prev_class) > &ext_sched_class ? \
+ &ext_sched_class : (prev_class), (end_class))
+
#else /* CONFIG_SCHED_CLASS_EXT */
#define scx_enabled() false
@@ -28,7 +127,13 @@ static inline void balance_scx_on_up(struct rq *rq, struct task_struct *prev,
#endif
#if defined(CONFIG_SCHED_CLASS_EXT) && defined(CONFIG_SMP)
-#error "NOT IMPLEMENTED YET"
+void __scx_update_idle(struct rq *rq, bool idle);
+
+static inline void scx_update_idle(struct rq *rq, bool idle)
+{
+ if (scx_enabled())
+ __scx_update_idle(rq, idle);
+}
#else
static inline void scx_update_idle(struct rq *rq, bool idle) {}
#endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c00c27de2a30..18a80d5b542b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -185,6 +185,10 @@ static inline int idle_policy(int policy)
static inline int normal_policy(int policy)
{
+#ifdef CONFIG_SCHED_CLASS_EXT
+ if (policy == SCHED_EXT)
+ return true;
+#endif
return policy == SCHED_NORMAL;
}
@@ -685,6 +689,18 @@ struct cfs_rq {
#endif /* CONFIG_FAIR_GROUP_SCHED */
};
+#ifdef CONFIG_SCHED_CLASS_EXT
+struct scx_rq {
+ struct scx_dispatch_q local_dsq;
+ u64 ops_qseq;
+ u32 nr_running;
+#ifdef CONFIG_SMP
+ cpumask_var_t cpus_to_kick;
+ cpumask_var_t cpus_to_preempt;
+#endif
+};
+#endif /* CONFIG_SCHED_CLASS_EXT */
+
static inline int rt_bandwidth_enabled(void)
{
return sysctl_sched_rt_runtime >= 0;
@@ -1026,6 +1042,9 @@ struct rq {
struct cfs_rq cfs;
struct rt_rq rt;
struct dl_rq dl;
+#ifdef CONFIG_SCHED_CLASS_EXT
+ struct scx_rq scx;
+#endif
#ifdef CONFIG_FAIR_GROUP_SCHED
/* list of leaf cfs_rq on this CPU: */
--
2.38.1
Pass an extra @prog parameter to bpf_struct_ops->check_member(). This will
be used in the future to verify @prog->aux->sleepable per-operation so that
a subset of operations in a struct_ops can be sleepable.
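As a rough illustration of what the extra @prog argument makes possible,
the sketch below models a check_member() that accepts sleepable programs
only for selected members. The types are simplified userspace stand-ins, and
struct demo_ops and demo_check_member() are hypothetical names used only for
this example; this is not the kernel implementation.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel types; only the fields used here. */
struct bpf_prog_aux { bool sleepable; };
struct bpf_prog { struct bpf_prog_aux *aux; };

/* Hypothetical ops table in which only init() is allowed to sleep. */
struct demo_ops {
	void (*enqueue)(void);
	int (*init)(void);
};

/*
 * Model of a per-operation check: @moff is the byte offset of the member
 * the program attaches to, @prog is the program being verified.
 */
static int demo_check_member(size_t moff, const struct bpf_prog *prog)
{
	if (!prog->aux->sleepable)
		return 0;	/* non-sleepable programs fit every member */

	switch (moff) {
	case offsetof(struct demo_ops, init):
		return 0;	/* a sleepable program is acceptable here */
	default:
		return -EINVAL;	/* reject sleepable programs elsewhere */
	}
}
```

In this model a sleepable program attached to init() passes while the same
program attached to enqueue() is rejected, which is the kind of per-member
decision the current two-argument callback cannot make.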
Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
include/linux/bpf.h | 3 ++-
kernel/bpf/verifier.c | 2 +-
net/ipv4/bpf_tcp_ca.c | 3 ++-
3 files changed, 5 insertions(+), 3 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 798aec816970..82f5c30100fd 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1337,7 +1337,8 @@ struct bpf_struct_ops {
const struct bpf_verifier_ops *verifier_ops;
int (*init)(struct btf *btf);
int (*check_member)(const struct btf_type *t,
- const struct btf_member *member);
+ const struct btf_member *member,
+ struct bpf_prog *prog);
int (*init_member)(const struct btf_type *t,
const struct btf_member *member,
void *kdata, const void *udata);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 07c0259dfc1a..b5dba33f8e7d 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -14928,7 +14928,7 @@ static int check_struct_ops_btf_id(struct bpf_verifier_env *env)
}
if (st_ops->check_member) {
- int err = st_ops->check_member(t, member);
+ int err = st_ops->check_member(t, member, prog);
if (err) {
verbose(env, "attach to unsupported member %s of struct %s\n",
diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c
index 6da16ae6a962..24c819509b98 100644
--- a/net/ipv4/bpf_tcp_ca.c
+++ b/net/ipv4/bpf_tcp_ca.c
@@ -247,7 +247,8 @@ static int bpf_tcp_ca_init_member(const struct btf_type *t,
}
static int bpf_tcp_ca_check_member(const struct btf_type *t,
- const struct btf_member *member)
+ const struct btf_member *member,
+ struct bpf_prog *prog)
{
if (is_unsupported(__btf_member_bit_offset(t, member) / 8))
return -ENOTSUPP;
--
2.38.1
Being able to track a task's runnable and running state transitions is
useful for a variety of purposes including latency tracking and load factor
calculation.
Currently, BPF schedulers don't have a good way of tracking these
transitions. Becoming runnable can be determined from ops.enqueue() but
becoming quiescent can only be inferred from the lack of subsequent enqueue.
Also, as the local dsq can have multiple tasks and some events are handled
in the sched_ext core, it's difficult to determine when a given task starts
and stops executing.
This patch adds sched_ext_ops.runnable(), .running(), .stopping() and
.quiescent() operations to track the task runnable and running state
transitions. They're mostly self explanatory; however, we want to ensure
that running <-> stopping transitions are always contained within runnable
<-> quiescent transitions which is a bit different from how the scheduler
core behaves. This adds a bit of complication. See the comment in
dequeue_task_scx().
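The nesting guarantee described above can be modelled with a small
userspace state tracker. This is only a sketch for reasoning about callback
ordering; struct task_track and the track_*() helpers are hypothetical and
are not part of sched_ext.

```c
#include <stdbool.h>

/* Hypothetical per-task tracker; not part of sched_ext. */
struct task_track {
	bool runnable;			/* between ->runnable() and ->quiescent() */
	bool running;			/* between ->running() and ->stopping() */
	unsigned int violations;	/* callbacks that broke the nesting */
};

static void track_runnable(struct task_track *t)
{
	if (t->runnable)
		t->violations++;
	t->runnable = true;
}

static void track_running(struct task_track *t)
{
	/* running must start inside a runnable period */
	if (!t->runnable || t->running)
		t->violations++;
	t->running = true;
}

static void track_stopping(struct task_track *t)
{
	if (!t->running)
		t->violations++;
	t->running = false;
}

static void track_quiescent(struct task_track *t)
{
	/* quiescent must come after the final stopping */
	if (!t->runnable || t->running)
		t->violations++;
	t->runnable = false;
}
```

Feeding the tracker runnable, running, stopping, quiescent leaves the
violation count at zero, whereas calling quiescent while the task is still
marked running is counted as a violation. That second ordering is what
triggering ->stopping() early from dequeue_task_scx() avoids.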
Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
include/linux/sched/ext.h | 65 +++++++++++++++++++++++++++++++++++++++
kernel/sched/ext.c | 31 +++++++++++++++++++
2 files changed, 96 insertions(+)
diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index 6e25c3431bb4..4f8898556b28 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -209,6 +209,71 @@ struct sched_ext_ops {
*/
void (*consume_final)(s32 cpu);
+ /**
+ * runnable - A task is becoming runnable on its associated CPU
+ * @p: task becoming runnable
+ * @enq_flags: %SCX_ENQ_*
+ *
+ * This and the following three functions can be used to track a task's
+ * execution state transitions. A task becomes ->runnable() on a CPU,
+ * and then goes through one or more ->running() and ->stopping() pairs
+ * as it runs on the CPU, and eventually becomes ->quiescent() when it's
+ * done running on the CPU.
+ *
+ * @p is becoming runnable on the CPU because it's
+ *
+ * - waking up (%SCX_ENQ_WAKEUP)
+ * - being moved from another CPU
+ * - being restored after being temporarily taken off the queue for an
+ * attribute change.
+ *
+ * This and ->enqueue() are related but not coupled. This operation
+ * notifies @p's state transition and may not be followed by ->enqueue()
+ * e.g. when @p is being dispatched to a remote CPU. Likewise, a task
+ * may be ->enqueue()'d without being preceded by this operation e.g.
+ * after exhausting its slice.
+ */
+ void (*runnable)(struct task_struct *p, u64 enq_flags);
+
+ /**
+ * running - A task is starting to run on its associated CPU
+ * @p: task starting to run
+ *
+ * See ->runnable() for explanation on the task state notifiers.
+ */
+ void (*running)(struct task_struct *p);
+
+ /**
+ * stopping - A task is stopping execution
+ * @p: task that is stopping execution
+ * @runnable: is task @p still runnable?
+ *
+ * See ->runnable() for explanation on the task state notifiers. If
+ * !@runnable, ->quiescent() will be invoked after this operation
+ * returns.
+ */
+ void (*stopping)(struct task_struct *p, bool runnable);
+
+ /**
+ * quiescent - A task is becoming not runnable on its associated CPU
+ * @p: task becoming not runnable
+ * @deq_flags: %SCX_DEQ_*
+ *
+ * See ->runnable() for explanation on the task state notifiers.
+ *
+ * @p is becoming quiescent on the CPU because it's
+ *
+ * - sleeping (%SCX_DEQ_SLEEP)
+ * - being moved to another CPU
+ * - being temporarily taken off the queue for an attribute change
+ * (%SCX_DEQ_SAVE)
+ *
+ * This and ->dequeue() are related but not coupled. This operation
+ * notifies @p's state transition and may not be preceded by ->dequeue()
+ * e.g. when @p is being dispatched to a remote CPU.
+ */
+ void (*quiescent)(struct task_struct *p, u64 deq_flags);
+
/**
* yield - Yield CPU
* @from: yielding task
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 4a98047a06bc..2eb382ed0e2f 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -670,6 +670,9 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags
rq->scx.nr_running++;
add_nr_running(rq, 1);
+ if (SCX_HAS_OP(runnable))
+ scx_ops.runnable(p, enq_flags);
+
do_enqueue_task(rq, p, enq_flags, sticky_cpu);
}
@@ -716,6 +719,26 @@ static void dequeue_task_scx(struct rq *rq, struct task_struct *p, int deq_flags
break;
}
+ /*
+ * A currently running task which is going off @rq first gets dequeued
+ * and then stops running. As we want running <-> stopping transitions
+ * to be contained within runnable <-> quiescent transitions, trigger
+ * ->stopping() early here instead of in put_prev_task_scx().
+ *
+ * @p may go through multiple stopping <-> running transitions between
+ * here and put_prev_task_scx() if task attribute changes occur while
+ * balance_scx() leaves @rq unlocked. However, they don't contain any
+ * information meaningful to the BPF scheduler and can be suppressed by
+ * skipping the callbacks if the task is !QUEUED.
+ */
+ if (SCX_HAS_OP(stopping) && task_current(rq, p)) {
+ update_curr_scx(rq);
+ scx_ops.stopping(p, false);
+ }
+
+ if (SCX_HAS_OP(quiescent))
+ scx_ops.quiescent(p, deq_flags);
+
p->scx.flags &= ~SCX_TASK_QUEUED;
scx_rq->nr_running--;
sub_nr_running(rq, 1);
@@ -1223,6 +1246,10 @@ static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first)
p->se.exec_start = rq_clock_task(rq);
+ /* see dequeue_task_scx() on why we skip when !QUEUED */
+ if (SCX_HAS_OP(running) && (p->scx.flags & SCX_TASK_QUEUED))
+ scx_ops.running(p);
+
watchdog_unwatch_task(p, true);
}
@@ -1230,6 +1257,10 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p)
{
update_curr_scx(rq);
+ /* see dequeue_task_scx() on why we skip when !QUEUED */
+ if (SCX_HAS_OP(stopping) && (p->scx.flags & SCX_TASK_QUEUED))
+ scx_ops.stopping(p, true);
+
/*
* If we're being called from put_prev_task_balance(), balance_scx() may
* have decided that @p should keep running.
--
2.38.1
Allow BPF schedulers to indicate tickless operation by setting p->scx.slice
to SCX_SLICE_INF. A CPU whose current task has an infinite slice goes into
tickless operation.
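A minimal userspace model of the two pieces involved may help: slice
accounting skips tasks with an infinite slice, and the per-rq "can stop
tick" flag is refreshed only when it disagrees with the slice of the task
being picked. The model_* names and the trimmed-down structs are
assumptions made for this sketch; the counter merely stands in for
sched_update_tick_dependency().

```c
#include <stdbool.h>
#include <stdint.h>

#define SLICE_INF		UINT64_MAX	/* mirrors SCX_SLICE_INF = U64_MAX */
#define RQ_CAN_STOP_TICK	(1u << 0)	/* mirrors SCX_RQ_CAN_STOP_TICK */

/* Hypothetical, trimmed-down stand-ins for task and runqueue state. */
struct model_task { uint64_t slice; };
struct model_rq { uint32_t flags; unsigned int tick_dep_updates; };

/* Charge @delta_exec ns to @p; an infinite slice is never consumed. */
static void model_update_curr(struct model_task *p, uint64_t delta_exec)
{
	if (p->slice != SLICE_INF)
		p->slice -= (p->slice < delta_exec) ? p->slice : delta_exec;
}

/* Refresh the can-stop-tick flag when @p is picked to run on @rq. */
static void model_set_next_task(struct model_rq *rq,
				const struct model_task *p)
{
	bool inf = p->slice == SLICE_INF;

	if (inf != (bool)(rq->flags & RQ_CAN_STOP_TICK)) {
		if (inf)
			rq->flags |= RQ_CAN_STOP_TICK;
		else
			rq->flags &= ~RQ_CAN_STOP_TICK;
		/* stands in for sched_update_tick_dependency(rq) */
		rq->tick_dep_updates++;
	}
}
```

Picking two infinite-slice tasks back to back updates the tick dependency
only once in this model, which reflects why the patch compares the flag
against the slice before touching it.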
scx_example_central is updated to use tickless operations for all tasks and
instead use a BPF timer to expire slices. This also uses the SCX_ENQ_PREEMPT
and task state tracking added by the previous patches.
Currently, there is no way to pin the timer on the central CPU, so it may
end up on one of the worker CPUs; however, outside of that, the worker CPUs
can go tickless both while running sched_ext tasks and idling.
With schbench running, scx_example_central shows:
root@test ~# grep ^LOC /proc/interrupts; sleep 10; grep ^LOC /proc/interrupts
LOC: 142024 656 664 449 Local timer interrupts
LOC: 161663 663 665 449 Local timer interrupts
Without it:
root@test ~ [SIGINT]# grep ^LOC /proc/interrupts; sleep 10; grep ^LOC /proc/interrupts
LOC: 188778 3142 3793 3993 Local timer interrupts
LOC: 198993 5314 6323 6438 Local timer interrupts
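Taking per-CPU differences between the two samples of each run makes the
effect easier to read. The helper below is only an illustration; the arrays
copy the figures quoted above and the names are hypothetical.

```c
#include <stddef.h>

#define NR_CPUS_SAMPLED 4

/* LOC counters copied from the two runs quoted above. */
static const unsigned long tickless_before[NR_CPUS_SAMPLED] = { 142024, 656, 664, 449 };
static const unsigned long tickless_after[NR_CPUS_SAMPLED]  = { 161663, 663, 665, 449 };
static const unsigned long ticking_before[NR_CPUS_SAMPLED]  = { 188778, 3142, 3793, 3993 };
static const unsigned long ticking_after[NR_CPUS_SAMPLED]   = { 198993, 5314, 6323, 6438 };

/* Timer interrupts taken by @cpu over the 10 second window. */
static unsigned long loc_delta(const unsigned long *before,
			       const unsigned long *after, size_t cpu)
{
	return after[cpu] - before[cpu];
}
```

With scx_example_central the last three CPU columns take 7, 1 and 0 timer
interrupts over the window, against 2172, 2530 and 2445 without it. The
first column rises from 10215 to 19639.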
While scx_example_central itself is too bare-bones to be useful as a
production scheduler, a more featureful central scheduler can be built using
the same approach. Google's experience shows that such an approach can have
significant benefits for certain applications such as VM hosting.
Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
include/linux/sched/ext.h | 1 +
kernel/sched/core.c | 9 +-
kernel/sched/ext.c | 43 +++++-
kernel/sched/ext.h | 2 +
kernel/sched/sched.h | 6 +
tools/sched_ext/scx_example_central.bpf.c | 160 +++++++++++++++++++++-
tools/sched_ext/scx_example_central.c | 3 +-
7 files changed, 212 insertions(+), 12 deletions(-)
diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index 4f8898556b28..b1c95fb11c8d 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -19,6 +19,7 @@ enum scx_consts {
SCX_EXIT_MSG_LEN = 1024,
SCX_SLICE_DFL = 20 * NSEC_PER_MSEC,
+ SCX_SLICE_INF = U64_MAX, /* infinite, implies nohz */
};
/*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 20536957840d..89d2421809da 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1200,13 +1200,16 @@ bool sched_can_stop_tick(struct rq *rq)
return true;
/*
- * If there are no DL,RR/FIFO tasks, there must only be CFS tasks left;
- * if there's more than one we need the tick for involuntary
- * preemption.
+ * If there are no DL,RR/FIFO tasks, there must only be CFS or SCX tasks
+ * left. For CFS, if there's more than one we need the tick for
+ * involuntary preemption. For SCX, ask.
*/
if (!scx_switched_all() && rq->nr_running > 1)
return false;
+ if (scx_enabled() && !scx_can_stop_tick(rq))
+ return false;
+
return true;
}
#endif /* CONFIG_NO_HZ_FULL */
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 2eb382ed0e2f..cf6493f684f3 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -383,7 +383,8 @@ static void update_curr_scx(struct rq *rq)
account_group_exec_runtime(curr, delta_exec);
cgroup_account_cputime(curr, delta_exec);
- curr->scx.slice -= min(curr->scx.slice, delta_exec);
+ if (curr->scx.slice != SCX_SLICE_INF)
+ curr->scx.slice -= min(curr->scx.slice, delta_exec);
}
static void dispatch_enqueue(struct scx_dispatch_q *dsq, struct task_struct *p,
@@ -1251,6 +1252,20 @@ static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first)
scx_ops.running(p);
watchdog_unwatch_task(p, true);
+
+ /*
+ * @p is getting newly scheduled or got kicked after someone updated its
+ * slice. Refresh whether tick can be stopped. See scx_can_stop_tick().
+ */
+ if ((p->scx.slice == SCX_SLICE_INF) !=
+ (bool)(rq->scx.flags & SCX_RQ_CAN_STOP_TICK)) {
+ if (p->scx.slice == SCX_SLICE_INF)
+ rq->scx.flags |= SCX_RQ_CAN_STOP_TICK;
+ else
+ rq->scx.flags &= ~SCX_RQ_CAN_STOP_TICK;
+
+ sched_update_tick_dependency(rq);
+ }
}
static void put_prev_task_scx(struct rq *rq, struct task_struct *p)
@@ -1742,6 +1757,26 @@ int scx_check_setscheduler(struct task_struct *p, int policy)
return 0;
}
+#ifdef CONFIG_NO_HZ_FULL
+bool scx_can_stop_tick(struct rq *rq)
+{
+ struct task_struct *p = rq->curr;
+
+ if (scx_ops_disabling())
+ return false;
+
+ if (p->sched_class != &ext_sched_class)
+ return true;
+
+ /*
+ * @rq can consume from different dsq's, so we can't tell whether it
+ * needs the tick or not by looking at nr_running. Allow stopping ticks
+ * iff the BPF scheduler indicated so. See set_next_task_scx().
+ */
+ return rq->scx.flags & SCX_RQ_CAN_STOP_TICK;
+}
+#endif
+
/*
* Omitted operations:
*
@@ -1923,7 +1958,7 @@ static void scx_ops_disable_workfn(struct kthread_work *work)
struct scx_task_iter sti;
struct task_struct *p;
const char *reason;
- int i, type;
+ int i, cpu, type;
type = atomic_read(&scx_exit_type);
while (true) {
@@ -2021,6 +2056,10 @@ static void scx_ops_disable_workfn(struct kthread_work *work)
scx_task_iter_exit(&sti);
spin_unlock_irq(&scx_tasks_lock);
+ /* kick all CPUs to restore ticks */
+ for_each_possible_cpu(cpu)
+ resched_cpu(cpu);
+
forward_progress_guaranteed:
/*
* Here, every runnable task is guaranteed to make forward progress and
diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h
index 3597b7b5829e..e9ec267f13d5 100644
--- a/kernel/sched/ext.h
+++ b/kernel/sched/ext.h
@@ -94,6 +94,7 @@ void scx_post_fork(struct task_struct *p);
void scx_cancel_fork(struct task_struct *p);
int balance_scx(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
int scx_check_setscheduler(struct task_struct *p, int policy);
+bool scx_can_stop_tick(struct rq *rq);
void init_sched_ext_class(void);
__printf(2, 3) void scx_ops_error_type(enum scx_exit_type type,
@@ -156,6 +157,7 @@ static inline int balance_scx(struct rq *rq, struct task_struct *prev,
struct rq_flags *rf) { return 0; }
static inline int scx_check_setscheduler(struct task_struct *p,
int policy) { return 0; }
+static inline bool scx_can_stop_tick(struct rq *rq) { return true; }
static inline void init_sched_ext_class(void) {}
static inline void scx_notify_sched_tick(void) {}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0d8b52c52e2b..a95aae3bc69a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -690,11 +690,17 @@ struct cfs_rq {
};
#ifdef CONFIG_SCHED_CLASS_EXT
+/* scx_rq->flags, protected by the rq lock */
+enum scx_rq_flags {
+ SCX_RQ_CAN_STOP_TICK = 1 << 0,
+};
+
struct scx_rq {
struct scx_dispatch_q local_dsq;
struct list_head watchdog_list;
u64 ops_qseq;
u32 nr_running;
+ u32 flags;
#ifdef CONFIG_SMP
cpumask_var_t cpus_to_kick;
cpumask_var_t cpus_to_preempt;
diff --git a/tools/sched_ext/scx_example_central.bpf.c b/tools/sched_ext/scx_example_central.bpf.c
index f53ed4baf92d..ce994e9ecc92 100644
--- a/tools/sched_ext/scx_example_central.bpf.c
+++ b/tools/sched_ext/scx_example_central.bpf.c
@@ -14,7 +14,26 @@
* utilize and verify various scx mechanisms such as LOCAL_ON dispatching and
* consume_final().
*
- * b. Preemption
+ * b. Tickless operation
+ *
+ * All tasks are dispatched with the infinite slice which allows stopping the
+ * ticks on CONFIG_NO_HZ_FULL kernels running with the proper nohz_full
+ * parameter. The tickless operation can be observed through
+ * /proc/interrupts.
+ *
+ * Periodic switching is enforced by a periodic timer checking all CPUs and
+ * preempting them as necessary. Unfortunately, BPF timer currently doesn't
+ * have a way to pin to a specific CPU, so the periodic timer isn't pinned to
+ * the central CPU.
+ *
+ * c. Preemption
+ *
+ * Kthreads are unconditionally queued to the head of a matching local dsq
+ * and dispatched with SCX_ENQ_PREEMPT. This ensures that a kthread is always
+ * prioritized over user threads, which is required for ensuring forward
+ * progress as e.g. the periodic timer may run on a ksoftirqd and if the
+ * ksoftirqd gets starved by a user thread, there may not be anything else to
+ * vacate that user thread.
*
* SCX_KICK_PREEMPT is used to trigger scheduling and CPUs to move to the
* next tasks.
@@ -38,8 +57,18 @@ const volatile bool switch_all;
const volatile s32 central_cpu;
const volatile u32 nr_cpu_ids;
+/*
+ * XXX - kernel should be able to shut down the associated timers. For now,
+ * implement it manually. They should be bool but the verifier gets confused
+ * about the value range of bool variables when verifying the return value of
+ * the loopfns. Also, they can't be static because verification fails with BTF
+ * error message for some reason.
+ */
+int timer_running;
+int timer_kill;
+
u64 nr_total, nr_locals, nr_queued, nr_lost_pids;
-u64 nr_dispatches, nr_mismatches, nr_overflows;
+u64 nr_timers, nr_dispatches, nr_mismatches, nr_overflows;
struct user_exit_info uei;
@@ -51,6 +80,7 @@ struct {
/* can't use percpu map due to bad lookups */
static bool cpu_gimme_task[MAX_CPUS];
+static u64 cpu_started_at[MAX_CPUS];
struct central_timer {
struct bpf_timer timer;
@@ -86,9 +116,22 @@ void BPF_STRUCT_OPS(central_enqueue, struct task_struct *p, u64 enq_flags)
__sync_fetch_and_add(&nr_total, 1);
+ /*
+ * Push per-cpu kthreads at the head of local dsq's and preempt the
+ * corresponding CPU. This ensures that e.g. ksoftirqd isn't blocked
+ * behind other threads which is necessary for forward progress
+ * guarantee as we depend on the BPF timer which may run from ksoftirqd.
+ */
+ if ((p->flags & PF_KTHREAD) && p->nr_cpus_allowed == 1) {
+ __sync_fetch_and_add(&nr_locals, 1);
+ scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_INF,
+ enq_flags | SCX_ENQ_PREEMPT);
+ return;
+ }
+
if (bpf_map_push_elem(&central_q, &pid, 0)) {
__sync_fetch_and_add(&nr_overflows, 1);
- scx_bpf_dispatch(p, FALLBACK_DSQ_ID, SCX_SLICE_DFL, enq_flags);
+ scx_bpf_dispatch(p, FALLBACK_DSQ_ID, SCX_SLICE_INF, enq_flags);
return;
}
@@ -122,12 +165,12 @@ static int dispatch_a_task_loopfn(u32 idx, void *data)
*/
if (!scx_bpf_cpumask_test_cpu(cpu, p->cpus_ptr)) {
__sync_fetch_and_add(&nr_mismatches, 1);
- scx_bpf_dispatch(p, FALLBACK_DSQ_ID, SCX_SLICE_DFL, 0);
+ scx_bpf_dispatch(p, FALLBACK_DSQ_ID, SCX_SLICE_INF, 0);
return 0;
}
/* dispatch to the local and mark that @cpu doesn't need more tasks */
- scx_bpf_dispatch(p, SCX_DSQ_LOCAL_ON | cpu, SCX_SLICE_DFL, 0);
+ scx_bpf_dispatch(p, SCX_DSQ_LOCAL_ON | cpu, SCX_SLICE_INF, 0);
if (cpu != central_cpu)
scx_bpf_kick_cpu(cpu, 0);
@@ -196,16 +239,119 @@ void BPF_STRUCT_OPS(central_consume_final, s32 cpu)
scx_bpf_consume(FALLBACK_DSQ_ID);
}
+void BPF_STRUCT_OPS(central_running, struct task_struct *p)
+{
+ s32 cpu = scx_bpf_task_cpu(p);
+ u64 *started_at = MEMBER_VPTR(cpu_started_at, [cpu]);
+ if (started_at)
+ *started_at = bpf_ktime_get_ns() ?: 1; /* 0 indicates idle */
+}
+
+void BPF_STRUCT_OPS(central_stopping, struct task_struct *p, bool runnable)
+{
+ s32 cpu = scx_bpf_task_cpu(p);
+ u64 *started_at = MEMBER_VPTR(cpu_started_at, [cpu]);
+ if (started_at)
+ *started_at = 0;
+}
+
+static int kick_cpus_loopfn(u32 idx, void *data)
+{
+ s32 cpu = (nr_timers + idx) % nr_cpu_ids;
+ u64 *nr_to_kick = data;
+ u64 now = bpf_ktime_get_ns();
+ u64 *started_at;
+ s32 pid;
+
+ if (cpu == central_cpu)
+ goto kick;
+
+ /* kick iff there's something pending */
+ if (scx_bpf_dsq_nr_queued(FALLBACK_DSQ_ID) ||
+ scx_bpf_dsq_nr_queued(SCX_DSQ_LOCAL_ON | cpu))
+ ;
+ else if (*nr_to_kick)
+ (*nr_to_kick)--;
+ else
+ return 0;
+
+ /* and the current one exhausted its slice */
+ started_at = MEMBER_VPTR(cpu_started_at, [cpu]);
+ if (started_at && *started_at &&
+ vtime_before(now, *started_at + SCX_SLICE_DFL))
+ return 0;
+kick:
+ scx_bpf_kick_cpu(cpu, SCX_KICK_PREEMPT);
+ return 0;
+}
+
+static int central_timerfn(void *map, int *key, struct bpf_timer *timer)
+{
+ u64 nr_to_kick = nr_queued;
+
+ if (timer_kill) {
+ timer_running = 0;
+ return 0;
+ }
+
+ bpf_loop(nr_cpu_ids, kick_cpus_loopfn, &nr_to_kick, 0);
+ bpf_timer_start(timer, TIMER_INTERVAL_NS, 0);
+ __sync_fetch_and_add(&nr_timers, 1);
+ return 0;
+}
+
int BPF_STRUCT_OPS(central_init)
{
+ u32 key = 0;
+ struct bpf_timer *timer;
+ int ret;
+
if (switch_all)
scx_bpf_switch_all();
- return scx_bpf_create_dsq(FALLBACK_DSQ_ID, -1);
+ ret = scx_bpf_create_dsq(FALLBACK_DSQ_ID, -1);
+ if (ret)
+ return ret;
+
+ timer = bpf_map_lookup_elem(&central_timer, &key);
+ if (!timer)
+ return -ESRCH;
+
+ bpf_timer_init(timer, &central_timer, CLOCK_MONOTONIC);
+ bpf_timer_set_callback(timer, central_timerfn);
+ ret = bpf_timer_start(timer, TIMER_INTERVAL_NS, 0);
+ timer_running = !ret;
+ return ret;
+}
+
+static int exit_wait_timer_nested_loopfn(u32 idx, void *data)
+{
+ u64 expiration = *(u64 *)data;
+
+ return !timer_running || vtime_before(expiration, bpf_ktime_get_ns());
+}
+
+static int exit_wait_timer_loopfn(u32 idx, void *data)
+{
+ u64 expiration = *(u64 *)data;
+
+ bpf_loop(1 << 23, exit_wait_timer_nested_loopfn, data, 0);
+ return !timer_running || vtime_before(expiration, bpf_ktime_get_ns());
}
void BPF_STRUCT_OPS(central_exit, struct scx_exit_info *ei)
{
+ u64 expiration = bpf_ktime_get_ns() + 1000 * MS_TO_NS;
+
+ /*
+ * XXX - We just need to make sure that the timer body isn't running on
+ * exit. If we catch the timer while waiting, great. If not, it's still
+ * highly likely that the timer body won't run in the future. Once bpf
+ * can shut down associated timers, this hackery should go away.
+ */
+ timer_kill = 1;
+ bpf_loop(1 << 23, exit_wait_timer_loopfn, &expiration, 0);
+
uei_record(&uei, ei);
}
@@ -223,6 +369,8 @@ struct sched_ext_ops central_ops = {
.dispatch = (void *)central_dispatch,
.consume = (void *)central_consume,
.consume_final = (void *)central_consume_final,
+ .running = (void *)central_running,
+ .stopping = (void *)central_stopping,
.init = (void *)central_init,
.exit = (void *)central_exit,
.name = "central",
diff --git a/tools/sched_ext/scx_example_central.c b/tools/sched_ext/scx_example_central.c
index c85e84459c58..83cbd1932958 100644
--- a/tools/sched_ext/scx_example_central.c
+++ b/tools/sched_ext/scx_example_central.c
@@ -76,7 +76,8 @@ int main(int argc, char **argv)
skel->bss->nr_locals,
skel->bss->nr_queued,
skel->bss->nr_lost_pids);
- printf(" dispatch:%10lu mismatch:%10lu overflow:%10lu\n",
+ printf("timer:%10lu dispatch:%10lu mismatch:%10lu overflow:%10lu\n",
+ skel->bss->nr_timers,
skel->bss->nr_dispatches,
skel->bss->nr_mismatches,
skel->bss->nr_overflows);
--
2.38.1
On Wed, Nov 30, 2022 at 12:23 AM Tejun Heo <[email protected]> wrote:
>
>
> static inline void rht_lock(struct bucket_table *tbl,
> - struct rhash_lock_head __rcu **bkt)
> + struct rhash_lock_head __rcu **bkt,
> + unsigned long *flags)
I guess it doesn't matter as long as this actually gets inlined, but
wouldn't it be better to have
flags = rht_lock(..);
...
rht_unlock(.., flags);
as the calling convention? Rather than passing a pointer to the stack around.
That's what the native _raw_spin_lock_irqsave() interface is (even if
"spin_lock_irqsave()" itself for historical reasons uses that inline
asm-like "pass argument by reference *without* using a pointer")
And gaah, we should have made 'flags' be a real type long ago, but I
guess 'unsigned long' is too ingrained and traditional to change that
now.
Linus
Hello,
On Wed, Nov 30, 2022 at 08:35:13AM -0800, Linus Torvalds wrote:
> On Wed, Nov 30, 2022 at 12:23 AM Tejun Heo <[email protected]> wrote:
> >
> >
> > static inline void rht_lock(struct bucket_table *tbl,
> > - struct rhash_lock_head __rcu **bkt)
> > + struct rhash_lock_head __rcu **bkt,
> > + unsigned long *flags)
>
> I guess it doesn't matter as long as this actually gets inlined, but
> wouldn't it be better to have
>
> flags = rht_lock(..);
> ...
> rht_unlock(.., flags);
>
> as the calling convention? Rather than passing a pointer to the stack around.
Sure thing.
> That's what the native _raw_spin_lock_irqsave() interface is (even if
> "spin_lock_irqsave()" itself for historical reasons uses that inline
> asm-like "pass argument by reference *without* using a pointer")
Yeah, it always feels kinda weird to wrap irqsave/restore due to the special
reference passing.
> And gaah, we should have made 'flags' be a real type long ago, but I
> guess 'unsigned long' is too ingrained and traditional to change that
> now.
Hahaha, that's gonna be an epic patchset.
Thanks.
--
tejun
hi -
On 11/30/22 03:22, Tejun Heo wrote:
[...]
> +static bool consume_dispatch_q(struct rq *rq, struct rq_flags *rf,
> + struct scx_dispatch_q *dsq)
> +{
> + struct scx_rq *scx_rq = &rq->scx;
> + struct task_struct *p;
> + struct rq *task_rq;
> + bool moved = false;
> +retry:
> + if (list_empty(&dsq->fifo))
> + return false;
> +
> + raw_spin_lock(&dsq->lock);
> + list_for_each_entry(p, &dsq->fifo, scx.dsq_node) {
> + task_rq = task_rq(p);
> + if (rq == task_rq)
> + goto this_rq;
> + if (likely(rq->online) && !is_migration_disabled(p) &&
> + cpumask_test_cpu(cpu_of(rq), p->cpus_ptr))
> + goto remote_rq;
> + }
> + raw_spin_unlock(&dsq->lock);
> + return false;
> +
> +this_rq:
> + /* @dsq is locked and @p is on this rq */
> + WARN_ON_ONCE(p->scx.holding_cpu >= 0);
> + list_move_tail(&p->scx.dsq_node, &scx_rq->local_dsq.fifo);
> + dsq->nr--;
> + scx_rq->local_dsq.nr++;
> + p->scx.dsq = &scx_rq->local_dsq;
> + raw_spin_unlock(&dsq->lock);
> + return true;
> +
> +remote_rq:
> +#ifdef CONFIG_SMP
> + /*
> + * @dsq is locked and @p is on a remote rq. @p is currently protected by
> + * @dsq->lock. We want to pull @p to @rq but may deadlock if we grab
> + * @task_rq while holding @dsq and @rq locks. As dequeue can't drop the
> + * rq lock or fail, do a little dancing from our side. See
> + * move_task_to_local_dsq().
> + */
> + WARN_ON_ONCE(p->scx.holding_cpu >= 0);
> + list_del_init(&p->scx.dsq_node);
> + dsq->nr--;
> + p->scx.holding_cpu = raw_smp_processor_id();
> + raw_spin_unlock(&dsq->lock);
> +
> + rq_unpin_lock(rq, rf);
> + double_lock_balance(rq, task_rq);
> + rq_repin_lock(rq, rf);
> +
> + moved = move_task_to_local_dsq(rq, p);
you might be able to avoid the double_lock_balance() by using
move_queued_task(), which internally hands off the old rq lock and
returns with the new rq lock.
the pattern for consume_dispatch_q() would be something like:
- kfunc from bpf, with this_rq lock held
- notice p isn't on this_rq, goto remote_rq:
- do sched_ext accounting, like the this_rq->dsq->nr--
- unlock this_rq
- p_rq = task_rq_lock(p)
- double_check p->rq didn't change to this_rq during that unlock
- new_rq = move_queued_task(p_rq, rf, p, new_cpu)
- do sched_ext accounting like new_rq->dsq->nr++
- unlock new_rq
- relock the original this_rq
- return to bpf
you still end up grabbing both locks, but just not at the same time.
plus, task_rq_lock() takes the guesswork out of whether you're getting
p's rq lock or not. it looks like you're using the holding_cpu to
handle the race where p moves cpus after you read task_rq(p) but before
you lock that task_rq. maybe you can drop the whole concept of the
holding_cpu?
thanks,
barret
> +
> + double_unlock_balance(rq, task_rq);
> +#endif /* CONFIG_SMP */
> + if (likely(moved))
> + return true;
> + goto retry;
> +}
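The hand-off ordering proposed above can be walked through with a small single-threaded model. It is a sketch of the lock ordering only, not a concurrent implementation: locks are plain flags, the demo_* names are hypothetical, the helper signatures do not match the kernel's task_rq_lock() or move_queued_task(), and the sched_ext dsq accounting and TASK_ON_RQ_MIGRATING handling are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Instrumentation: how many modelled locks are held right now, and the peak. */
static int demo_held, demo_max_held;

struct demo_rq {
	bool locked;
	int nr;			/* tasks queued on this rq */
};

struct demo_task {
	struct demo_rq *rq;	/* rq the task is currently queued on */
};

static void demo_rq_lock(struct demo_rq *rq)
{
	assert(!rq->locked);
	rq->locked = true;
	if (++demo_held > demo_max_held)
		demo_max_held = demo_held;
}

static void demo_rq_unlock(struct demo_rq *rq)
{
	assert(rq->locked);
	rq->locked = false;
	demo_held--;
}

/* Lock the rq @p is on, re-checking after the lock is taken. */
static struct demo_rq *demo_task_rq_lock(struct demo_task *p)
{
	for (;;) {
		struct demo_rq *rq = p->rq;

		demo_rq_lock(rq);
		if (rq == p->rq)
			return rq;
		demo_rq_unlock(rq);	/* @p moved in the meantime, retry */
	}
}

/* Dequeue from @src, drop its lock, take @dst's lock, enqueue there. */
static struct demo_rq *demo_move_queued_task(struct demo_rq *src,
					     struct demo_task *p,
					     struct demo_rq *dst)
{
	src->nr--;
	p->rq = dst;
	demo_rq_unlock(src);
	demo_rq_lock(dst);
	dst->nr++;
	return dst;
}

/* Entered and left with @this_rq locked; returns true once @p is on it. */
static bool demo_consume_remote(struct demo_rq *this_rq, struct demo_task *p)
{
	struct demo_rq *p_rq;

	demo_rq_unlock(this_rq);
	p_rq = demo_task_rq_lock(p);
	if (p_rq != this_rq)	/* otherwise @p already arrived on its own */
		p_rq = demo_move_queued_task(p_rq, p, this_rq);
	return p_rq == this_rq;
}
```

In this model both rq locks are taken during a pull, but demo_max_held never exceeds one, which is the property the suggestion is after.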
Hello,
On Fri, Dec 02, 2022 at 12:08:27PM -0500, Barret Rhoden wrote:
> you might be able to avoid the double_lock_balance() by using
> move_queued_task(), which internally hands off the old rq lock and returns
> with the new rq lock.
>
> the pattern for consume_dispatch_q() would be something like:
>
> - kfunc from bpf, with this_rq lock held
> - notice p isn't on this_rq, goto remote_rq:
> - do sched_ext accounting, like the this_rq->dsq->nr--
> - unlock this_rq
> - p_rq = task_rq_lock(p)
> - double_check p->rq didn't change to this_rq during that unlock
> - new_rq = move_queued_task(p_rq, rf, p, new_cpu)
> - do sched_ext accounting like new_rq->dsq->nr++
> - unlock new_rq
> - relock the original this_rq
> - return to bpf
>
> you still end up grabbing both locks, but just not at the same time.
Yeah, this probably would look better than the current double lock dancing,
especially in the finish_dispatch() path.
> plus, task_rq_lock() takes the guesswork out of whether you're getting p's
> rq lock or not. it looks like you're using the holding_cpu to handle the
> race where p moves cpus after you read task_rq(p) but before you lock that
> task_rq. maybe you can drop the whole concept of the holding_cpu?
->holding_cpu is there to basically detect intervening dequeues, so if we
lock them out with TASK_ON_RQ_MIGRATING, we might be able to drop it. I need
to look into it more tho. Things get pretty subtle around there, so I could
easily be missing something. I'll try this and let you know how it goes.
Thanks.
--
tejun
Hello, Barret.
On Fri, Dec 02, 2022 at 08:01:37AM -1000, Tejun Heo wrote:
> > you still end up grabbing both locks, but just not at the same time.
>
> Yeah, this probably would look better than the current double lock dancing,
> especially in the finish_dispatch() path.
>
> > plus, task_rq_lock() takes the guesswork out of whether you're getting p's
> > rq lock or not. it looks like you're using the holding_cpu to handle the
> > race where p moves cpus after you read task_rq(p) but before you lock that
> > task_rq. maybe you can drop the whole concept of the holding_cpu?
>
> ->holding_cpu is there to basically detect intervening dequeues, so if we
> lock them out with TASK_ON_RQ_MIGRATING, we might be able to drop it. I need
> to look into it more tho. Things get pretty subtle around there, so I could
> easily be missing something. I'll try this and let you know how it goes.
I tried both and I'm pretty ambivalent. The problem is that the
finish_dispatch() path can't use TASK_ON_RQ_MIGRATING the way the consume
path does because the dispatch path isn't starting with the task locked. The
only claim it has to the task is through p->scx.ops_state.
It can be argued that getting rid of double locking is still nice but given
that holding_cpu is needed anyway and can play the same role, I'm not sure
how attractive it is. I suppose we can go either way.
Thanks.
--
tejun
On Tue, Nov 29, 2022 at 10:22:56PM -1000, Tejun Heo wrote:
> + /**
> + * dispatch - Dispatch tasks from the BPF scheduler into dsq's
> + * @cpu: CPU to dispatch tasks for
> + * @prev: previous task being switched out
> + *
> + * Called when a CPU can't find a task to execute after ops.consume().
> + * The operation should dispatch one or more tasks from the BPF
> + * scheduler to the dsq's using scx_bpf_dispatch(). The maximum number
> + * of tasks which can be dispatched in a single call is specified by the
> + * @dispatch_max_batch field of this struct.
> + */
> + void (*dispatch)(s32 cpu, struct task_struct *prev);
> +
> + /**
> + * consume - Consume tasks from the dsq's to the local dsq for execution
> + * @cpu: CPU to consume tasks for
> + *
> + * Called when a CPU's local dsq is empty. The operation should transfer
> + * one or more tasks from the dsq's to the CPU's local dsq using
> + * scx_bpf_consume(). If this function fails to fill the local dsq,
> + * ops.dispatch() will be called.
> + *
> + * This operation is unnecessary if the BPF scheduler always dispatches
> + * either to one of the local dsq's or the global dsq. If implemented,
> + * this operation is also responsible for consuming the global_dsq.
> + */
> + void (*consume)(s32 cpu);
> +
> + /**
> + * consume_final - Final consume call before going idle
> + * @cpu: CPU to consume tasks for
> + *
> + * After ops.consume() and .dispatch(), @cpu still doesn't have a task
> + * to execute and is about to go idle. This operation can be used to
> + * implement more aggressive consumption strategies. Otherwise
> + * equivalent to ops.consume().
> + */
> + void (*consume_final)(s32 cpu);
Doesn't really change the big picture but I ended up merging
ops.consume[_final]() into ops.dispatch() which should make the dispatch
path both simpler and more flexible.
Thanks.
--
tejun
rhashtable currently only does bh-safe synchronization making it impossible
to use from irq-safe contexts. Switch it to use irq-safe synchronization to
remove the restriction.
v2: Update the lock functions to return the ulong flags value and unlock
functions to take the value directly instead of passing around the
pointer. Suggested by Linus.
Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
Cc: Linus Torvalds <[email protected]>
---
include/linux/rhashtable.h | 61 ++++++++++++++++++++++++++-------------------
lib/rhashtable.c | 16 +++++++----
2 files changed, 46 insertions(+), 31 deletions(-)
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -323,29 +323,36 @@ static inline struct rhash_lock_head __r
* When we write to a bucket without unlocking, we use rht_assign_locked().
*/
-static inline void rht_lock(struct bucket_table *tbl,
- struct rhash_lock_head __rcu **bkt)
+static inline unsigned long rht_lock(struct bucket_table *tbl,
+ struct rhash_lock_head __rcu **bkt)
{
- local_bh_disable();
+ unsigned long flags;
+
+ local_irq_save(flags);
bit_spin_lock(0, (unsigned long *)bkt);
lock_map_acquire(&tbl->dep_map);
+ return flags;
}
-static inline void rht_lock_nested(struct bucket_table *tbl,
- struct rhash_lock_head __rcu **bucket,
- unsigned int subclass)
+static inline unsigned long rht_lock_nested(struct bucket_table *tbl,
+ struct rhash_lock_head __rcu **bucket,
+ unsigned int subclass)
{
- local_bh_disable();
+ unsigned long flags;
+
+ local_irq_save(flags);
bit_spin_lock(0, (unsigned long *)bucket);
lock_acquire_exclusive(&tbl->dep_map, subclass, 0, NULL, _THIS_IP_);
+ return flags;
}
static inline void rht_unlock(struct bucket_table *tbl,
- struct rhash_lock_head __rcu **bkt)
+ struct rhash_lock_head __rcu **bkt,
+ unsigned long flags)
{
lock_map_release(&tbl->dep_map);
bit_spin_unlock(0, (unsigned long *)bkt);
- local_bh_enable();
+ local_irq_restore(flags);
}
static inline struct rhash_head *__rht_ptr(
@@ -393,7 +400,8 @@ static inline void rht_assign_locked(str
static inline void rht_assign_unlock(struct bucket_table *tbl,
struct rhash_lock_head __rcu **bkt,
- struct rhash_head *obj)
+ struct rhash_head *obj,
+ unsigned long flags)
{
if (rht_is_a_nulls(obj))
obj = NULL;
@@ -401,7 +409,7 @@ static inline void rht_assign_unlock(str
rcu_assign_pointer(*bkt, (void *)obj);
preempt_enable();
__release(bitlock);
- local_bh_enable();
+ local_irq_restore(flags);
}
/**
@@ -706,6 +714,7 @@ static inline void *__rhashtable_insert_
struct rhash_head __rcu **pprev;
struct bucket_table *tbl;
struct rhash_head *head;
+ unsigned long flags;
unsigned int hash;
int elasticity;
void *data;
@@ -720,11 +729,11 @@ static inline void *__rhashtable_insert_
if (!bkt)
goto out;
pprev = NULL;
- rht_lock(tbl, bkt);
+ flags = rht_lock(tbl, bkt);
if (unlikely(rcu_access_pointer(tbl->future_tbl))) {
slow_path:
- rht_unlock(tbl, bkt);
+ rht_unlock(tbl, bkt, flags);
rcu_read_unlock();
return rhashtable_insert_slow(ht, key, obj);
}
@@ -756,9 +765,9 @@ slow_path:
RCU_INIT_POINTER(list->rhead.next, head);
if (pprev) {
rcu_assign_pointer(*pprev, obj);
- rht_unlock(tbl, bkt);
+ rht_unlock(tbl, bkt, flags);
} else
- rht_assign_unlock(tbl, bkt, obj);
+ rht_assign_unlock(tbl, bkt, obj, flags);
data = NULL;
goto out;
}
@@ -785,7 +794,7 @@ slow_path:
}
atomic_inc(&ht->nelems);
- rht_assign_unlock(tbl, bkt, obj);
+ rht_assign_unlock(tbl, bkt, obj, flags);
if (rht_grow_above_75(ht, tbl))
schedule_work(&ht->run_work);
@@ -797,7 +806,7 @@ out:
return data;
out_unlock:
- rht_unlock(tbl, bkt);
+ rht_unlock(tbl, bkt, flags);
goto out;
}
@@ -991,6 +1000,7 @@ static inline int __rhashtable_remove_fa
struct rhash_lock_head __rcu **bkt;
struct rhash_head __rcu **pprev;
struct rhash_head *he;
+ unsigned long flags;
unsigned int hash;
int err = -ENOENT;
@@ -999,7 +1009,7 @@ static inline int __rhashtable_remove_fa
if (!bkt)
return -ENOENT;
pprev = NULL;
- rht_lock(tbl, bkt);
+ flags = rht_lock(tbl, bkt);
rht_for_each_from(he, rht_ptr(bkt, tbl, hash), tbl, hash) {
struct rhlist_head *list;
@@ -1043,14 +1053,14 @@ static inline int __rhashtable_remove_fa
if (pprev) {
rcu_assign_pointer(*pprev, obj);
- rht_unlock(tbl, bkt);
+ rht_unlock(tbl, bkt, flags);
} else {
- rht_assign_unlock(tbl, bkt, obj);
+ rht_assign_unlock(tbl, bkt, obj, flags);
}
goto unlocked;
}
- rht_unlock(tbl, bkt);
+ rht_unlock(tbl, bkt, flags);
unlocked:
if (err > 0) {
atomic_dec(&ht->nelems);
@@ -1143,6 +1153,7 @@ static inline int __rhashtable_replace_f
struct rhash_lock_head __rcu **bkt;
struct rhash_head __rcu **pprev;
struct rhash_head *he;
+ unsigned long flags;
unsigned int hash;
int err = -ENOENT;
@@ -1158,7 +1169,7 @@ static inline int __rhashtable_replace_f
return -ENOENT;
pprev = NULL;
- rht_lock(tbl, bkt);
+ flags = rht_lock(tbl, bkt);
rht_for_each_from(he, rht_ptr(bkt, tbl, hash), tbl, hash) {
if (he != obj_old) {
@@ -1169,15 +1180,15 @@ static inline int __rhashtable_replace_f
rcu_assign_pointer(obj_new->next, obj_old->next);
if (pprev) {
rcu_assign_pointer(*pprev, obj_new);
- rht_unlock(tbl, bkt);
+ rht_unlock(tbl, bkt, flags);
} else {
- rht_assign_unlock(tbl, bkt, obj_new);
+ rht_assign_unlock(tbl, bkt, obj_new, flags);
}
err = 0;
goto unlocked;
}
- rht_unlock(tbl, bkt);
+ rht_unlock(tbl, bkt, flags);
unlocked:
return err;
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -231,6 +231,7 @@ static int rhashtable_rehash_one(struct
struct rhash_head *head, *next, *entry;
struct rhash_head __rcu **pprev = NULL;
unsigned int new_hash;
+ unsigned long flags;
if (new_tbl->nest)
goto out;
@@ -253,13 +254,14 @@ static int rhashtable_rehash_one(struct
new_hash = head_hashfn(ht, new_tbl, entry);
- rht_lock_nested(new_tbl, &new_tbl->buckets[new_hash], SINGLE_DEPTH_NESTING);
+ flags = rht_lock_nested(new_tbl, &new_tbl->buckets[new_hash],
+ SINGLE_DEPTH_NESTING);
head = rht_ptr(new_tbl->buckets + new_hash, new_tbl, new_hash);
RCU_INIT_POINTER(entry->next, head);
- rht_assign_unlock(new_tbl, &new_tbl->buckets[new_hash], entry);
+ rht_assign_unlock(new_tbl, &new_tbl->buckets[new_hash], entry, flags);
if (pprev)
rcu_assign_pointer(*pprev, next);
@@ -276,18 +278,19 @@ static int rhashtable_rehash_chain(struc
{
struct bucket_table *old_tbl = rht_dereference(ht->tbl, ht);
struct rhash_lock_head __rcu **bkt = rht_bucket_var(old_tbl, old_hash);
+ unsigned long flags;
int err;
if (!bkt)
return 0;
- rht_lock(old_tbl, bkt);
+ flags = rht_lock(old_tbl, bkt);
while (!(err = rhashtable_rehash_one(ht, bkt, old_hash)))
;
if (err == -ENOENT)
err = 0;
- rht_unlock(old_tbl, bkt);
+ rht_unlock(old_tbl, bkt, flags);
return err;
}
@@ -590,6 +593,7 @@ static void *rhashtable_try_insert(struc
struct bucket_table *new_tbl;
struct bucket_table *tbl;
struct rhash_lock_head __rcu **bkt;
+ unsigned long flags;
unsigned int hash;
void *data;
@@ -607,7 +611,7 @@ static void *rhashtable_try_insert(struc
new_tbl = rht_dereference_rcu(tbl->future_tbl, ht);
data = ERR_PTR(-EAGAIN);
} else {
- rht_lock(tbl, bkt);
+ flags = rht_lock(tbl, bkt);
data = rhashtable_lookup_one(ht, bkt, tbl,
hash, key, obj);
new_tbl = rhashtable_insert_one(ht, bkt, tbl,
@@ -615,7 +619,7 @@ static void *rhashtable_try_insert(struc
if (PTR_ERR(new_tbl) != -EEXIST)
data = ERR_CAST(new_tbl);
- rht_unlock(tbl, bkt);
+ rht_unlock(tbl, bkt, flags);
}
} while (!IS_ERR_OR_NULL(new_tbl));
Hello:
This patch was applied to netdev/net-next.git (master)
by David S. Miller <[email protected]>:
On Tue, 6 Dec 2022 11:36:32 -1000 you wrote:
> rhashtable currently only does bh-safe synchronization making it impossible
> to use from irq-safe contexts. Switch it to use irq-safe synchronization to
> remove the restriction.
>
> v2: Update the lock functions to return the ulong flags value and unlock
> functions to take the value directly instead of passing around the
> pointer. Suggested by Linus.
>
> [...]
Here is the summary with links:
- [v2,01/31] rhashtable: Allow rhashtable to be used from irq-safe contexts
https://git.kernel.org/netdev/net-next/c/e47877c7aa82
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> new file mode 100644
> index 000000000000..f42464d66de4
> --- /dev/null
> +++ b/kernel/sched/ext.c
> @@ -0,0 +1,2780 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
> + * Copyright (c) 2022 Tejun Heo <[email protected]>
> + * Copyright (c) 2022 David Vernet <[email protected]>
> + */
> +#define SCX_OP_IDX(op) (offsetof(struct sched_ext_ops, op) / sizeof(void (*)(void)))
> +
> +enum scx_internal_consts {
> + SCX_NR_ONLINE_OPS = SCX_OP_IDX(init),
> + SCX_DSP_DFL_MAX_BATCH = 32,
This definition of SCX_DSP_DFL_MAX_BATCH makes the dispatch queue have size
32. The example central policy thus aborts if more than 32 tasks are woken
up at once.
julia
Hello, Julia.
On Sun, Dec 11, 2022 at 11:33:50PM +0100, Julia Lawall wrote:
> > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> > new file mode 100644
> > index 000000000000..f42464d66de4
> > --- /dev/null
> > +++ b/kernel/sched/ext.c
> > @@ -0,0 +1,2780 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
> > + * Copyright (c) 2022 Tejun Heo <[email protected]>
> > + * Copyright (c) 2022 David Vernet <[email protected]>
> > + */
> > +#define SCX_OP_IDX(op) (offsetof(struct sched_ext_ops, op) / sizeof(void (*)(void)))
> > +
> > +enum scx_internal_consts {
> > + SCX_NR_ONLINE_OPS = SCX_OP_IDX(init),
> > + SCX_DSP_DFL_MAX_BATCH = 32,
>
> This definition of SCX_DSP_DFL_MAX_BATCH makes the dispatch queue have size
> 32. The example central policy thus aborts if more than 32 tasks are woken
> up at once.
Yeah, scx_example_central needs to either set ops.dispatch_max_batch higher
according to the number of CPUs or flush and exit the loop and retry when
scx_bpf_dispatch_nr_slots() reaches zero. Will update.
Thanks.
--
tejun
On Tue, Nov 29, 2022 at 10:23:10PM -1000, Tejun Heo wrote:
> Signed-off-by: Tejun Heo <[email protected]>
> Reviewed-by: David Vernet <[email protected]>
> Acked-by: Josh Don <[email protected]>
> Acked-by: Hao Luo <[email protected]>
> Acked-by: Barret Rhoden <[email protected]>
No patch description? Really? Please write one.
For patch subject, better write "Documentation: scheduler: document
extensible scheduler class".
> +* The system integrity is maintained no matter what the BPF scheduler does.
> + The default scheduling behavior is restored anytime an error is detected,
> + a runnable task stalls, or on sysrq-S.
> +
> <snipped>...
> +Terminating the sched_ext scheduler program, triggering sysrq-S, or
> +detection of any internal error including stalled runnable tasks aborts the
> +BPF scheduler and reverts all tasks back to CFS.
IMO the reference to SysRq key can be reworded:
---- >8 ----
diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
index a2ad963b227a1b..404b820119b4a4 100644
--- a/Documentation/scheduler/sched-ext.rst
+++ b/Documentation/scheduler/sched-ext.rst
@@ -15,7 +15,8 @@ programs - the BPF scheduler.
* The system integrity is maintained no matter what the BPF scheduler does.
The default scheduling behavior is restored anytime an error is detected,
- a runnable task stalls, or on sysrq-S.
+ a runnable task stalls, or on invoking SysRq key sequence like
+ :kbd:`SysRq-s`.
Switching to and from sched_ext
===============================
@@ -35,7 +36,7 @@ case, all ``SCHED_NORMAL``, ``SCHED_BATCH``, ``SCHED_IDLE`` and
``SCHED_EXT`` tasks are scheduled by sched_ext. In the example schedulers,
this mode can be selected with the ``-a`` option.
-Terminating the sched_ext scheduler program, triggering sysrq-S, or
+Terminating the sched_ext scheduler program, triggering SysRq key, or
detection of any internal error including stalled runnable tasks aborts the
BPF scheduler and reverts all tasks back to CFS.
> +A task is always *dispatch*ed to a dsq for execution. The task starts
> +execution when a CPU *consume*s the task from the dsq.
Sphinx reported two warnings:
Documentation/scheduler/sched-ext.rst:117: WARNING: Inline emphasis start-string without end-string.
Documentation/scheduler/sched-ext.rst:117: WARNING: Inline emphasis start-string without end-string.
I have to replace with quotes (since "dispatch" and "consume" have different
meaning in this context):
---- >8 ----
diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
index 81f78e05a6c214..a2ad963b227a1b 100644
--- a/Documentation/scheduler/sched-ext.rst
+++ b/Documentation/scheduler/sched-ext.rst
@@ -114,8 +114,8 @@ there is one global FIFO (``SCX_DSQ_GLOBAL``), and one local dsq per CPU
(``SCX_DSQ_LOCAL``). The BPF scheduler can manage an arbitrary number of
dsq's using ``scx_bpf_create_dsq()`` and ``scx_bpf_destroy_dsq()``.
-A task is always *dispatch*ed to a dsq for execution. The task starts
-execution when a CPU *consume*s the task from the dsq.
+A task is always "dispatched" to a dsq for execution. The task starts
+execution when a CPU "consumes" the task from the dsq.
Internally, a CPU only executes tasks which are running on its local dsq,
and the ``.consume()`` operation is in fact a transfer of a task from a
Otherwise the doc LGTM.
Thanks.
--
An old man doll... just what I always wanted! - Clara
On Sun, 11 Dec 2022, Tejun Heo wrote:
> Hello, Julia.
>
> On Sun, Dec 11, 2022 at 11:33:50PM +0100, Julia Lawall wrote:
> > > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> > > new file mode 100644
> > > index 000000000000..f42464d66de4
> > > --- /dev/null
> > > +++ b/kernel/sched/ext.c
> > > @@ -0,0 +1,2780 @@
> > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > +/*
> > > + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
> > > + * Copyright (c) 2022 Tejun Heo <[email protected]>
> > > + * Copyright (c) 2022 David Vernet <[email protected]>
> > > + */
> > > +#define SCX_OP_IDX(op) (offsetof(struct sched_ext_ops, op) / sizeof(void (*)(void)))
> > > +
> > > +enum scx_internal_consts {
> > > + SCX_NR_ONLINE_OPS = SCX_OP_IDX(init),
> > > + SCX_DSP_DFL_MAX_BATCH = 32,
> >
> > This definition of SCX_DSP_DFL_MAX_BATCH makes the dispatch queue have size
> > 32. The example central policy thus aborts if more than 32 tasks are woken
> > up at once.
>
> Yeah, scx_example_central needs to either set ops.dispatch_max_batch higher
> according to the number of CPUs or flush and exit the loop and retry when
> scx_bpf_dispatch_nr_slots() reaches zero. Will update.
Since there could be any number of waking threads, maybe some kind of
flush and retry solution would be better?
julia
>
> Thanks.
>
> --
> tejun
>
On Mon, Dec 12, 2022 at 07:03:56AM +0100, Julia Lawall wrote:
> > Yeah, scx_example_central needs to either set ops.dispatch_max_batch higher
> > according to the number of CPUs or flush and exit the loop and retry when
> > scx_bpf_dispatch_nr_slots() reaches zero. Will update.
>
> Since there could be any number of waking threads, maybe some kind of
> flush and retry solution would be better?
Yeah, central is a bit unusual because it's scheduling for other CPUs too. In
most cases, this doesn't matter that much because whether to retry or not
can be determined by the kernel core code. There are a couple ways to go
about it. When slots run out, it can explicitly queue another scheduling
event on self, or use scx_bpf_consume() to flush the pending tasks. Either
should work but neither is particularly pretty. I'm trying to see whether I
can remove the static dispatch buffers altogether.
Thanks.
--
tejun
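The flush-and-retry idea from this subthread can be sketched with a fixed-size batch buffer in userspace C. This is a hedged illustration only: struct demo_batch and the demo_* helpers are hypothetical, DEMO_MAX_BATCH merely mirrors the value of SCX_DSP_DFL_MAX_BATCH from the quoted patch, and "flushing" is modelled as counting the tasks handed off rather than calling any sched_ext helper.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_BATCH 32	/* mirrors SCX_DSP_DFL_MAX_BATCH above */

struct demo_batch {
	int buf[DEMO_MAX_BATCH];	/* pending task ids */
	size_t len;			/* slots currently in use */
	size_t flushed;			/* total tasks handed off so far */
	size_t nr_flushes;		/* number of non-empty flushes */
};

/* Remaining room in the batch buffer. */
static size_t demo_slots_left(const struct demo_batch *b)
{
	return DEMO_MAX_BATCH - b->len;
}

/* Hand off everything that is pending and empty the buffer. */
static void demo_flush(struct demo_batch *b)
{
	if (!b->len)
		return;
	b->flushed += b->len;
	b->len = 0;
	b->nr_flushes++;
}

/*
 * Dispatch @nr task ids. When the buffer runs out of slots, flush and
 * carry on instead of overflowing, so any number of wakeups is handled.
 */
static void demo_dispatch_all(struct demo_batch *b, const int *tasks, size_t nr)
{
	for (size_t i = 0; i < nr; i++) {
		if (!demo_slots_left(b))
			demo_flush(b);
		b->buf[b->len++] = tasks[i];
	}
	demo_flush(b);
}
```

With this shape the buffer size only bounds how much work is batched between flushes, not how many tasks can be woken at once.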
Hello,
On Mon, Dec 12, 2022 at 11:01:42AM +0700, Bagas Sanjaya wrote:
> On Tue, Nov 29, 2022 at 10:23:10PM -1000, Tejun Heo wrote:
> > Signed-off-by: Tejun Heo <[email protected]>
> > Reviewed-by: David Vernet <[email protected]>
> > Acked-by: Josh Don <[email protected]>
> > Acked-by: Hao Luo <[email protected]>
> > Acked-by: Barret Rhoden <[email protected]>
>
> No patch description? Really? Please write one.
That's unnecessarily grating. I can add some blurb but here's an honest
question. What pertinent information would the description contain that
shouldn't be in the doc itself?
> For patch subject, better write "Documentation: scheduler: document
> extensible scheduler class".
Sounds good.
> ---- >8 ----
> diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
> index a2ad963b227a1b..404b820119b4a4 100644
> --- a/Documentation/scheduler/sched-ext.rst
> +++ b/Documentation/scheduler/sched-ext.rst
> @@ -15,7 +15,8 @@ programs - the BPF scheduler.
>
> * The system integrity is maintained no matter what the BPF scheduler does.
> The default scheduling behavior is restored anytime an error is detected,
> - a runnable task stalls, or on sysrq-S.
> + a runnable task stalls, or on invoking SysRq key sequence like
> + :kbd:`SysRq-s`.
>
> Switching to and from sched_ext
> ===============================
> @@ -35,7 +36,7 @@ case, all ``SCHED_NORMAL``, ``SCHED_BATCH``, ``SCHED_IDLE`` and
> ``SCHED_EXT`` tasks are scheduled by sched_ext. In the example schedulers,
> this mode can be selected with the ``-a`` option.
>
> -Terminating the sched_ext scheduler program, triggering sysrq-S, or
> +Terminating the sched_ext scheduler program, triggering SysRq key, or
> detection of any internal error including stalled runnable tasks aborts the
> BPF scheduler and reverts all tasks back to CFS.
>
>
> > +A task is always *dispatch*ed to a dsq for execution. The task starts
> > +execution when a CPU *consume*s the task from the dsq.
>
> Sphinx reported two warnings:
>
> Documentation/scheduler/sched-ext.rst:117: WARNING: Inline emphasis start-string without end-string.
> Documentation/scheduler/sched-ext.rst:117: WARNING: Inline emphasis start-string without end-string.
>
> I have to replace with quotes (since "dispatch" and "consume" have different
> meaning in this context):
>
> ---- >8 ----
> diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
> index 81f78e05a6c214..a2ad963b227a1b 100644
> --- a/Documentation/scheduler/sched-ext.rst
> +++ b/Documentation/scheduler/sched-ext.rst
> @@ -114,8 +114,8 @@ there is one global FIFO (``SCX_DSQ_GLOBAL``), and one local dsq per CPU
> (``SCX_DSQ_LOCAL``). The BPF scheduler can manage an arbitrary number of
> dsq's using ``scx_bpf_create_dsq()`` and ``scx_bpf_destroy_dsq()``.
>
> -A task is always *dispatch*ed to a dsq for execution. The task starts
> -execution when a CPU *consume*s the task from the dsq.
> +A task is always "dispatched" to a dsq for execution. The task starts
> +execution when a CPU "consumes" the task from the dsq.
>
> Internally, a CPU only executes tasks which are running on its local dsq,
> and the ``.consume()`` operation is in fact a transfer of a task from a
>
> Otherwise the doc LGTM.
Will apply the suggested changes.
Thanks.
--
tejun
On Tue, Nov 29, 2022 at 10:22:42PM -1000, Tejun Heo wrote:
> Core scheduling is an example of a feature that took a significant amount of
> time and effort to integrate into the kernel.
Mostly because I dropped it on the floor once I heard about MDS. That
made me lose interest entirely. The only reason it eventually happened
was ChromeOS (Joel) pushing for it again.
> Part of the difficulty with core
> scheduling was the inherent mismatch in abstraction between the desire to
> perform core-wide scheduling, and the per-cpu design of the kernel scheduler.
Not really; the main difficulty was due to me wanting to do it outside
of the scheduling classes so that it fundamentally covers all of them.
Doing it inside a class (say CFS) would've made it significantly simpler.
> This caused issues, for example ensuring proper fairness between the
> independent runqueues of SMT siblings.
Inter-runqueue fairness is a known issue of CFS and quite independent of
core scheduling.
Anyway, I hate all of this. Linus NAK'ed loadable schedulers a number of
times in the past and this is just that again -- with the extra downside
of the whole BPF thing on top :/
You look to be exposing a ton of stuff I've so far even refused
tracepoints for :-(
Anyway, I'm just back from a heavy dose of Covid and still taking it
easy, but I'll go read through the whole thing, hopefully I'll finish
before vanishing again for the x-mas break.
On Tue, Nov 29, 2022 at 10:22:42PM -1000, Tejun Heo wrote:
> Rolling out kernel upgrades is a slow and iterative process. At a large scale
> it can take months to roll a new kernel out to a fleet of servers. While this
> latency is expected and inevitable for normal kernel upgrades, it can become
> highly problematic when kernel changes are required to fix bugs. Livepatch [9]
> is available to quickly roll out critical security fixes to large fleets, but
> the scope of changes that can be applied with livepatching is fairly limited,
> and would likely not be usable for patching scheduling policies. With
> sched_ext, new scheduling policies can be rapidly rolled out to production
> environments.
I don't think we can or should use this argument to push BPF into ever
more places.
On Tue, Nov 29, 2022 at 10:22:48PM -1000, Tejun Heo wrote:
> When a task switches to a new sched_class, the prev and new classes are
> notified through ->switched_from() and ->switched_to(), respectively, after
> the switching is done. However, a new sched_class needs to prepare the task
> state before it is enqueued on the new class for the first time.
How and why isn't sched_fork() sufficient?
On Tue, Nov 29, 2022 at 10:22:52PM -1000, Tejun Heo wrote:
> ->rq_{on|off}line are called either during CPU hotplug or cpuset partition
> updates. Let's add an argument to distinguish the two cases. The argument
> will be used by a new sched_class to track CPU hotplug events.
Again, utter lack of why.
On Tue, Nov 29, 2022 at 10:22:53PM -1000, Tejun Heo wrote:
> sched_move_task() can be called for both cgroup and autogroup moves. Add a
> parameter to distinguish the two cases. This will be used by a new
> sched_class to track cgroup migrations.
This all seems pointless, you can trivially distinguish a
cgroup/autogroup task_group if you so want (again for unspecified
raisins).
On Tue, Nov 29, 2022 at 10:22:56PM -1000, Tejun Heo wrote:
> @@ -11242,3 +11268,38 @@ void call_trace_sched_update_nr_running(struct rq *rq, int count)
> {
> trace_sched_update_nr_running_tp(rq, count);
> }
> +
> +#ifdef CONFIG_SCHED_CLASS_EXT
> +void sched_deq_and_put_task(struct task_struct *p, int queue_flags,
> + struct sched_enq_and_set_ctx *ctx)
> +{
> + struct rq *rq = task_rq(p);
> +
> + lockdep_assert_rq_held(rq);
> +
> + *ctx = (struct sched_enq_and_set_ctx){
> + .p = p,
> + .queue_flags = queue_flags | DEQUEUE_NOCLOCK,
> + .queued = task_on_rq_queued(p),
> + .running = task_current(rq, p),
> + };
> +
> + update_rq_clock(rq);
> + if (ctx->queued)
> + dequeue_task(rq, p, queue_flags);
> + if (ctx->running)
> + put_prev_task(rq, p);
> +}
> +
> +void sched_enq_and_set_task(struct sched_enq_and_set_ctx *ctx)
> +{
> + struct rq *rq = task_rq(ctx->p);
> +
> + lockdep_assert_rq_held(rq);
> +
> + if (ctx->queued)
> + enqueue_task(rq, ctx->p, ctx->queue_flags);
> + if (ctx->running)
> + set_next_task(rq, ctx->p);
> +}
> +#endif /* CONFIG_SCHED_CLASS_EXT */
So no. Like the whole __setscheduler_prio() thing, this doesn't make
sense outside of the core code, policy/class code should never need to
do this.
Also: https://lkml.kernel.org/r/[email protected]
On Tue, Nov 29, 2022 at 10:23:10PM -1000, Tejun Heo wrote:
If you expect me to read this, please provide something readable,
not markup gibberish.
On 12/12/22 13:28, Tejun Heo wrote:
>> No patch description? Really? Please write one.
>
> That's unnecessarily grating. I can add some blurb but here's an honest
> question. What pertinent information would the description contain that
> shouldn't be in the doc itself?
>
Sorry I don't know the answer, but the description should at least
expand from the patch subject.
Thanks anyway.
--
An old man doll... just what I always wanted! - Clara
On Tue, Nov 29, 2022 at 10:22:56PM -1000, Tejun Heo wrote:
> diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
> index d06ada2341cb..cfbfc47692eb 100644
> --- a/include/asm-generic/vmlinux.lds.h
> +++ b/include/asm-generic/vmlinux.lds.h
> @@ -131,6 +131,7 @@
> *(__dl_sched_class) \
> *(__rt_sched_class) \
> *(__fair_sched_class) \
> + *(__ext_sched_class) \
> *(__idle_sched_class) \
> __sched_class_lowest = .;
>
> @@ -9654,8 +9675,13 @@ void __init sched_init(void)
> int i;
>
> /* Make sure the linker didn't screw up */
> - BUG_ON(&idle_sched_class != &fair_sched_class + 1 ||
> - &fair_sched_class != &rt_sched_class + 1 ||
> +#ifdef CONFIG_SCHED_CLASS_EXT
> + BUG_ON(&idle_sched_class != &ext_sched_class + 1 ||
> + &ext_sched_class != &fair_sched_class + 1);
> +#else
> + BUG_ON(&idle_sched_class != &fair_sched_class + 1);
> +#endif
> + BUG_ON(&fair_sched_class != &rt_sched_class + 1 ||
> &rt_sched_class != &dl_sched_class + 1);
> #ifdef CONFIG_SMP
> BUG_ON(&dl_sched_class != &stop_sched_class + 1);
Perhaps the saner way to write this is:
#ifdef CONFIG_SMP
BUG_ON(!sched_class_above(&stop_sched_class, &dl_sched_class));
#endif
BUG_ON(!sched_class_above(&dl_sched_class, &rt_sched_class));
BUG_ON(!sched_class_above(&rt_sched_class, &fair_sched_class));
BUG_ON(!sched_class_above(&fair_sched_class, &idle_sched_class));
#ifdef CONFIG_...
BUG_ON(!sched_class_above(&fair_sched_class, &ext_sched_class));
BUG_ON(!sched_class_above(&ext_sched_class, &idle_sched_class));
#endif
> +static inline const struct sched_class *next_active_class(const struct sched_class *class)
> +{
> + class++;
> + if (!scx_enabled() && class == &ext_sched_class)
> + class++;
> + return class;
> +}
> +
> +#define for_active_class_range(class, _from, _to) \
> + for (class = (_from); class != (_to); class = next_active_class(class))
> +
> +#define for_each_active_class(class) \
> + for_active_class_range(class, __sched_class_highest, __sched_class_lowest)
> +
> +/*
> + * SCX requires a balance() call before every pick_next_task() call including
> + * when waking up from idle.
> + */
> +#define for_balance_class_range(class, prev_class, end_class) \
> + for_active_class_range(class, (prev_class) > &ext_sched_class ? \
> + &ext_sched_class : (prev_class), (end_class))
> +
This seems quite insane; why not simply make the ext methods effective
no-ops? Both balance and pick explicitly support that already, no?
> @@ -5800,10 +5812,13 @@ static void put_prev_task_balance(struct rq *rq, struct task_struct *prev,
> * We can terminate the balance pass as soon as we know there is
> * a runnable task of @class priority or higher.
> */
> - for_class_range(class, prev->sched_class, &idle_sched_class) {
> + for_balance_class_range(class, prev->sched_class, &idle_sched_class) {
> if (class->balance(rq, prev, rf))
> break;
> }
> +#else
> + /* SCX needs the balance call even in UP, call it explicitly */
This, *WHY* !?!
> + balance_scx_on_up(rq, prev, rf);
> #endif
>
> put_prev_task(rq, prev);
> @@ -5818,6 +5833,9 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> const struct sched_class *class;
> struct task_struct *p;
>
> + if (scx_enabled())
> + goto restart;
> +
> /*
> * Optimization: we know that if all tasks are in the fair class we can
> * call that function directly, but only if the @prev task wasn't of a
> @@ -5843,7 +5861,7 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> restart:
> put_prev_task_balance(rq, prev, rf);
>
> - for_each_class(class) {
> + for_each_active_class(class) {
> p = class->pick_next_task(rq);
> if (p)
> return p;
> @@ -5876,7 +5894,7 @@ static inline struct task_struct *pick_task(struct rq *rq)
> const struct sched_class *class;
> struct task_struct *p;
>
> - for_each_class(class) {
> + for_each_active_class(class) {
> p = class->pick_task(rq);
> if (p)
> return p;
But this.. afaict that means that:
- the whole EXT thing is incompatible with SCHED_CORE
- the whole EXT thing can be trivially starved by the presence of a
single CFS/BATCH/IDLE task.
Both seem like deal breakers.
On Tue, Nov 29, 2022 at 10:23:13PM -1000, Tejun Heo wrote:
> From: Dan Schatzberg <[email protected]>
>
> Atropos is a multi-domain BPF / userspace hybrid scheduler where the BPF
> part does simple round robin in each domain and the userspace part
> calculates the load factor of each domain and tells the BPF part how to load
> balance the domains.
>
> This scheduler demonstrates dividing scheduling logic between BPF and
> userspace and using rust to build the userspace part.
And here I am, speaking neither Rust nor BPF.
But really, having seen some of this I long for the UMCG patches -- that
at least was somewhat sane and trivially composes, unlike all this
madness.
Hello,
On Mon, Dec 12, 2022 at 01:39:04PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 29, 2022 at 10:23:10PM -1000, Tejun Heo wrote:
>
> If you expect me to read this, please provide something readable,
> not markup gibberish.
Hmmm... Everything under Documentation/scheduler is in rst markup. I don't
care about the file format. If plain text is preferable, that's fine but
that'd look odd in that directory.
For your reading convenience, the following is the formatted output:
https://github.com/htejun/sched_ext/blob/sched_ext/Documentation/scheduler/sched-ext.rst
and the following is the plain text version with markups stripped out. FWIW,
while the double-backtick quoting is a bit distracting, there isn't a whole
lot of markup to strip out.
==========================
Extensible Scheduler Class
==========================
sched_ext is a scheduler class whose behavior can be defined by a set of BPF
programs - the BPF scheduler.
* sched_ext exports a full scheduling interface so that any scheduling
algorithm can be implemented on top.
* The BPF scheduler can group CPUs however it sees fit and schedule them
together, as tasks aren't tied to specific CPUs at the time of wakeup.
* The BPF scheduler can be turned on and off dynamically anytime.
* The system integrity is maintained no matter what the BPF scheduler does.
The default scheduling behavior is restored anytime an error is detected,
a runnable task stalls, or a SysRq key sequence such as `SysRq-s` is invoked.
Switching to and from sched_ext
===============================
CONFIG_SCHED_CLASS_EXT is the config option to enable sched_ext and
tools/sched_ext contains the example schedulers.
sched_ext is used only when the BPF scheduler is loaded and running.
If a task explicitly sets its scheduling policy to SCHED_EXT, it will be
treated as SCHED_NORMAL and scheduled by CFS until the BPF scheduler is
loaded. On load, such tasks will be switched to and scheduled by sched_ext.
The BPF scheduler can choose to schedule all normal and lower class tasks by
calling scx_bpf_switch_all() from its init() operation. In this case, all
SCHED_NORMAL, SCHED_BATCH, SCHED_IDLE and SCHED_EXT tasks are scheduled by
sched_ext. In the example schedulers, this mode can be selected with the -a
option.
Terminating the sched_ext scheduler program, triggering the SysRq key
sequence, or detecting any internal error, including stalled runnable tasks,
aborts the BPF scheduler and reverts all tasks back to CFS.
# make -j16 -C tools/sched_ext
# tools/sched_ext/scx_example_dummy -a
local=0 global=3
local=5 global=24
local=9 global=44
local=13 global=56
local=17 global=72
^CEXIT: BPF scheduler unregistered
If CONFIG_SCHED_DEBUG is set, the current status of the BPF scheduler and
whether a given task is on sched_ext can be determined as follows:
# cat /sys/kernel/debug/sched/ext
ops : dummy
enabled : 1
switching_all : 1
switched_all : 1
enable_state : enabled
# grep ext /proc/self/sched
ext.enabled : 1
The Basics
==========
Userspace can implement an arbitrary BPF scheduler by loading a set of BPF
programs that implement struct sched_ext_ops. The only mandatory field is
ops.name which must be a valid BPF object name. All operations are optional.
The following modified excerpt is from tools/sched_ext/scx_example_dummy.bpf.c
showing a minimal global FIFO scheduler.
s32 BPF_STRUCT_OPS(dummy_init)
{
if (switch_all)
scx_bpf_switch_all();
return 0;
}
void BPF_STRUCT_OPS(dummy_enqueue, struct task_struct *p, u64 enq_flags)
{
if (enq_flags & SCX_ENQ_LOCAL)
scx_bpf_dispatch(p, SCX_DSQ_LOCAL, enq_flags);
else
scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, enq_flags);
}
void BPF_STRUCT_OPS(dummy_exit, struct scx_exit_info *ei)
{
exit_type = ei->type;
}
SEC(".struct_ops")
struct sched_ext_ops dummy_ops = {
.enqueue = (void *)dummy_enqueue,
.init = (void *)dummy_init,
.exit = (void *)dummy_exit,
.name = "dummy",
};
Dispatch Queues
---------------
To match the impedance between the scheduler core and the BPF scheduler,
sched_ext uses simple FIFOs called DSQs (dispatch queues). By default, there
is one global FIFO (SCX_DSQ_GLOBAL), and one local DSQ per CPU
(SCX_DSQ_LOCAL). The BPF scheduler can manage an arbitrary number of DSQs
using scx_bpf_create_dsq() and scx_bpf_destroy_dsq().
A CPU always executes a task from its local DSQ. A task is "dispatched" to a
DSQ. A non-local DSQ is "consumed" to transfer a task to the consuming CPU's
local DSQ.
When a CPU is looking for the next task to run, if the local DSQ is not
empty, the first task is picked. Otherwise, the CPU tries to consume the
global DSQ. If that doesn't yield a runnable task either, ops.dispatch() is
invoked.
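The lookup order above can be sketched in user space as follows. This is only an illustrative model, not kernel code: tasks are plain non-negative integers, and `struct dsq`, `dsq_push()`, `dsq_pop()`, `dsq_consume()`, `pick_next()` and the `dispatch_fn` callback are hypothetical names standing in for the real DSQ machinery and ops.dispatch().

```c
#include <assert.h>
#include <stddef.h>

#define DSQ_CAP 16

/* Hypothetical fixed-size FIFO standing in for a DSQ; tasks are plain ints. */
struct dsq {
	int tasks[DSQ_CAP];
	size_t head, len;
};

/* Append a task to the tail; returns -1 if the queue is full. */
static int dsq_push(struct dsq *q, int task)
{
	if (q->len == DSQ_CAP)
		return -1;
	q->tasks[(q->head + q->len) % DSQ_CAP] = task;
	q->len++;
	return 0;
}

/* Remove and return the oldest task, or -1 if the queue is empty. */
static int dsq_pop(struct dsq *q)
{
	int task;

	if (q->len == 0)
		return -1;
	task = q->tasks[q->head];
	q->head = (q->head + 1) % DSQ_CAP;
	q->len--;
	return task;
}

/* "Consume": move one task from a non-local DSQ into the local DSQ. */
static int dsq_consume(struct dsq *local, struct dsq *src)
{
	int task;

	if (local->len == DSQ_CAP)
		return 0;
	task = dsq_pop(src);
	if (task < 0)
		return 0;
	dsq_push(local, task);
	return 1;
}

/* Stand-in for ops.dispatch(): may push tasks into the local DSQ. */
typedef void (*dispatch_fn)(struct dsq *local);

/* Lookup order: local DSQ, then global DSQ, then the dispatch callback. */
static int pick_next(struct dsq *local, struct dsq *global, dispatch_fn dispatch)
{
	if (local->len == 0 && !dsq_consume(local, global) && dispatch)
		dispatch(local);
	return dsq_pop(local);
}
```

A CPU with a non-empty local queue never touches the global queue in this model, which is the property the paragraph above describes.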
Scheduling Cycle
----------------
The following briefly shows how a waking task is scheduled and executed.
1. When a task is waking up, ops.select_cpu() is the first operation
invoked. This serves two purposes. First, it provides a CPU selection
optimization hint. Second, it wakes up the selected CPU if idle.
The CPU selected by ops.select_cpu() is an optimization hint and not
binding. The actual decision is made at the last step of scheduling.
However, there is a small performance gain if the CPU ops.select_cpu()
returns matches the CPU the task eventually runs on.
A side-effect of selecting a CPU is waking it up from idle. While a BPF
scheduler can wake up any CPU using the scx_bpf_kick_cpu() helper, using
ops.select_cpu() judiciously can be simpler and more efficient.
Note that the scheduler core will ignore an invalid CPU selection, for
example, if it's outside the allowed cpumask of the task.
2. Once the target CPU is selected, ops.enqueue() is invoked. It can make
one of the following decisions:
* Immediately dispatch the task to either the global or local DSQ by
calling scx_bpf_dispatch() with SCX_DSQ_GLOBAL or SCX_DSQ_LOCAL,
respectively.
* Immediately dispatch the task to a custom DSQ by calling
scx_bpf_dispatch() with a DSQ ID which is smaller than 2^63.
* Queue the task on the BPF side.
3. When a CPU is ready to schedule, it first looks at its local DSQ. If
empty, it then looks at the global DSQ. If there still isn't a task to
run, ops.dispatch() is invoked which can use the following two functions
to populate the local DSQ.
* scx_bpf_dispatch() dispatches a task to a DSQ. Any target DSQ can be
used - SCX_DSQ_LOCAL, SCX_DSQ_LOCAL_ON | cpu, SCX_DSQ_GLOBAL or a
custom DSQ. While scx_bpf_dispatch() currently can't be called with BPF
locks held, this is being worked on and will be supported.
scx_bpf_dispatch() schedules dispatches rather than performing them
immediately. There can be up to ops.dispatch_max_batch pending tasks.
* scx_bpf_consume() transfers a task from the specified non-local DSQ to
the dispatching DSQ. This function cannot be called with any BPF locks
held. scx_bpf_consume() flushes the pending dispatched tasks before
trying to consume the specified DSQ.
4. After ops.dispatch() returns, if there are tasks in the local DSQ, the
CPU runs the first one. If empty, the following steps are taken:
* Try to consume the global DSQ. If successful, run the task.
* If ops.dispatch() has dispatched any tasks, retry #3.
* If the previous task is an SCX task and still runnable, keep executing
it (see SCX_OPS_ENQ_LAST).
* Go idle.
Note that the BPF scheduler can always choose to dispatch tasks immediately
in ops.enqueue() as illustrated in the above dummy example. If only the
built-in DSQs are used, there is no need to implement ops.dispatch() as a
task is never queued on the BPF scheduler and both the local and global DSQs
are consumed automatically.
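The fallback order in step 4 can be written down as a small pure function. This is a sketch under stated assumptions: `struct cpu_state`, `enum next_action` and `after_dispatch()` are hypothetical names invented for illustration, and the real code works on actual queues rather than counters and flags.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcomes of step 4, in the order they are tried. */
enum next_action {
	RUN_LOCAL,      /* run the first task in the local DSQ */
	RUN_GLOBAL,     /* consume the global DSQ and run that task */
	RETRY_DISPATCH, /* go back to step 3 */
	KEEP_PREV,      /* keep executing the previous task */
	GO_IDLE,
};

/* Hypothetical snapshot of what the CPU sees after ops.dispatch() returns. */
struct cpu_state {
	unsigned int local_nr;  /* tasks in the local DSQ */
	unsigned int global_nr; /* tasks in the global DSQ */
	bool dispatched_any;    /* ops.dispatch() dispatched at least one task */
	bool prev_is_scx;       /* previous task belongs to sched_ext */
	bool prev_runnable;     /* previous task is still runnable */
};

static enum next_action after_dispatch(const struct cpu_state *s)
{
	if (s->local_nr)
		return RUN_LOCAL;
	if (s->global_nr)
		return RUN_GLOBAL;
	if (s->dispatched_any)
		return RETRY_DISPATCH;
	if (s->prev_is_scx && s->prev_runnable)
		return KEEP_PREV;
	return GO_IDLE;
}
```

Going idle is the last resort: it is reached only when both built-in DSQs are empty, the dispatch callback produced nothing, and the previous task cannot continue.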
Where to Look
=============
* include/linux/sched/ext.h defines the core data structures, ops table and
constants.
* kernel/sched/ext.c contains sched_ext core implementation and helpers. The
functions prefixed with scx_bpf_ can be called from the BPF scheduler.
* tools/sched_ext/ hosts example BPF scheduler implementations.
* scx_example_dummy[.bpf].c: Minimal global FIFO scheduler example using a
custom DSQ.
* scx_example_qmap[.bpf].c: A multi-level FIFO scheduler supporting five
levels of priority implemented with BPF_MAP_TYPE_QUEUE.
ABI Instability
===============
The APIs provided by sched_ext to BPF scheduler programs have no stability
guarantees. This includes the ops table callbacks and constants defined in
include/linux/sched/ext.h, as well as the scx_bpf_ kfuncs defined in
kernel/sched/ext.c.
While we will attempt to provide a relatively stable API surface when
possible, these APIs are subject to change without warning between kernel
versions.
Caveats
=======
* The current implementation isn't safe in that the BPF scheduler can crash
the kernel.
* Unsafe cpumask helpers should be replaced by proper generic BPF helpers.
* Currently, all kfunc helpers can be called by any operation as BPF
doesn't yet support filtering kfunc calls per struct_ops operation. Some
helpers are context sensitive and should be restricted accordingly.
* Timers used by the BPF scheduler should be shut down when aborting.
* There are a couple of BPF hacks which are still needed even for sched_ext
proper. They should be removed in the near future.
--
tejun
Hello,
On Mon, Dec 12, 2022 at 10:37:50AM +0100, Peter Zijlstra wrote:
> Anyway, I hate all of this. Linus NAK'ed loadable schedulers a number of
> times in the past and this is just that again -- with the extra downside
> of the whole BPF thing on top :/
Yeah, the whole BPF thing is what made this interesting for me and while
there still are frustrating shortcomings, BPF does make iteration and
deployment drastically safer and faster and we can try out ideas which
didn't seem feasible before with relative ease. Given the history, I suppose
some of the decision is up to Linus.
> You look to be exposing a ton of stuff I've so far even refused
> tracepoints for :-(
I tried to keep the exposure limited and am sure it can be much improved. If
you have specific ideas, I'm all ears.
> Anyway, I'm just back from a heavy dose of Covid and still taking it
> easy, but I'll go read through the whole thing, hopefully I'll finish
> before vanishing again for the x-mas break.
Mine wasn't too bad but it still sucked mightily. I wish you a quick and full
recovery and thanks for taking a look.
--
tejun
On Mon, Dec 12, 2022 at 08:07:32PM +0700, Bagas Sanjaya wrote:
> On 12/12/22 13:28, Tejun Heo wrote:
> >> No patch description? Really? Please write one.
> >
> > That's unnecessarily grating. I can add some blurb but here's an honest
> > question. What pertinent information would the description contain that
> > shouldn't be in the doc itself?
> >
>
> Sorry I don't know the answer, but the description should at least
> expand from the patch subject.
Yeah, I can put some filler text. It just feels a bit silly. Not a big deal
at all either way.
Thanks.
--
tejun
On Mon, Dec 12, 2022 at 12:28:29PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 29, 2022 at 10:22:48PM -1000, Tejun Heo wrote:
> > When a task switches to a new sched_class, the prev and new classes are
> > notified through ->switched_from() and ->switched_to(), respectively, after
> > the switching is done. However, a new sched_class needs to prepare the task
> > state before it is enqueued on the new class for the first time.
>
> How and why isn't sched_fork() sufficient?
sched_ext has callbacks which allow the BPF scheduler to keep track of
relevant task states (like priority and cpumask). Those callbacks aren't
called while a task isn't on sched_ext. When a task comes back to SCX, we
wanna tell the BPF scheduler the up-to-date state before the task gets
enqueued, hence the need for a hook which is called before the switching is
committed.
Thanks.
--
tejun
On Mon, Dec 12, 2022 at 01:00:35PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 29, 2022 at 10:22:53PM -1000, Tejun Heo wrote:
> > sched_move_task() can be called for both cgroup and autogroup moves. Add a
> > parameter to distinguish the two cases. This will be used by a new
> > sched_class to track cgroup migrations.
>
> This all seems pointless, you can trivially distinguish a
> cgroup/autogroup task_group if you so want (again for unspecified
> raisins).
Lemme add better explanations on the patches. This one, sched_ext just wants
to tell cgroup moves from autogroup ones to decide whether to invoke the BPF
scheduler's cgroup migration callback. But, yeah, you're right. It should be
able to tell that by looking at the task_group itself. Will try that.
Thanks.
--
tejun
On Mon, Dec 12, 2022 at 12:57:36PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 29, 2022 at 10:22:52PM -1000, Tejun Heo wrote:
> > ->rq_{on|off}line are called either during CPU hotplug or cpuset partition
> > updates. Let's add an argument to distinguish the two cases. The argument
> > will be used by a new sched_class to track CPU hotplug events.
>
> Again, utter lack of why.
sched_ext wants to tell the BPF scheduler about CPU hotplugs in a way that's
synchronized with rq state changes, so it needs to distinguish the two
cases. Will update the description.
Thanks.
--
tejun
Hello,
On Mon, Dec 12, 2022 at 01:31:11PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 29, 2022 at 10:22:56PM -1000, Tejun Heo wrote:
> > @@ -11242,3 +11268,38 @@ void call_trace_sched_update_nr_running(struct rq *rq, int count)
> > {
> > trace_sched_update_nr_running_tp(rq, count);
> > }
> > +
> > +#ifdef CONFIG_SCHED_CLASS_EXT
> > +void sched_deq_and_put_task(struct task_struct *p, int queue_flags,
> > + struct sched_enq_and_set_ctx *ctx)
> > +{
> > + struct rq *rq = task_rq(p);
> > +
> > + lockdep_assert_rq_held(rq);
> > +
> > + *ctx = (struct sched_enq_and_set_ctx){
> > + .p = p,
> > + .queue_flags = queue_flags | DEQUEUE_NOCLOCK,
> > + .queued = task_on_rq_queued(p),
> > + .running = task_current(rq, p),
> > + };
> > +
> > + update_rq_clock(rq);
> > + if (ctx->queued)
> > + dequeue_task(rq, p, queue_flags);
> > + if (ctx->running)
> > + put_prev_task(rq, p);
> > +}
> > +
> > +void sched_enq_and_set_task(struct sched_enq_and_set_ctx *ctx)
> > +{
> > + struct rq *rq = task_rq(ctx->p);
> > +
> > + lockdep_assert_rq_held(rq);
> > +
> > + if (ctx->queued)
> > + enqueue_task(rq, ctx->p, ctx->queue_flags);
> > + if (ctx->running)
> > + set_next_task(rq, ctx->p);
> > +}
> > +#endif /* CONFIG_SCHED_CLASS_EXT */
>
> So no. Like the whole __setscheduler_prio() thing, this doesn't make
> sense outside of the core code, policy/class code should never need to
> do this.
Continuing from the __setscheduler_prio() discussion, the need arises from
the fact that whether a task is on SCX or CFS changes depending on whether
the BPF scheduler is loaded or not - e.g. when the BPF scheduler gets
disabled, all tasks that were on it need to be moved back into CFS. There
are different ways to implement this but it needs to be solved somehow.
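To make the sequence concrete, here is a toy user-space model of the "take the task off, change its class, put it back" pattern that the two quoted helpers implement. All names (`struct task_model`, `struct change_ctx`, `deq_and_put()`, `enq_and_set()`, `switch_class()`) are hypothetical, and the model ignores rq locking and clock updates entirely.

```c
#include <assert.h>
#include <stdbool.h>

enum class_id { CLASS_CFS, CLASS_SCX };

/* Hypothetical stand-in for the bits of task state that matter here. */
struct task_model {
	enum class_id cls;
	bool queued;   /* task is on a runqueue */
	bool running;  /* task is the current task on its CPU */
};

/* Remembers what has to be restored after the class change. */
struct change_ctx {
	struct task_model *p;
	bool queued;
	bool running;
};

/* Save the task's state, then take it off the old class. */
static void deq_and_put(struct task_model *p, struct change_ctx *ctx)
{
	ctx->p = p;
	ctx->queued = p->queued;
	ctx->running = p->running;
	p->queued = false;
	p->running = false;
}

/* Put the task back, now under whatever class it was switched to. */
static void enq_and_set(const struct change_ctx *ctx)
{
	ctx->p->queued = ctx->queued;
	ctx->p->running = ctx->running;
}

static void switch_class(struct task_model *p, enum class_id new_class)
{
	struct change_ctx ctx;

	deq_and_put(p, &ctx);
	/* The class must only change while the task is off the old class. */
	assert(!p->queued && !p->running);
	p->cls = new_class;
	enq_and_set(&ctx);
}
```

Disabling the BPF scheduler would then amount to calling something like `switch_class(p, CLASS_CFS)` for every task that was on SCX.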
> Also: https://lkml.kernel.org/r/[email protected]
Yeah, it'd be nice to encapsulate this sequence. The FOR_CHANGE_GUARD naming
throws me off a bit tho as it's not really a loop. Wouldn't it be better to
call it CHANGE_GUARD_BLOCK or sth?
Thanks.
--
tejun
Peter Zijlstra wrote:
> I long for the UMCG patches -- that
> at least was somewhat sane and trivially composes, unlike all this
> madness.
A surprise, to be sure, but a welcome one!
We are in the process of finalizing UMCG internally, and I plan
to post the patches here once all reviews/testing and some preliminary
rollouts are done.
Thanks,
Peter
Hello,
On Mon, Dec 12, 2022 at 01:53:55PM +0100, Peter Zijlstra wrote:
> Perhaps the saner way to write this is:
>
> #ifdef CONFIG_SMP
> BUG_ON(!sched_class_above(&stop_sched_class, &dl_sched_class));
> #endif
> BUG_ON(!sched_class_above(&dl_sched_class, &rt_sched_class));
> BUG_ON(!sched_class_above(&rt_sched_class, &fair_sched_class));
> BUG_ON(!sched_class_above(&fair_sched_class, &idle_sched_class));
> #ifdef CONFIG_...
> BUG_ON(!sched_class_above(&fair_sched_class, &ext_sched_class));
> BUG_ON(!sched_class_above(&ext_sched_class, &idle_sched_class));
> #endif
Yeah, this looks way better. Will update.
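For readers unfamiliar with the helper, the ordering checks can be modeled in user space like this. The sketch assumes that sched_class_above() reduces to an address comparison in which a higher-priority class has the lower address, which is consistent with the `&idle_sched_class != &fair_sched_class + 1` checks in the quoted patch. The `classes[]` array and `class_order_ok()` are hypothetical stand-ins for the linker-ordered section and the BUG_ON() block.

```c
#include <assert.h>
#include <stdbool.h>

struct sched_class {
	const char *name;
};

/* Model of the linker-ordered table: highest priority first. */
enum { STOP, DL, RT, FAIR, EXT, IDLE, NR_CLASSES };

static const struct sched_class classes[NR_CLASSES] = {
	[STOP] = { "stop" },
	[DL]   = { "dl" },
	[RT]   = { "rt" },
	[FAIR] = { "fair" },
	[EXT]  = { "ext" },
	[IDLE] = { "idle" },
};

/* Assumption: "above" means "lower address in the table". */
static bool sched_class_above(const struct sched_class *a,
			      const struct sched_class *b)
{
	return a < b;
}

/* Mirrors the suggested checks, returning false instead of calling BUG_ON(). */
static bool class_order_ok(void)
{
	return sched_class_above(&classes[STOP], &classes[DL]) &&
	       sched_class_above(&classes[DL], &classes[RT]) &&
	       sched_class_above(&classes[RT], &classes[FAIR]) &&
	       sched_class_above(&classes[FAIR], &classes[IDLE]) &&
	       sched_class_above(&classes[FAIR], &classes[EXT]) &&
	       sched_class_above(&classes[EXT], &classes[IDLE]);
}
```

Unlike the `+ 1` comparisons, these checks only assert relative order, so adding or removing a class between two others does not require rewriting the neighbouring checks.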
> > +static inline const struct sched_class *next_active_class(const struct sched_class *class)
> > +{
> > + class++;
> > + if (!scx_enabled() && class == &ext_sched_class)
> > + class++;
> > + return class;
> > +}
> > +
> > +#define for_active_class_range(class, _from, _to) \
> > + for (class = (_from); class != (_to); class = next_active_class(class))
> > +
> > +#define for_each_active_class(class) \
> > + for_active_class_range(class, __sched_class_highest, __sched_class_lowest)
> > +
> > +/*
> > + * SCX requires a balance() call before every pick_next_task() call including
> > + * when waking up from idle.
> > + */
> > +#define for_balance_class_range(class, prev_class, end_class) \
> > + for_active_class_range(class, (prev_class) > &ext_sched_class ? \
> > + &ext_sched_class : (prev_class), (end_class))
> > +
>
> This seems quite insane; why not simply make the ext methods effective
> no-ops? Both balance and pick explicitly support that already, no?
Yeah, we can definitely do that. It's just nice to guarantee from the core
code that we aren't calling into a sched class which isn't enabled at the
moment. If you take a look at "[PATCH 20/31] sched_ext: Allow BPF schedulers
to switch all eligible tasks into sched_ext", it's used the other way around
too to elide calling into CFS when it knows that there are no tasks there.
If the core code doesn't elide the calls, we might need some gating in CFS
too.
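A user-space model of the skipping iteration, for illustration only. The `classes[]` array, the `scx_on` flag (standing in for scx_enabled()) and `count_active_classes()` are hypothetical; the loop variable is called `cls` rather than `class` purely to keep the snippet friendly to C++ compilers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sched_class {
	const char *name;
};

/* Model of the class table: highest priority first. */
enum { STOP, DL, RT, FAIR, EXT, IDLE, NR_CLASSES };

static const struct sched_class classes[NR_CLASSES] = {
	[STOP] = { "stop" },
	[DL]   = { "dl" },
	[RT]   = { "rt" },
	[FAIR] = { "fair" },
	[EXT]  = { "ext" },
	[IDLE] = { "idle" },
};

/* Stand-in for scx_enabled(). */
static bool scx_on;

/* Advance to the next class, stepping over ext while it is disabled. */
static const struct sched_class *next_active_class(const struct sched_class *cls)
{
	cls++;
	if (!scx_on && cls == &classes[EXT])
		cls++;
	return cls;
}

/* Count how many classes a full highest-to-lowest walk visits. */
static size_t count_active_classes(void)
{
	const struct sched_class *cls;
	size_t n = 0;

	for (cls = classes; cls != classes + NR_CLASSES;
	     cls = next_active_class(cls))
		n++;
	return n;
}
```

The alternative raised in the review, making the ext methods no-ops while disabled, would visit all six classes every time and rely on the ext callbacks returning early.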
> > @@ -5800,10 +5812,13 @@ static void put_prev_task_balance(struct rq *rq, struct task_struct *prev,
> > * We can terminate the balance pass as soon as we know there is
> > * a runnable task of @class priority or higher.
> > */
> > - for_class_range(class, prev->sched_class, &idle_sched_class) {
> > + for_balance_class_range(class, prev->sched_class, &idle_sched_class) {
> > if (class->balance(rq, prev, rf))
> > break;
> > }
> > +#else
> > + /* SCX needs the balance call even in UP, call it explicitly */
>
> This, *WHY* !?!
This comes from the fact that there are no strict rq boundaries and the BPF
scheduler can share scheduling queues across any subset of CPUs however they
see fit. So, task <-> rq association is flexible until the very last moment
the task gets picked for execution. For the dispatch path to support this,
it needs to be able to migrate tasks across rq's which requires unlocking
the current rq which can only be done in ->balance(). So, sched_ext uses
->balance() to run the dispatch path and thus needs it called even on UP.
Given that UP doesn't need to transfer tasks across, it might be possible to
move the whole dispatch operation into ->pick_next_task() but the current
state would be different, so it's more complicated and will likely be more
brittle.
> > @@ -5876,7 +5894,7 @@ static inline struct task_struct *pick_task(struct rq *rq)
> > const struct sched_class *class;
> > struct task_struct *p;
> >
> > - for_each_class(class) {
> > + for_each_active_class(class) {
> > p = class->pick_task(rq);
> > if (p)
> > return p;
>
>
> But this.. afaict that means that:
>
> - the whole EXT thing is incompatible with SCHED_CORE
Can you expand on why this would be? I didn't test against SCHED_CORE, so am
sure things might be broken but can't think of a reason why it'd be
fundamentally incompatible.
> - the whole EXT thing can be trivially starved by the presence of a
> single CFS/BATCH/IDLE task.
It's a similar situation w/ RT vs. CFS, which is resolved via RT having
starvation avoidance. Here, the way it's handled is a bit different, SCX has
a watchdog mechanism implemented in "[PATCH 18/31] sched_ext: Implement
runnable task stall watchdog", so if SCX tasks hang for whatever reason
including being starved by CFS, it will get aborted and all tasks will be
handed back to CFS. IOW, it's treated like any other BPF scheduler error
that can lead to stalls and is recovered the same way.
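The idea behind such a watchdog can be sketched as below. This is not the code from the referenced patch, which is not shown in this thread. `struct task_model`, `watchdog_stalled()` and the timestamp and timeout units are assumptions made up for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-task bookkeeping for the watchdog. */
struct task_model {
	bool runnable;
	uint64_t runnable_at; /* time the task last became runnable or last ran */
};

/*
 * Returns true if any runnable task has been waiting for longer than
 * @timeout. In the scheme described above, a true result would lead to
 * the BPF scheduler being aborted and all tasks being handed back to CFS.
 */
static bool watchdog_stalled(const struct task_model *tasks, size_t nr,
			     uint64_t now, uint64_t timeout)
{
	for (size_t i = 0; i < nr; i++) {
		const struct task_model *t = &tasks[i];

		if (t->runnable && now >= t->runnable_at &&
		    now - t->runnable_at > timeout)
			return true;
	}
	return false;
}
```

Note that the check does not care why a task failed to run: a buggy BPF scheduler and starvation by a higher-priority class look the same to it.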
Thanks.
--
tejun
Hello,
On Mon, Dec 12, 2022 at 03:03:20PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 29, 2022 at 10:23:13PM -1000, Tejun Heo wrote:
> > From: Dan Schatzberg <[email protected]>
> >
> > Atropos is a multi-domain BPF / userspace hybrid scheduler where the BPF
> > part does simple round robin in each domain and the userspace part
> > calculates the load factor of each domain and tells the BPF part how to load
> > balance the domains.
> >
> > This scheduler demonstrates dividing scheduling logic between BPF and
> > userspace and using rust to build the userspace part.
>
> And here I am, speaking neither Rust nor BPF.
I'm not super fluent in rust but do really enjoy whenever I get to do things
in it. What the language pulls off is actually really neat. It does take
some getting-used-to tho.
> But really, having seen some of this I long for the UMCG patches -- that
> at least was somewhat sane and trivially composes, unlike all this
> madness.
Putting aside lack of familiarity, there are several things which make the
examples including this one not very readable. e.g. how the loops have to be
structured in BPF and the inability to seamlessly access the elements of
certain BPF map types do hamper ergonomics and readability quite a bit. That
said, there are a lot of new developments in BPF which should improve many
of these areas, so hopefully things should keep getting better.
Thanks.
--
tejun
Hey Peter,
On Mon, Dec 12, 2022 at 6:03 AM Peter Zijlstra <[email protected]> wrote:
>
> On Tue, Nov 29, 2022 at 10:23:13PM -1000, Tejun Heo wrote:
> > From: Dan Schatzberg <[email protected]>
> >
> > Atropos is a multi-domain BPF / userspace hybrid scheduler where the BPF
> > part does simple round robin in each domain and the userspace part
> > calculates the load factor of each domain and tells the BPF part how to load
> > balance the domains.
> >
> > This scheduler demonstrates dividing scheduling logic between BPF and
> > userspace and using rust to build the userspace part.
>
> And here I am, speaking neither Rust nor BPF.
>
> But really, having seen some of this I long for the UMCG patches -- that
> at least was somewhat sane and trivially composes, unlike all this
> madness.
I wasn't sure if you were focusing specifically on how the BPF portion
is implemented, or on UMCG vs sched_ext. For the latter, and ignoring
the specifics of this example, the UMCG and sched_ext work are
complementary, but not mutually exclusive. UMCG is about driving
cooperative scheduling within a particular application. UMCG does not
have control over or react to external preemption, nor does it make
thread placement decisions. sched_ext is considering things more at
the system level: arbitrating fairness and preemption between
processes, deciding when and where threads run, etc., and also being
able to take application-specific hints if desired.
On Mon, Dec 12, 2022 at 11:33:12AM -1000, Tejun Heo wrote:
> > But this.. afaict that means that:
> >
> > - the whole EXT thing is incompatible with SCHED_CORE
>
> Can you expand on why this would be? I didn't test against SCHED_CORE, so am
> sure things might be broken but can't think of a reason why it'd be
> fundamentally incompatible.
For starters, SCHED_CORE doesn't use __pick_next_task() (much). But I
think you're going to have more trouble with prio_less() (which is the
3rd implementation of the scheduling function :/)
> > - the whole EXT thing can be trivially starved by the presence of a
> > single CFS/BATCH/IDLE task.
>
> It's a similar situation w/ RT vs. CFS, which is resolved via RT having
> starvation avoidance.
That is a horrible situation as is, FIFO/RR are very crap scheduling
policies for a number of reasons but we're stuck with them due to
history and POSIX :-(, that is not something you should justify anything
with.
In fact, it should be an example of what to avoid.
Specifically, FIFO/RR fail at the fundamentals of OS
abstractions -- they provide neither resource distribution nor
isolation.
> Here, the way it's handled is a bit different, SCX has
> a watchdog mechanism implemented in "[PATCH 18/31] sched_ext: Implement
> runnable task stall watchdog", so if SCX tasks hang for whatever reason
> including being starved by CFS, it will get aborted and all tasks will be
> handed back to CFS. IOW, it's treated like any other BPF scheduler error
> that can lead to stalls and is recovered the same way.
That all sounds quite terrible.. :/
When the scheduler isn't available it should be an error to switch a
task to the policy, when there are tasks in the policy, it must not go
away.
The policy itself should never cause policy changes.
On Mon, Dec 12, 2022 at 11:33:12AM -1000, Tejun Heo wrote:
> > > @@ -5800,10 +5812,13 @@ static void put_prev_task_balance(struct rq *rq, struct task_struct *prev,
> > > * We can terminate the balance pass as soon as we know there is
> > > * a runnable task of @class priority or higher.
> > > */
> > > - for_class_range(class, prev->sched_class, &idle_sched_class) {
> > > + for_balance_class_range(class, prev->sched_class, &idle_sched_class) {
> > > if (class->balance(rq, prev, rf))
> > > break;
> > > }
> > > +#else
> > > + /* SCX needs the balance call even in UP, call it explicitly */
> >
> > This, *WHY* !?!
>
> This comes from the fact that there are no strict rq boundaries and the BPF
> scheduler can share scheduling queues across any subset of CPUs however they
> see fit. So, task <-> rq association is flexible until the very last moment
> the task gets picked for execution. For the dispatch path to support this,
> it needs to be able to migrate tasks across rq's which requires unlocking
> the current rq which can only be done in ->balance(). So, sched_ext uses
> ->balance() to run the dispatch path and thus needs it called even on UP.
Fundamentally none of that makes sense, on UP there is no placement,
there is only the one CPU, very little choice to be had.
> Given that UP doesn't need to transfer tasks across, it might be possible to
> move the whole dispatch operation into ->pick_next_task() but the current
> state would be different, so it's more complicated and will likely be more
> brittle.
That sounds like something is amiss, you fundamentally hold all the
right locks, there is only one.
On Mon, Dec 12, 2022 at 01:05:47PM -0800, Peter Oskolkov wrote:
> Peter Zijlstra wrote:
>
> > I long for the UMCG patches -- that
> > at least was somewhat sane and trivially composes, unlike all this
> > madness.
>
> A surprise, to be sure, but a welcome one!
Well, I did somewhat like it as I put significant effort into it. In
fact I was >.< near to merging the lot when you changed your mind and
went the syscall route.
At that point something happened (I fell ill or a holiday or something)
but it basically got snowed under as I had already been ignoring the
inbox while doing the UMCG thing and I just never managed to get back to
it (like a 1000 other things :/ -- this maintainer thing really is like
drinking from a firehose).
On Mon, Dec 12, 2022 at 02:18:59PM -0800, Josh Don wrote:
> > But really, having seen some of this I long for the UMCG patches -- that
> > at least was somewhat sane and trivially composes, unlike all this
> > madness.
>
> I wasn't sure if you were focusing specifically on how the BPF portion
> is implemented, or on UMCG vs sched_ext. For the latter,
The latter, from where I'm sitting UMCG looks a *TON* saner than this
BPF scheduler proposal. In fact, I'm >< close to just saying NAK to the
whole thing and ignoring it henceforth, there's too many problems with
the whole approach.
( Many were already noted by Linus when he NAK'ed loadable schedulers
previously. )
> and ignoring
> the specifics of this example, the UMCG and sched_ext work are
> complementary, but not mutually exclusive. UMCG is about driving
> cooperative scheduling within a particular application. UMCG does not
> have control over or react to external preemption,
It can control preemption inside the process, and if you have the degree
of control you need to make the whole BPF thing work, you also have the
degree of control to ensure you only run the one server task on a CPU
and all that no longer matters because there's only the process and you
control preemption inside that.
> nor does it make thread placement decisions.
It can do that just fine -- inside the process. UMCG has full control
over which server task a worker task is associated with, then run a
single server task per CPU and have them pinned and you get full
placement control.
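
The pinning half of that arrangement can be sketched with the existing
affinity syscalls. This is only an illustration: UMCG itself is not modeled,
the "server" threads are plain Python threads, and run_pinned_servers() and
the work callback are hypothetical names. os.sched_setaffinity() is
Linux-only.

```python
import os
import threading


def run_pinned_servers(work):
    """Start one 'server' thread per allowed CPU and pin each to its CPU.

    Returns {cpu: (affinity mask seen by the thread, work(cpu))}.
    """
    cpus = sorted(os.sched_getaffinity(0))
    results = {}

    def server(cpu):
        # pid 0 means "the calling thread", so only this thread is pinned.
        os.sched_setaffinity(0, {cpu})
        results[cpu] = (os.sched_getaffinity(0), work(cpu))

    threads = [threading.Thread(target=server, args=(cpu,)) for cpu in cpus]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    for cpu, (mask, value) in run_pinned_servers(lambda cpu: cpu * 2).items():
        print(cpu, mask, value)
```

Each thread ends up with a single-CPU affinity mask; which worker a server
then runs is the part UMCG would add on top.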
> sched_ext is considering things more at
> the system level: arbitrating fairness and preemption between
> processes, deciding when and where threads run, etc., and also being
> able to take application-specific hints if desired.
sched_ext does fundamentally not compose, you cannot run two different
schedulers for two different application stacks that happen to co-reside
on the same machine.
While with UMCG that comes naturally.
sched_ext also sits at the very bottom of the class stack (it more or
less has to) the result is that in order to use it at all, you have to
have control over all runnable tasks in the system (a stray CFS task
would interfere quite disastrously) but that is exactly the same
constraint you need to make UMCG work.
Conversely, it is very hard to use the BPF thing to do what UMCG can do.
Using UMCG I can have a SCHED_DEADLINE server implement a task based
pipeline schedule (something that's fairly common and really hard to
pull off with just SCHED_DEADLINE itself).
Additionally, UMCG naturally works with things like Proxy Execution,
seeing how the server task *is* a proxy for the current active worker
task.
Hello,
On Tue, Dec 13, 2022 at 11:57:12AM +0100, Peter Zijlstra wrote:
> > Given that UP doesn't need to transfer tasks across, it might be possible to
> > move the whole dispatch operation into ->pick_next_task() but the current
> > state would be different, so it's more complicated and will likely be more
> > brittle.
>
> That sounds like something is amiss, you fundamentally hold all the
> right locks, there is only one.
Yeah, locking is not the problem on UP. It's more that balance() is before
put_prev_task() and pick_next_task() after, so the %current task state is
different between the two points. I'll see if I can square that from SCX
side. I don't see a reason why it wouldn't be possible but it's likely more
complicated than adding a call in the same spot for UP.
Thanks.
--
tejun
Hello,
On Tue, Dec 13, 2022 at 11:55:10AM +0100, Peter Zijlstra wrote:
> On Mon, Dec 12, 2022 at 11:33:12AM -1000, Tejun Heo wrote:
>
> > > But this.. afaict that means that:
> > >
> > > - the whole EXT thing is incompatible with SCHED_CORE
> >
> > Can you expand on why this would be? I didn't test against SCHED_CORE, so am
> > sure things might be broken but can't think of a reason why it'd be
> > fundamentally incompatible.
>
> For starters, SCHED_CORE doesn't use __pick_next_task() (much). But I
SCX implements ->pick_task() and the CORE selection path calls ->balance()
and then ->pick_task(). That should work, right? Will test later.
> think you're going to have more trouble with prio_less() (which is the
> 3rd implementation of the scheduling function :/)
Can't it take the same approach as CFS? The BPF scheduler is gonna be the
one defining the relative priorities among SCX tasks, so that's where the
decision belongs.
> > > - the whole EXT thing can be trivially starved by the presence of a
> > > single CFS/BATCH/IDLE task.
> >
> > It's a similar situation w/ RT vs. CFS, which is resolved via RT having
> > starvation avoidance.
>
> That is a horrible situation as is, FIFO/RR are very crap scheduling
> policies for a number of reasons but we're stuck with them due to
> history and POSIX :-(, that is not something you should justify anything
> with.
>
> In fact, it should be an example of what to avoid.
>
> Specifically, FIFO/RR fail at the fundamentals of OS
> abstractions -- they provide neither resource distribution nor
> isolation.
>
> > Here, the way it's handled is a bit different, SCX has
> > a watchdog mechanism implemented in "[PATCH 18/31] sched_ext: Implement
> > runnable task stall watchdog", so if SCX tasks hang for whatever reason
> > including being starved by CFS, it will get aborted and all tasks will be
> > handed back to CFS. IOW, it's treated like any other BPF scheduler errors
> > that can lead to stalls and recovered the same way.
>
> That all sounds quite terrible.. :/
The main source of difference is that we can't implicitly trust the BPF
scheduler and if it malfunctions or on user request, the system should
always be recoverable, so there are some extra things which are inherently
necessary to support that.
> When the scheduler isn't available it should be an error to switch a
> task to the policy, when there are tasks in the policy, it must not go
> away.
Yeah, this part is an interface design choice. Currently, when the BPF
scheduler fails or is not present for any reason, SCX falls back to CFS
because that seemed like the least invasive way to go about it, but it's
trivial to just let SCX do dumb FIFO scheduling with the global DSQ instead,
which in fact is already used during transition to guarantee forward
progress.
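
As a rough illustration of the watchdog-and-fallback behavior described
above, here is a toy model. It is not code from the patch set: the Task
fields, watchdog_check() and the timeout value are all made up for the
sketch, and the real watchdog lives in the kernel.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    name: str
    policy: str = "scx"                  # "scx" or "cfs"
    runnable_at: Optional[float] = None  # when it became runnable; None if not waiting


def watchdog_check(tasks: List[Task], now: float, timeout: float) -> bool:
    """Return True if any SCX task has waited longer than `timeout`.

    On a stall the (hypothetical) BPF scheduler is considered aborted and
    every SCX task is handed back to CFS.
    """
    stalled = any(
        t.policy == "scx"
        and t.runnable_at is not None
        and now - t.runnable_at > timeout
        for t in tasks
    )
    if stalled:
        for t in tasks:
            if t.policy == "scx":
                t.policy = "cfs"
    return stalled


if __name__ == "__main__":
    tasks = [Task("a", runnable_at=0.0), Task("b", runnable_at=9.0)]
    print(watchdog_check(tasks, now=5.0, timeout=30.0))   # nothing stalled yet
    print(watchdog_check(tasks, now=31.0, timeout=30.0))  # "a" has waited too long
    print([t.policy for t in tasks])
```

The cause of the stall does not matter to the check: a buggy BPF scheduler
and starvation by a higher class look the same and are recovered the same
way.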
Thanks.
--
tejun
On Tue, 2022-12-13 at 08:12 -1000, Tejun Heo wrote:
> Hello,
>
> On Tue, Dec 13, 2022 at 11:55:10AM +0100, Peter Zijlstra wrote:
> > On Mon, Dec 12, 2022 at 11:33:12AM -1000, Tejun Heo wrote:
> >
> > > Here, the way it's handled is a bit different, SCX has
> > > a watchdog mechanism implemented in "[PATCH 18/31] sched_ext:
> > > Implement
> > > runnable task stall watchdog", so if SCX tasks hang for whatever
> > > reason
> > > including being starved by CFS, it will get aborted and all tasks
> > > will be
> > > handed back to CFS. IOW, it's treated like any other BPF
> > > scheduler errors
> > > that can lead to stalls and recovered the same way.
> >
> > That all sounds quite terrible.. :/
>
> The main source of difference is that we can't implicitly trust the
> BPF
> scheduler and if it malfunctions or on user request, the system
> should
> always be recoverable, so there are some extra things which are
> inherently
> necessary to support that.
>
That makes me wonder whether loading an SCX policy
should just have that policy take over all of the
SCHED_OTHER tasks by default, and have a failure of
the policy just return those tasks to CFS?
Having the two be operative at the same time seems
to be a cause of hard to resolve issues, while simply
running all non-RT tasks under the loadable policy
could simplify both internal kernel interfaces, as
well as externally visible effects?
--
All Rights Reversed.
On Tue, Dec 13, 2022 at 3:10 AM Peter Zijlstra wrote:
>
> On Mon, Dec 12, 2022 at 01:05:47PM -0800, Peter Oskolkov wrote:
> > Peter Zijlstra wrote:
> >
> > > I long for the UMCG patches -- that
> > > at least was somewhat sane and trivially composes, unlike all this
> > > madness.
> >
> > A surprise, to be sure, but a welcome one!
>
> Well, I did somewhat like it as I put significant effort into it. In
> fact I was >.< near to merging the lot when you changed your mind and
> went the syscall route.
Yes, we are doing it via syscalls now, as managing pinned pages in the TLS
approach seemed ... excessive? Complicated? Not sure how to describe it.
But if you believe the TLS approach is sound, feel free to merge it, and we
will make it work on our side. :)
(Tejun, my apologies for hijacking your thread a bit...)
[...]
Hello,
On Tue, Dec 13, 2022 at 12:30:06PM +0100, Peter Zijlstra wrote:
> ( Many were already noted by Linus when he NAK'ed loadable schedulers
> previously. )
Yeah, many of the points Linus raised still stand. However, that was 15
years ago and the situation including hardware reality has changed a lot. As
stated in the cover letter, that makes us (and others) want to try out
various ideas but the barrier has often been too high to do so at any scale,
which BPF drastically improves. Given those, I think it'd be worthwhile to
revisit that discussion.
> sched_ext also sits at the very bottom of the class stack (it more or
> less has to) the result is that in order to use it at all, you have to
> have control over all runnable tasks in the system (a stray CFS task
> would interfere quite disastrously) but that is exactly the same
> constraint you need to make UMCG work.
One important distinction is that it's a lot easier to have control at the
system level than at the application code level. Even for us with pretty
good control over what runs in the fleet, it'd be practically impossible to
effect that level of application change across the board. The situation is
further complicated with containers which can be pretty opaque to the
system. I have a hard time seeing co-operative application-driven scheduling
working among multiple applications across the whole system. If we get to
non-fleet use-cases, it becomes even worse as you don't have enough resource
on or control over the code base you're running.
There may be some overlapping areas between SCX and UMCG but they're very
different things. After all, we can't let go of system level scheduling
because some applications have better control over their own sequencing.
As for the CFS starvation issue, I obviously don't find the currently
proposed behavior too bad - CFS is always the default scheduler and we fall
back to it whenever the BPF scheduling isn't working out whether that's
outright bugs in the BPF scheduler implementation or starvation through CFS.
That said, this comes down to what kind of behavior we wanna show to
userspace and we can implement whatever is appropriate and acceptable.
Thanks.
--
tejun
On Tue, Dec 13, 2022 at 10:40 AM Rik van Riel wrote:
>
> On Tue, 2022-12-13 at 08:12 -1000, Tejun Heo wrote:
> > Hello,
> >
> > On Tue, Dec 13, 2022 at 11:55:10AM +0100, Peter Zijlstra wrote:
> > > On Mon, Dec 12, 2022 at 11:33:12AM -1000, Tejun Heo wrote:
> > >
> > > > Here, the way it's handled is a bit different, SCX has
> > > > a watchdog mechanism implemented in "[PATCH 18/31] sched_ext:
> > > > Implement
> > > > runnable task stall watchdog", so if SCX tasks hang for whatever
> > > > reason
> > > > including being starved by CFS, it will get aborted and all tasks
> > > > will be
> > > > handed back to CFS. IOW, it's treated like any other BPF
> > > > scheduler errors
> > > > that can lead to stalls and recovered the same way.
> > >
> > > That all sounds quite terrible.. :/
> >
> > The main source of difference is that we can't implicitly trust the
> > BPF
> > scheduler and if it malfunctions or on user request, the system
> > should
> > always be recoverable, so there are some extra things which are
> > inherently
> > necessary to support that.
> >
> That makes me wonder whether loading an SCX policy
> should just have that policy take over all of the
> SCHED_OTHER tasks by default, and have a failure of
> the policy just return those tasks to CFS?
>
> Having the two be operative at the same time seems
> to be a cause of hard to resolve issues, while simply
> running all non-RT tasks under the loadable policy
> could simplify both internal kernel interfaces, as
> well as externally visible effects?
There are reasons to want to still have CFS available even when SCX is
loaded. For example, on a partitioned shared tenant machine, moving
one application to an SCX policy without needing to move everyone. Or,
wanting to avoid scheduling things like kthreads under an SCX policy,
since that would require an SCX policy writer to consider not only
the needs of application threads, but also those of kernel
threads.
On Mon, Dec 12, 2022 at 2:14 AM Peter Zijlstra wrote:
>
> On Tue, Nov 29, 2022 at 10:22:42PM -1000, Tejun Heo wrote:
>
> > Rolling out kernel upgrades is a slow and iterative process. At a large scale
> > it can take months to roll a new kernel out to a fleet of servers. While this
> > latency is expected and inevitable for normal kernel upgrades, it can become
> > highly problematic when kernel changes are required to fix bugs. Livepatch [9]
> > is available to quickly roll out critical security fixes to large fleets, but
> > the scope of changes that can be applied with livepatching is fairly limited,
> > and would likely not be usable for patching scheduling policies. With
> > sched_ext, new scheduling policies can be rapidly rolled out to production
> > environments.
>
> I don't think we can or should use this argument to push BPF into ever
> more places.
Improving scheduling performance requires rapid iteration to explore
new policies and tune parameters, especially as hardware becomes more
heterogeneous, and applications become more complex. Waiting months
between evaluating scheduler policy changes is simply not scalable,
but this is the reality with large fleets that require time for
testing, qualification, and progressive rollout. The security angle
should be clear from how involved it was to integrate core scheduling,
for example.
> > and ignoring
> > the specifics of this example, the UMCG and sched_ext work are
> > complementary, but not mutually exclusive. UMCG is about driving
> > cooperative scheduling within a particular application. UMCG does not
> > have control over or react to external preemption,
>
> It can control preemption inside the process, and if you have the degree
> of control you need to make the whole BPF thing work, you also have the
> degree of control to ensure you only run the one server task on a CPU
> and all that no longer matters because there's only the process and you
> control preemption inside that.
To an extent yes, but this doesn't extend to the case where cpu is
overcommitted. Even when it is not overcommitted by other applications,
there is still preemption by, for example, kthreads to respond to
(necessary for microsecond scale workloads). But in general the common
case is interference from other
applications, something which is handled by a system level scheduler
like sched_ext. The application vs system level control is an
important distinction here.
> > nor does it make thread placement decisions.
>
> It can do that just fine -- inside the process. UMCG has full control
> over which server task a worker task is associated with, then run a
> single server task per CPU and have them pinned and you get full
> placement control.
Again, this doesn't really scale past single server per cpu. It is not
feasible to partition systems in this way due to the loss of
efficiency.
> > sched_ext is considering things more at
> > the system level: arbitrating fairness and preemption between
> > processes, deciding when and where threads run, etc., and also being
> > able to take application-specific hints if desired.
>
> sched_ext does fundamentally not compose, you cannot run two different
> schedulers for two different application stacks that happen to co-reside
> on the same machine.
We're actually already developing a framework (and plan to share) to
support composing an arbitrary combination of schedulers. Essentially,
a "scheduler of schedulers". This supports the case, for example, of a
system that runs most tasks under some default SCX scheduler, but
allows a particular application or group of applications to utilize a
bespoke SCX scheduler of their own.
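
To make the shape of that idea concrete, here is a toy model of a default
scheduler plus a per-group override. The framework itself is not shown in
this thread, so everything below is an assumption for illustration: the
class names, the group key, and in particular the round-robin arbitration
between sub-schedulers.

```python
from collections import deque


class FifoScheduler:
    """Sub-scheduler that runs tasks in arrival order."""

    def __init__(self):
        self.queue = deque()

    def enqueue(self, task):
        self.queue.append(task)

    def pick_next(self):
        return self.queue.popleft() if self.queue else None


class LifoScheduler(FifoScheduler):
    """Stand-in for a 'bespoke' policy: newest task first."""

    def pick_next(self):
        return self.queue.pop() if self.queue else None


class MetaScheduler:
    """Routes tasks to a per-group scheduler, falling back to the default."""

    def __init__(self, default):
        self.default = default
        self.by_group = {}
        self._order = [default]
        self._turn = 0

    def register(self, group, sched):
        self.by_group[group] = sched
        self._order.append(sched)

    def enqueue(self, task, group=None):
        self.by_group.get(group, self.default).enqueue(task)

    def pick_next(self):
        # Round-robin across sub-schedulers, skipping the empty ones.
        for i in range(len(self._order)):
            idx = (self._turn + i) % len(self._order)
            task = self._order[idx].pick_next()
            if task is not None:
                self._turn = (idx + 1) % len(self._order)
                return task
        return None


if __name__ == "__main__":
    meta = MetaScheduler(FifoScheduler())
    meta.register("db", LifoScheduler())
    for name in ("a", "b"):
        meta.enqueue(name)
    for name in ("x", "y"):
        meta.enqueue(name, group="db")
    print([meta.pick_next() for _ in range(5)])
```

How CPU time is actually divided between the sub-schedulers is the
interesting policy question; round-robin is used here only to keep the
sketch short.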
> sched_ext also sits at the very bottom of the class stack (it more or
> less has to) the result is that in order to use it at all, you have to
> have control over all runnable tasks in the system (a stray CFS task
> would interfere quite disastrously) but that is exactly the same
> constraint you need to make UMCG work.
UMCG still works when mixed with other tasks. You're specifying which
threads of your application you want running, but no guarantees are
made that they'll run right now if the system has other work to do.
SCX vs CFS is a more interesting story. Yes it is true that a single
CFS task could hog a cpu, but since SCX is managing things at a system
level, we feel that this is something that should be handled by system
administration. You shouldn't expect to mix cpu bound CFS tasks in the
same partition as threads running under SCX with good results.
> Conversely, it is very hard to use the BPF thing to do what UMCG can do.
> Using UMCG I can have a SCHED_DEADLINE server implement a task based
> pipeline schedule (something that's fairly common and really hard to
> pull off with just SCHED_DEADLINE itself).
UMCG and SCX are solving different problems though. An application can
decide execution order or control internal preemption via UMCG, while
SCX arbitrates allocation of system resources over time.
And, conversely, SCX can do things very difficult or impossible with
UMCG. For example, implementing core scheduling. Guaranteeing
microsecond scale tail latency. Applying a new scheduling algorithm
across multiple independent applications.
> Additionally, UMCG naturally works with things like Proxy Execution,
> seeing how the server task *is* a proxy for the current active worker
> task.
Proxy execution should also work with SCX; the enqueue/dequeue
abstraction can still be used to allow the SCX scheduler to select the
proxy.
On Tue, Dec 13, 2022 at 06:11:38PM -0800, Josh Don wrote:
> Improving scheduling performance requires rapid iteration to explore
> new policies and tune parameters, especially as hardware becomes more
> heterogeneous, and applications become more complex. Waiting months
> between evaluating scheduler policy changes is simply not scalable,
> but this is the reality with large fleets that require time for
> testing, qualification, and progressive rollout. The security angle
> should be clear from how involved it was to integrate core scheduling,
> for example.
Surely you can evaluate stuff on a small subset of machines -- I'm
fairly sure I've had google and facebook people tell me they do just
that, roll out the test kernel on tens to hundreds of thousands of
machines instead of the stupid number and see how it behaves there.
Statistics has something here I think, you can get a reliable
representation of stuff without having to sample *everyone*.
I was given to believe this was a fairly rapid process.
Just because you guys have more machines than is reasonable, doesn't
mean we have to put BPF everywhere.
Additionally, we don't merge and ship everybody's random debug patch
either -- you're free to do whatever you need to iterate on your own and
then send the patches that result from this experiment upstream. This is
how development works, no?
Hello,
On Wed, Dec 14, 2022 at 09:55:38AM +0100, Peter Zijlstra wrote:
> On Tue, Dec 13, 2022 at 06:11:38PM -0800, Josh Don wrote:
> > Improving scheduling performance requires rapid iteration to explore
> > new policies and tune parameters, especially as hardware becomes more
> > heterogeneous, and applications become more complex. Waiting months
> > between evaluating scheduler policy changes is simply not scalable,
> > but this is the reality with large fleets that require time for
> > testing, qualification, and progressive rollout. The security angle
> > should be clear from how involved it was to integrate core scheduling,
> > for example.
>
> Surely you can evaluate stuff on a small subset of machines -- I'm
> fairly sure I've had google and facebook people tell me they do just
> that, roll out the test kernel on tens to hundreds of thousands of
> machines instead of the stupid number and see how it behaves there.
>
> Statistics has something here I think, you can get a reliable
> representation of stuff without having to sample *everyone*.
Google guys probably have a lot to say here too and there may be many
commonalities, but here's how things are on our end.
We (Meta) experiment and debug at multiple levels. For example, when
qualifying a new kernel or feature, a common pattern we follow is
two-phased. The first phase is testing it on several well-known and widely
used workloads in a controlled experiment environment with a smaller number of
machines, usually some dozens but can go one or two orders of magnitude
higher. Once that looks okay, the second phase is to gradually deploy while
monitoring system-level behaviors (crashes, utilization, latency and
pressure metrics and so on) and feedback from service owners.
We run tens of thousands of different workloads in the fleet and we try hard
to do as much as possible in the first phase but many of the difficult and
subtle problems are only detectable in the second phase. When we detect such
problems in the second phase, we triage the problem and pull back deployment
if necessary and then restart after fixing.
As the overused saying goes, quantity has a quality of its own. The
workloads become largely opaque because there are too many of them doing too
many different things for anyone from the system side to examine each of them.
In many cases, the best and sometimes only visibility we get is statistical
- comparing two chunks of the fleet which are large enough for the
statistical signals to overcome the noise. That threshold can be pretty
high. Multiple hundreds of thousands of machines being used for a test set
isn't all that uncommon.
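
For a rough sense of why that threshold gets so high, the textbook
two-sample size estimate can be sketched as below. The inputs are made-up
numbers for illustration (10% machine-to-machine variation, a 0.1% effect
to detect), not measurements from any fleet, and machines_per_group() is a
hypothetical helper.

```python
import math
from statistics import NormalDist


def machines_per_group(rel_stddev, rel_effect, alpha=0.05, power=0.8):
    """Two-sided, two-sample z-test size estimate.

    n = 2 * ((z_{1-alpha/2} + z_{power}) * sigma / delta)^2 per group.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * rel_stddev / rel_effect) ** 2)


if __name__ == "__main__":
    print(machines_per_group(rel_stddev=0.10, rel_effect=0.001))
```

With these assumed inputs the estimate comes out at well over a hundred
thousand machines per group, and it grows with the square of the ratio
between noise and effect size.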
One complicating factor for the second phase is that we're deploying on
production fleet running live production workloads. Besides the obvious fact
that users become mightily unhappy when machines crash, there are
complicating matters like limits on how many and which machines can be
rebooted at any given time due to interactions with capacity and maintenance
which severely restricts how fast kernels can be iterated. A full sweep
through the fleet can easily take months.
Between a large number of opaque workloads and production constraints which
limit the type and speed of kernel iterations, our ability to experiment
with scheduling by modifying the kernel directly is severely limited. We can
do small things but trying out big ideas can become logistically
prohibitive.
Note that all these get even worse for public cloud operators. If we really
need to, we can at least find the service owner and talk with them. For
public cloud operators, the workloads are truly opaque.
There's yet another aspect which is caused by fleet dynamism. When we're
hunting down a scheduling misbehavior and want to test out specific ideas,
it can actually be pretty difficult to get back the same workload
composition after a reboot or crash. The fleet management layer will kick in
right away and the workloads get reallocated who-knows-where. This problem
is likely shared by smaller scale operations too. There are just a lot of
layers which are difficult to fixate across reboots and crashes. Even in the
same workload, the load balancer or dispatcher might behave very differently
for the machine after a reboot.
> I was given to believe this was a fairly rapid process.
Going back to the first phase where we're experimenting in a more controlled
environment. Yes, that is a faster process but only in comparison to the
second phase. Some controlled experiments, the faster ones, usually take
several hours to obtain a meaningful result. It just takes a while for
production workloads to start, jit-compile all the hot code paths, warm up
caches and so on. Others, unfortunately, take a lot longer to ramp up to the
degree where they can be compared against production numbers. Some of the
benchmarks stretch multiple days.
With SCX, we can just keep hotswapping and tuning the scheduler
behavior, getting results in tens of minutes instead of multiple hours and
without worrying about crashing the test machines, which often have
side-effects on the benchmark setup - the benchmarks are often performed
with shadowed production traffic using the same production software and they
get unhappy when a lot of machines crash. These problems can easily take
hours to resolve.
> Just because you guys have more machines than is reasonable, doesn't
> mean we have to put BPF everywhere.
There are some problems which are specific to large operators like us or
google for sure, but many of these problems are shared by other use cases
which need to test with real-world applications. Even on mobile devices,
it's way easier and faster to have a running test environment setup and
iterate through scheduling behavior changes without worrying about crashing
the machine than having to cycle and re-setup test setup for each iteration.
The productivity gain extends to individual kernel developers and
researchers. Just rebooting server class hardware often takes upwards of
ten minutes, so most of us try to iterate as much on VMs as possible which
unfortunately doesn't work out too well for subtle performance issues. SCX
can easily cut down iteration time by an order of magnitude or more.
> Additionally, we don't merge and ship everybody's random debug patch
> either -- you're free to do whatever you need to iterate on your own and
> then send the patches that result from this experiment upstream. This is
> how development works, no?
We of course don't merge random debug patches which have limited usefulness
to a small number of use cases. However, we absolutely do ship code to
support debugging and development when the benefit outweighs the cost, just
to list several examples - lockdep, perf, tracing, all the memory debug
options.
The argument is that given the current situation including hardware and
software landscape, having a BPF extensible scheduling framework has
enough benefits to justify the cost.
Thanks.
--
tejun
On 12/14/22 17:23, Tejun Heo wrote:
> Google guys probably have a lot to say here too and there may be many
> commonalities, but here's how things are on our end.
your email pretty much captures my experiences from the google side. in
fact, i think i'll save it for the next time someone asks me to
summarize the challenges with both kernel rollouts and testing changes
on workloads. =)
>> I was given to believe this was a fairly rapid process.
>
> Going back to the first phase where we're experimenting in a more controlled
> environment. Yes, that is a faster process but only in comparison to the
> second phase. Some controlled experiments, the faster ones, usually take
> several hours to obtain a meaningful result. It just takes a while for
> production workloads to start, jit-compile all the hot code paths, warm up
> caches and so on. Others, unfortunately, take a lot longer to ramp up to the
> degree where they can be compared against production numbers. Some of the
> benchmarks stretch multiple days.
>
> With SCX, we can just keep hotswapping and tuning the scheduler
> behavior getting results in tens of minutes instead of multiple hours and
> without worrying about crashing the test machines
for testing sched policies on one of our bigger apps, the O(hours)
kernel reboot vs O(minutes) reload of a BPF scheduler is a pain. but
that's only for a single machine; it can be much worse on a full cluster.
full-cluster tests are a different beast. we are one of many groups
that want to do testing, and we have to reserve a time on their cluster.
but to change the kernel, it actually took us weeks to coordinate a
kernel change on the app's large testing cluster - essentially since we
were using an unqualified kernel, we 'blocked' all of the other testing.
> it's way easier and faster to have a running test environment setup and
> iterate through scheduling behavior changes without worrying about crashing
> the machine than having to cycle and re-setup test setup for each iteration.
i'm a newcomer to BPF, but for me the "interaction with live machine" is
a major BPF feature, both in SCX and also more broadly with the various
tracing tools and other BPF uses. (not to mention the per-workload or
per-machine customization that BPF enables, but that's a separate
discussion).
thanks,
barret