2023-01-28 00:17:01

by Tejun Heo

[permalink] [raw]
Subject: [PATCHSET v2] sched: Implement BPF extensible scheduler class

Changes
-------

This is v2 of sched_ext (SCX) patchset. The followings are changes from v1
(http://lkml.kernel.org/r/[email protected]).

- Rebased on top of bpf/for-next - a5f6b9d577eb ("Merge branch 'Enable
struct_ops programs to be sleepable'"). There were several missing
features including generic cpumask helpers and sleepable struct_ops
operation support that v1 was working around. The rebase gets rid of all
SCX specific temporary helpers.

- Some kfunc helpers are context-sensitive and can only be called from
specific operations. v1 didn't restrict kfunc accesses allowing them to be
misused which can lead to crashes and other malfunctions. v2 makes more
kfuncs safe to be called from anywhere and implements per-task mask based
runtime access control for the rest. The longer-term plan is to make the
BPF verifier enforce these restrictions. Combined with the above, sans
mistakes and bugs, it shouldn't be possible to crash the machine through
SCX and its helpers.

- Core-sched support. While v1 implemented the pick_task operation, there
were multiple missing pieces for working core-sched support. v2 adds
0027-sched_ext-Implement-core-sched-support.patch. SCX by default
implements global FIFO ordering and allows the BPF schedulers to implement
custom ordering via scx_ops.core_sched_before(). scx_example_qmap is
updated so that the five queues' relative priorities are correctly
reflected when core-sched is enabled.

- Dropped balance_scx_on_up() which was called from put_prev_task_balance().
UP support is now contained in SCX proper.

- 0002-sched-Encapsulate-task-attribute-change-sequence-int.patch adds
SCHED_CHANGE_BLOCK() which encapsulates the preparation and restoration
sequences used for task attribute changes. For SCX, this replaces
sched_deq_and_put_task() and sched_enq_and_set_task() from v1.

- 0011-sched-Add-reason-to-sched_move_task.patch dropped from v1. SCX now
distinguishes cgroup and autogroup tg's using task_group_is_autogroup().

- Other misc changes including fixes for bugs that Julia Lawall noticed and
patch descriptions updates with more details on how the introduced changes
are going to be used.

- MAINTAINERS entries added.

The followings are discussion points which were raised but didn't result in
code changes in this iteration.

- There were discussions around exposing __setscheduler_prio() and, in v2,
SCHED_CHANGE_BLOCK() in kernel/sched/sched.h. Switching scheduler
implementations is innate for SCX. At the very least, it needs to be able
to turn on and off the BPF scheduler which requires something equivalent
to SCHED_CHANGE_BLOCK(). The use of __setscheduler_prio() depends on the
behavior we want to present to userspace. The current one of using CFS as
the fallback when BPF scheduler is not available seems more friendly and
less error-prone to other options.

- Another discussion point was around for_each_active_class() and friends
which skip over CFS or SCX when it's known that the sched_class must be
empty. I left it as-is for now as it seems to be cleaner and more robust
than trying to plug each operation which may added unnecessary overheads.

Overview
--------

This patch set proposes a new scheduler class called ‘ext_sched_class’, or
sched_ext, which allows scheduling policies to be implemented as BPF programs.

More details will be provided on the overall architecture of sched_ext
throughout the various patches in this set, as well as in the “How” section
below. We realize that this patch set is a significant proposal, so we will be
going into depth in the following “Motivation” section to explain why we think
it’s justified. That section is laid out as follows, touching on three main
axes where we believe that sched_ext provides significant value:

1. Ease of experimentation and exploration: Enabling rapid iteration of new
scheduling policies.

2. Customization: Building application-specific schedulers which implement
policies that are not applicable to general-purpose schedulers.

3. Rapid scheduler deployments: Non-disruptive swap outs of scheduling
policies in production environments.

After the motivation section, we’ll provide a more detailed (but still
high-level) overview of how sched_ext works.

Motivation
----------

1. Ease of experimentation and exploration

*Why is exploration important?*

Scheduling is a challenging problem space. Small changes in scheduling
behavior can have a significant impact on various components of a system, with
the corresponding effects varying widely across different platforms,
architectures, and workloads.

While complexities have always existed in scheduling, they have increased
dramatically over the past 10-15 years. In the mid-late 2000s, cores were
typically homogeneous and further apart from each other, with the criteria for
scheduling being roughly the same across the entire die.

Systems in the modern age are by comparison much more complex. Modern CPU
designs, where the total power budget of all CPU cores often far exceeds the
power budget of the socket, with dynamic frequency scaling, and with or
without chiplets, have significantly expanded the scheduling problem space.
Cache hierarchies have become less uniform, with Core Complex (CCX) designs
such as recent AMD processors having multiple shared L3 caches within a single
socket. Such topologies resemble NUMA sans persistent NUMA node stickiness.

Use-cases have become increasingly complex and diverse as well. Applications
such as mobile and VR have strict latency requirements to avoid missing
deadlines that impact user experience. Stacking workloads in servers is
constantly pushing the demands on the scheduler in terms of workload isolation
and resource distribution.

Experimentation and exploration are important for any non-trivial problem
space. However, given the recent hardware and software developments, we
believe that experimentation and exploration are not just important, but
_critical_ in the scheduling problem space.

Indeed, other approaches in industry are already being explored. AMD has
proposed an experimental patch set [0] which enables userspace to provide
hints to the scheduler via “Userspace Hinting”. The approach adds a prctl()
API which allows callers to set a numerical “hint” value on a struct
task_struct. This hint is then optionally read by the scheduler to adjust the
cost calculus for various scheduling decisions.

[0]: https://lore.kernel.org/lkml/[email protected]/

Huawei have also expressed interest [1] in enabling some form of programmable
scheduling. While we’re unaware of any patch sets which have been sent to the
upstream list for this proposal, it similarly illustrates the need for more
flexibility in the scheduler.

[1]: https://lore.kernel.org/bpf/[email protected]/

Additionally, Google has developed ghOSt [2] with the goal of enabling custom,
userspace driven scheduling policies. Prior presentations at LPC [3] have
discussed ghOSt and how BPF can be used to accelerate scheduling.

[2]: https://dl.acm.org/doi/pdf/10.1145/3477132.3483542
[3]: https://lpc.events/event/16/contributions/1365/

*Why can’t we just explore directly with CFS?*

Experimenting with CFS directly or implementing a new sched_class from scratch
is of course possible, but is often difficult and time consuming. Newcomers to
the scheduler often require years to understand the codebase and become
productive contributors. Even for seasoned kernel engineers, experimenting
with and upstreaming features can take a very long time. The iteration process
itself is also time consuming, as testing scheduler changes on real hardware
requires reinstalling the kernel and rebooting the host.

Core scheduling is an example of a feature that took a significant amount of
time and effort to integrate into the kernel. Part of the difficulty with core
scheduling was the inherent mismatch in abstraction between the desire to
perform core-wide scheduling, and the per-cpu design of the kernel scheduler.
This caused issues, for example ensuring proper fairness between the
independent runqueues of SMT siblings.

The high barrier to entry for working on the scheduler is an impediment to
academia as well. Master’s/PhD candidates who are interested in improving the
scheduler will spend years ramping-up, only to complete their degrees just as
they’re finally ready to make significant changes. A lower entrance barrier
would allow researchers to more quickly ramp up, test out hypotheses, and
iterate on novel ideas. Research methodology is also severely hampered by the
high barrier of entry to make modifications; for example, the Shenango [4] and
Shinjuku scheduling policies used sched affinity to replicate the desired
policy semantics, due to the difficulty of incorporating these policies into
the kernel directly.

[4]: https://www.usenix.org/system/files/nsdi19-ousterhout.pdf

The iterative process itself also imposes a significant cost to working on the
scheduler. Testing changes requires developers to recompile and reinstall the
kernel, reboot their machines, rewarm their workloads, and then finally rerun
their benchmarks. Though some of this overhead could potentially be mitigated
by enabling schedulers to be implemented as kernel modules, a machine crash or
subtle system state corruption is always only one innocuous mistake away.
These problems are exacerbated when testing production workloads in a
datacenter environment as well, where multiple hosts may be involved in an
experiment; requiring a significantly longer ramp up time. Warming up memcache
instances in the Meta production environment takes hours, for example.

*How does sched_ext help with exploration?*

sched_ext attempts to address all of the problems described above. In this
section, we’ll describe the benefits to experimentation and exploration that
are afforded by sched_ext, provide real-world examples of those benefits, and
discuss some of the trade-offs and considerations in our design choices.

One of our main goals was to lower the barrier to entry for experimenting with
the scheduler. sched_ext provides ergonomic callbacks and helpers to ease
common operations such as managing idle CPUs, scheduling tasks on arbitrary
CPUs, handling preemptions from other scheduling classes, and more. While
sched_ext does require some ramp-up, the complexity is self-contained, and the
learning curve gradual. Developers can ramp up by first implementing simple
policies such as global FIFO in only tens of lines of code, and then continue
to learn the APIs and building blocks available with sched_ext as they build
more featureful and complex schedulers.

Another critical advantage provided by sched_ext is the use of BPF. BPF
provides strong safety guarantees by statically analyzing programs at load
time to ensure that they cannot corrupt or crash the system. sched_ext
guarantees system integrity no matter what BPF scheduler is loaded, and
provides mechanisms to safely disable the current BPF scheduler and migrate
tasks back to a trusted scheduler. For example, we also implement in-kernel
safety mechanisms to guarantee that a misbehaving scheduler cannot
indefinitely starve tasks. BPF also enables sched_ext to significantly improve
iteration speed for running experiments. Loading and unloading a BPF scheduler
is simply a matter of running and terminating a sched_ext binary.

BPF also provides programs with a rich set of APIs, such as maps, kfuncs, and
BPF helpers. In addition to providing useful building blocks to programs that
run entirely in kernel space (such as many of our example schedulers), these
APIs also allow programs to leverage user space in making scheduling
decisions. Specifically, the Atropos sample scheduler has a relatively simple
FIFO scheduling layer in BPF, paired with a load balancing component in
userspace written in Rust. As described in more detail below, we also built a
more general user-space scheduling framework called “rhone” by leveraging
various BPF features.

On the other hand, BPF does have shortcomings, as can be plainly seen from the
complexity in some of the example schedulers. scx_example_pair.bpf.c
illustrates this point well. To start, it requires a good amount of code to
emulate cgroup-local-storage. In the kernel proper, this would simply be a
matter of adding another pointer to the struct cgroup, but in BPF, it requires
a complex juggling of data amongst multiple different maps, a good amount of
boilerplate code, and some unwieldy bpf_loop()‘s and atomics. The code is also
littered with explicit and often unnecessary sanity checks to appease the
verifier.

That being said, BPF is being rapidly improved. For example, Yonghong Song
recently upstreamed a patch set [5] to add a cgroup local storage map type,
allowing scx_example_pair.bpf.c to be simplified. There are plans to address
other issues as well, such as providing statically-verified locking, and
avoiding the need for unnecessary sanity checks. Addressing these shortcomings
is a high priority for BPF, and as progress continues to be made, we expect
most deficiencies to be addressed in the not-too-distant future.

[5]: https://lore.kernel.org/bpf/[email protected]/

Yet another exploration advantage of sched_ext is helping widening the scope
of experiments. For example, sched_ext makes it easy to defer CPU assignment
until a task starts executing, allowing schedulers to share scheduling queues
at any granularity (hyper-twin, CCX and so on). Additionally, higher level
frameworks can be built on top to further widen the scope. For example, the
aforementioned “rhone” [6] library allows implementing scheduling policies in
user-space by encapsulating the complexity around communicating scheduling
decisions with the kernel. This allows taking advantage of a richer
programming environment in user-space, enabling experimenting with, for
instance, more complex mathematical models.

[6]: https://github.com/Decave/rhone

sched_ext also allows developers to leverage machine learning. At Meta, we
experimented with using machine learning to predict whether a running task
would soon yield its CPU. These predictions can be used to aid the scheduler
in deciding whether to keep a runnable task on its current CPU rather than
migrating it to an idle CPU, with the hope of avoiding unnecessary cache
misses. Using a tiny neural net model with only one hidden layer of size 16,
and a decaying count of 64 syscalls as a feature, we were able to achieve a
15% throughput improvement on an Nginx benchmark, with an 87% inference
accuracy.

2. Customization

This section discusses how sched_ext can enable users to run workloads on
application-specific schedulers.

*Why deploy custom schedulers rather than improving CFS?*

Implementing application-specific schedulers and improving CFS are not
conflicting goals. Scheduling features explored with sched_ext which yield
beneficial results, and which are sufficiently generalizable, can and should
be integrated into CFS. However, CFS is fundamentally designed to be a general
purpose scheduler, and thus is not conducive to being extended with some
highly targeted application or hardware specific changes.

Targeted, bespoke scheduling has many potential use cases. For example, VM
scheduling can make certain optimizations that are infeasible in CFS due to
the constrained problem space (scheduling a static number of long-running
VCPUs versus an arbitrary number of threads). Additionally, certain
applications might want to make targeted policy decisions based on hints
directly from the application (for example, a service that knows the different
deadlines of incoming RPCs).

Google has also experimented with some promising, novel scheduling policies.
One example is “central” scheduling, wherein a single CPU makes all scheduling
decisions for the entire system. This allows most cores on the system to be
fully dedicated to running workloads, and can have significant performance
improvements for certain use cases. For example, central scheduling with VCPUs
can avoid expensive vmexits and cache flushes, by instead delegating the
responsibility of preemption checks from the tick to a single CPU. See
scx_example_central.bpf.c for a simple example of a central scheduling policy
built in sched_ext.

Some workloads also have non-generalizable constraints which enable
optimizations in a scheduling policy which would otherwise not be feasible.
For example,VM workloads at Google typically have a low overcommit ratio
compared to the number of physical CPUs. This allows the scheduler to support
bounded tail latencies, as well as longer blocks of uninterrupted time.

Yet another interesting use case is the scx_example_cgfifo[7] scheduler,
which provides FIFO policies for individual workloads, and a flattened
hierarchical vtree for cgroups. This scheduler does not account for
thundering herd problems among cgroups, and therefore may not be suitable
for inclusion in CFS. However, in a simple benchmark using wrk[8] on apache
serving a CGI script calculating sha1sum of a small file, it outperformed
CFS by ~3% with CPU controller disabled and by ~10% with two apache
instances competing with 2:1 weight ratio nested four level deep.

Note that the scx_example_cgfifo[7] scheduler isn't included in this
patchset because it depends on the BPF rbtree which is still being
developed. The linked version is an earlier draft. We will forward-port and
include it in the series once the BPF rbtree support is in place.

[7] https://github.com/htejun/sched_ext/commit/f2fcd3147fb6286e0a35fcbed33c3bac69546a96
[8] https://github.com/wg/wrk

Certain industries require specific scheduling behaviors that do not apply
broadly. For example, ARINC 653 defines scheduling behavior that is widely
used by avionic software, and some out-of-tree implementations
(https://ieeexplore.ieee.org/document/7005306) have been built. While the
upstream community may decide to merge one such implementation in the future,
it would also be entirely reasonable to not do so given the narrowness of
use-case, and non-generalizable, strict requirements. Such cases can be well
served by sched_ext in all stages of the software development lifecycle --
development, testing, deployment and maintenance.

There are also classes of policy exploration, such as machine learning, or
responding in real-time to application hints, that are significantly harder
(and not necessarily appropriate) to integrate within the kernel itself.

*Won’t this increase fragmentation?*

We acknowledge that to some degree, sched_ext does run the risk of increasing
the fragmentation of scheduler implementations. As a result of exploration,
however, we believe that enabling the larger ecosystem to innovate will
ultimately accelerate the overall development and performance of Linux.
Additionally, our licensing and API stability policies should incentivize
users to upstream their schedulers.

BPF programs are required to be GPLv2, which is enforced by the verifier on
program loads. With regards to API stability, just as with other semi-internal
interfaces such as BPF kfuncs, we won’t be providing any API stability
guarantees to BPF schedulers. While we intend to make an effort to provide
compatibility when possible, we will not provide any explicit, strong
guarantees as the kernel typically does with e.g. UAPI headers. For users who
decide to keep their schedulers out-of-tree,the licensing and maintenance
overheads will be fundamentally the same as for carrying out-of-tree patches.

With regards to the schedulers included in this patch set, and any other
schedulers we implement in the future, both Meta and Google will open-source
all of the schedulers we implement which have any relevance to the broader
upstream community. We expect that some of these, such as the example
schedulers and scx_example_cgfifo scheduler, will be upstreamed as part of
the kernel tree. Distros will be able to package and release these
schedulers with the kernel, allowing users to utilize these schedulers
out-of-the-box without requiring any additional work or dependencies such as
clang or building the scheduler programs themselves. Other schedulers and
scheduling frameworks such as rhone may be open-sourced through separate
per-project repos.

3. Rapid scheduler deployments

Rolling out kernel upgrades is a slow and iterative process. At a large scale
it can take months to roll a new kernel out to a fleet of servers. While this
latency is expected and inevitable for normal kernel upgrades, it can become
highly problematic when kernel changes are required to fix bugs. Livepatch [9]
is available to quickly roll out critical security fixes to large fleets, but
the scope of changes that can be applied with livepatching is fairly limited,
and would likely not be usable for patching scheduling policies. With
sched_ext, new scheduling policies can be rapidly rolled out to production
environments.

[9]: https://www.kernel.org/doc/html/latest/livepatch/livepatch.html

As an example, one of the variants of the L1 Terminal Fault (L1TF) [10]
vulnerability allows a VCPU running a VM to read arbitrary host kernel
memory for pages in L1 data cache. The solution was to implement core
scheduling, which ensures that tasks running as hypertwins have the same
“cookie”.

[10]: https://www.intel.com/content/www/us/en/architecture-and-technology/l1tf.html

While core scheduling works well, it took a long time to finalize and land
upstream. This long rollout period was painful, and required organizations to
make difficult choices amongst a bad set of options. Some companies such as
Google chose to implement and use their own custom L1TF-safe scheduler, others
chose to run without hyper-threading enabled, and yet others left
hyper-threading enabled and crossed their fingers.

Once core scheduling was upstream, organizations had to upgrade the kernels on
their entire fleets. As downtime is not an option for many, these upgrades had
to be gradually rolled out, which can take a very long time for large fleets.

An example of an sched_ext scheduler that illustrates core scheduling
semantics is scx_example_pair.bpf.c, which co-schedules pairs of tasks from
the same cgroup, and is resilient to L1TF vulnerabilities. While this example
scheduler is certainly not suitable for production in its current form, a
similar scheduler that is more performant and featureful could be written and
deployed if necessary.

Rapid scheduling deployments can similarly be useful to quickly roll-out new
scheduling features without requiring kernel upgrades. At Google, for example,
it was observed that some low-priority workloads were causing degraded
performance for higher-priority workloads due to consuming a disproportionate
share of memory bandwidth. While a temporary mitigation was to use sched
affinity to limit the footprint of this low-priority workload to a small
subset of CPUs, a preferable solution would be to implement a more featureful
task-priority mechanism which automatically throttles lower-priority tasks
which are causing memory contention for the rest of the system. Implementing
this in CFS and rolling it out to the fleet could take a very long time.

sched_ext would directly address these gaps. If another hardware bug or
resource contention issue comes in that requires scheduler support to
mitigate, sched_ext can be used to experiment with and test different
policies. Once a scheduler is available, it can quickly be rolled out to as
many hosts as necessary, and function as a stop-gap solution until a
longer-term mitigation is upstreamed.

How
---

sched_ext is a new sched_class which allows scheduling policies to be
implemented in BPF programs.

sched_ext leverages BPF’s struct_ops feature to define a structure which
exports function callbacks and flags to BPF programs that wish to implement
scheduling policies. The struct_ops structure exported by sched_ext is struct
sched_ext_ops, and is conceptually similar to struct sched_class. The role of
sched_ext is to map the complex sched_class callbacks to the more simple and
ergonomic struct sched_ext_ops callbacks.

Unlike some other BPF program types which have ABI requirements due to
exporting UAPIs, struct_ops has no ABI requirements whatsoever. This provides
us with the flexibility to change the APIs provided to schedulers as
necessary. BPF struct_ops is also already being used successfully in other
subsystems, such as in support of TCP congestion control.

The only struct_ops field that is required to be specified by a scheduler is
the ‘name’ field. Otherwise, sched_ext will provide sane default behavior,
such as automatically choosing an idle CPU on the task wakeup path if
.select_cpu() is missing.

*Dispatch queues*

To bridge the workflow imbalance between the scheduler core and sched_ext_ops
callbacks, sched_ext uses simple FIFOs called dispatch queues (dsq's). By
default, there is one global dsq (SCX_DSQ_GLOBAL), and one local per-CPU dsq
(SCX_DSQ_LOCAL). SCX_DSQ_GLOBAL is provided for convenience and need not be
used by a scheduler that doesn't require it. As described in more detail
below, SCX_DSQ_LOCAL is the per-CPU FIFO that sched_ext pulls from when
putting the next task on the CPU. The BPF scheduler can manage an arbitrary
number of dsq's using scx_bpf_create_dsq() and scx_bpf_destroy_dsq().

*Scheduling cycle*

The following briefly shows a typical workflow for how a waking task is
scheduled and executed.

1. When a task is waking up, .select_cpu() is the first operation invoked.
This serves two purposes. It both allows a scheduler to optimize task
placement by specifying a CPU where it expects the task to eventually be
scheduled, and the latter is that the selected CPU will be woken if it’s
idle.

2. Once the target CPU is selected, .enqueue() is invoked. It can make one of
the following decisions:

- Immediately dispatch the task to either the global dsq (SCX_DSQ_GLOBAL)
or the current CPU’s local dsq (SCX_DSQ_LOCAL).

- Immediately dispatch the task to a user-created dispatch queue.

- Queue the task on the BPF side, e.g. in an rbtree map for a vruntime
scheduler, with the intention of dispatching it at a later time from
.dispatch().

3. When a CPU is ready to schedule, it first looks at its local dsq. If empty,
it invokes .consume() which should make one or more scx_bpf_consume() calls
to consume tasks from dsq's. If a scx_bpf_consume() call succeeds, the CPU
has the next task to run and .consume() can return. If .consume() is not
defined, sched_ext will by-default consume from only the built-in
SCX_DSQ_GLOBAL dsq.

4. If there's still no task to run, .dispatch() is invoked which should make
one or more scx_bpf_dispatch() calls to dispatch tasks from the BPF
scheduler to one of the dsq's. If more than one task has been dispatched,
go back to the previous consumption step.

*Verifying callback behavior*

sched_ext always verifies that any value returned from a callback is valid,
and will issue an error and unload the scheduler if it is not. For example, if
.select_cpu() returns an invalid CPU, or if an attempt is made to invoke the
scx_bpf_dispatch() with invalid enqueue flags. Furthermore, if a task remains
runnable for too long without being scheduled, sched_ext will detect it and
error-out the scheduler.

Closing Thoughts
----------------

Both Meta and Google have experimented quite a lot with schedulers in the last
several years. Google has benchmarked various workloads using user space
scheduling, and have achieved performance wins by trading off generality for
application specific needs. At Meta, we have not yet deployed sched_ext on any
production workloads, though our preliminary experiments indicate that
sched_ext would provide significant performance wins when deployed at scale.
If successfully upstreamed, we expect to leverage it extensively to run
various experiments and develop customized schedulers for a number of critical
workloads.

In closing, both Meta and Google believe that sched_ext will significantly
evolve how the broader community explores the scheduling problem space,
empowering continued improvement to the in-kernel scheduler, while also
enabling targeted policies for custom applications. We’ll be able to
experiment easier and faster, explore uncharted areas, and deploy emergency
scheduler changes when necessary. The same applies to anyone who wants to work
on the scheduler, including academia and specialized industries. sched_ext
will push forward the state of the art when it comes to scheduling and
performance in Linux.

Written By
----------

David Vernet <[email protected]>
Josh Don <[email protected]>
Tejun Heo <[email protected]>
Barret Rhoden <[email protected]>

Supported By
------------

Paul Turner <[email protected]>
Neel Natu <[email protected]>
Patrick Bellasi <[email protected]>
Hao Luo <[email protected]>
Dimitrios Skarlatos <[email protected]>

Patchset
--------

This patchset is on top of bpf/for-next as of 2023-01-26:

a5f6b9d577eb ("Merge branch 'Enable struct_ops programs to be sleepable'")

and contains the following patches:

NOTE: The doc added by 0028 contains a high-level overview and might be good
place to start.

0001-cgroup-Implement-cgroup_show_cftypes.patch
0002-sched-Encapsulate-task-attribute-change-sequence-int.patch
0003-sched-Restructure-sched_class-order-sanity-checks-in.patch
0004-sched-Allow-sched_cgroup_fork-to-fail-and-introduce-.patch
0005-sched-Add-sched_class-reweight_task.patch
0006-sched-Add-sched_class-switching_to-and-expose-check_.patch
0007-sched-Factor-out-cgroup-weight-conversion-functions.patch
0008-sched-Expose-css_tg-__setscheduler_prio-and-SCHED_CH.patch
0009-sched-Enumerate-CPU-cgroup-file-types.patch
0010-sched-Add-reason-to-sched_class-rq_-on-off-line.patch
0011-sched-Add-normal_policy.patch
0012-sched_ext-Add-boilerplate-for-extensible-scheduler-c.patch
0013-sched_ext-Implement-BPF-extensible-scheduler-class.patch
0014-sched_ext-Add-scx_example_dummy-and-scx_example_qmap.patch
0015-sched_ext-Add-sysrq-S-which-disables-the-BPF-schedul.patch
0016-sched_ext-Implement-runnable-task-stall-watchdog.patch
0017-sched_ext-Allow-BPF-schedulers-to-disallow-specific-.patch
0018-sched_ext-Allow-BPF-schedulers-to-switch-all-eligibl.patch
0019-sched_ext-Implement-scx_bpf_kick_cpu-and-task-preemp.patch
0020-sched_ext-Make-watchdog-handle-ops.dispatch-looping-.patch
0021-sched_ext-Add-task-state-tracking-operations.patch
0022-sched_ext-Implement-tickless-support.patch
0023-sched_ext-Add-cgroup-support.patch
0024-sched_ext-Implement-SCX_KICK_WAIT.patch
0025-sched_ext-Implement-sched_ext_ops.cpu_acquire-releas.patch
0026-sched_ext-Implement-sched_ext_ops.cpu_online-offline.patch
0027-sched_ext-Implement-core-sched-support.patch
0028-sched_ext-Documentation-scheduler-Document-extensibl.patch
0029-sched_ext-Add-a-basic-userland-vruntime-scheduler.patch
0030-sched_ext-Add-a-rust-userspace-hybrid-example-schedu.patch

0001 : Cgroup prep.

0002-0011: Scheduler prep.

0012-0014: sched_ext core implementation and a couple example BPF scheduler.

0015-0018: Utility features including safety mechanisms and switch-all.

0020-0022: Kicking and preempting other CPUs, task state transition tracking
and tickless support. Demonstrated with an example central
scheduler which makes all scheduling decisions on one CPU.

0023-0025: cgroup support and the ability to wait for other CPUs after
kicking them. Demonstrated with an example pair scheduler which
guarantees that a hyperthread pair always executes tasks from the
same cgroup at any given time.

0026 : Add CPU hotplug callbacks.

0027 : Add core-sched support.

0028 : Add documentation.

0029-0030: Add two example schedulers. One demonstrating deferring most
scheduling decisions to userland. The other demonstrating a
hybrid approach where load balancing decisions are made by
userspace written in rust.

The patchset is also available in the following git branch:

https://github.com/htejun/sched_ext sched_ext-v2

diffstat follows.

Documentation/scheduler/index.rst | 1
Documentation/scheduler/sched-ext.rst | 224 ++
MAINTAINERS | 3
drivers/tty/sysrq.c | 1
include/asm-generic/vmlinux.lds.h | 1
include/linux/cgroup-defs.h | 8
include/linux/cgroup.h | 1
include/linux/sched.h | 5
include/linux/sched/ext.h | 678 ++++++++
include/linux/sched/task.h | 3
include/uapi/linux/sched.h | 1
init/Kconfig | 5
init/init_task.c | 12
kernel/Kconfig.preempt | 24
kernel/bpf/bpf_struct_ops_types.h | 4
kernel/cgroup/cgroup.c | 97 +
kernel/fork.c | 17
kernel/sched/build_policy.c | 5
kernel/sched/core.c | 493 +++---
kernel/sched/deadline.c | 4
kernel/sched/debug.c | 6
kernel/sched/ext.c | 4037 +++++++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/ext.h | 241 +++
kernel/sched/fair.c | 9
kernel/sched/idle.c | 2
kernel/sched/rt.c | 4
kernel/sched/sched.h | 159 +-
kernel/sched/topology.c | 4
tools/sched_ext/.gitignore | 8
tools/sched_ext/Makefile | 212 ++
tools/sched_ext/atropos/.gitignore | 3
tools/sched_ext/atropos/Cargo.toml | 28
tools/sched_ext/atropos/build.rs | 70
tools/sched_ext/atropos/rustfmt.toml | 8
tools/sched_ext/atropos/src/bpf/atropos.bpf.c | 632 +++++++
tools/sched_ext/atropos/src/bpf/atropos.h | 44
tools/sched_ext/atropos/src/main.rs | 648 ++++++++
tools/sched_ext/atropos/src/oss/atropos_sys.rs | 10
tools/sched_ext/atropos/src/oss/mod.rs | 29
tools/sched_ext/atropos/src/util.rs | 24
tools/sched_ext/gnu/stubs.h | 1
tools/sched_ext/scx_common.bpf.h | 134 +
tools/sched_ext/scx_example_central.bpf.c | 337 ++++
tools/sched_ext/scx_example_central.c | 94 +
tools/sched_ext/scx_example_dummy.bpf.c | 67
tools/sched_ext/scx_example_dummy.c | 97 +
tools/sched_ext/scx_example_pair.bpf.c | 646 ++++++++
tools/sched_ext/scx_example_pair.c | 143 +
tools/sched_ext/scx_example_pair.h | 10
tools/sched_ext/scx_example_qmap.bpf.c | 398 +++++
tools/sched_ext/scx_example_qmap.c | 106 +
tools/sched_ext/scx_example_userland.bpf.c | 275 +++
tools/sched_ext/scx_example_userland.c | 403 +++++
tools/sched_ext/scx_example_userland_common.h | 19
tools/sched_ext/user_exit_info.h | 50
55 files changed, 10314 insertions(+), 231 deletions(-)

Thanks.

--
tejun



2023-01-28 00:17:09

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 02/30] sched: Encapsulate task attribute change sequence into a helper macro

A task needs to be dequeued and put before an attribute change and then
restored afterwards. This is currently open-coded in multiple places. This
patch encapsulates the preparation and restoration sequences into
SCHED_CHANGE_BLOCK which allows the actual attribute changes to be put
inside its nested block. While the conversions are generally
straightforward, there are some subtleties:

* If a variable is specified for the flags argument, it can be modified from
inside the block body to allow using a different flags value for
re-enqueueing. This is used by rt_mutex_setprio() and
__sched_setscheduler().

* __sched_setscheduler() used to only set ENQUEUE_HEAD if the task is
queued. After the conversion, it sets the flag whether the task is queued
or not. This doesn't cause any behavioral differences and is simpler than
accessing the internal state of the helper.

* In a similar vein, sched_move_task() tests task_current() again after the
change block instead of carrying over the test result from inside the
change block.

This patch is adopted from Peter Zijlstra's draft patch linked below. The
changes are:

* Call fini explicitly from for() instead of using the __cleanup__
attribute.

* Allow the queue flag variable to be modified directly so that the user
doesn't have to poke into sched_change_guard struct. Also, in the original
patch, rt_mutex_setprio() was incorrectly updating its queue_flag instead
of cg.flags.

* Some cosmetic changes.

Signed-off-by: Tejun Heo <[email protected]>
Original-patch-by: Peter Zijlstra <[email protected]>
Link: https://lore.kernel.org/all/[email protected]/T/#u
Reviewed-by: David Vernet <[email protected]>
---
kernel/sched/core.c | 260 ++++++++++++++++++++++----------------------
1 file changed, 130 insertions(+), 130 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 25b582b6ee5f..bfc3312f305a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2094,6 +2094,76 @@ void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
dequeue_task(rq, p, flags);
}

+struct sched_change_guard {
+ struct task_struct *p;
+ struct rq *rq;
+ bool queued;
+ bool running;
+ bool done;
+};
+
+static struct sched_change_guard
+sched_change_guard_init(struct rq *rq, struct task_struct *p, int flags)
+{
+ struct sched_change_guard cg = {
+ .rq = rq,
+ .p = p,
+ .queued = task_on_rq_queued(p),
+ .running = task_current(rq, p),
+ };
+
+ if (cg.queued) {
+ /*
+ * __kthread_bind() may call this on blocked tasks without
+ * holding rq->lock through __do_set_cpus_allowed(). Assert @rq
+ * locked iff @p is queued.
+ */
+ lockdep_assert_rq_held(rq);
+ dequeue_task(rq, p, flags);
+ }
+ if (cg.running)
+ put_prev_task(rq, p);
+
+ return cg;
+}
+
+static void sched_change_guard_fini(struct sched_change_guard *cg, int flags)
+{
+ if (cg->queued)
+ enqueue_task(cg->rq, cg->p, flags | ENQUEUE_NOCLOCK);
+ if (cg->running)
+ set_next_task(cg->rq, cg->p);
+ cg->done = true;
+}
+
+/**
+ * SCHED_CHANGE_BLOCK - Nested block for task attribute updates
+ * @__rq: Runqueue the target task belongs to
+ * @__p: Target task
+ * @__flags: DEQUEUE/ENQUEUE_* flags
+ *
+ * A task may need to be dequeued and put_prev_task'd for attribute updates and
+ * set_next_task'd and re-enqueued afterwards. This helper defines a nested
+ * block which automatically handles these preparation and cleanup operations.
+ *
+ * SCHED_CHANGE_BLOCK(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK) {
+ * update_attribute(p);
+ * ...
+ * }
+ *
+ * If @__flags is a variable, the variable may be updated in the block body and
+ * the updated value will be used when re-enqueueing @p.
+ *
+ * If %DEQUEUE_NOCLOCK is specified, the caller is responsible for calling
+ * update_rq_clock() beforehand. Otherwise, the rq clock is automatically
+ * updated iff the task needs to be dequeued and re-enqueued. Only the former
+ * case guarantees that the rq clock is up-to-date inside and after the block.
+ */
+#define SCHED_CHANGE_BLOCK(__rq, __p, __flags) \
+ for (struct sched_change_guard __cg = \
+ sched_change_guard_init(__rq, __p, __flags); \
+ !__cg.done; sched_change_guard_fini(&__cg, __flags))
+
static inline int __normal_prio(int policy, int rt_prio, int nice)
{
int prio;
@@ -2552,7 +2622,6 @@ static void
__do_set_cpus_allowed(struct task_struct *p, struct affinity_context *ctx)
{
struct rq *rq = task_rq(p);
- bool queued, running;

/*
* This here violates the locking rules for affinity, since we're only
@@ -2571,26 +2640,9 @@ __do_set_cpus_allowed(struct task_struct *p, struct affinity_context *ctx)
else
lockdep_assert_held(&p->pi_lock);

- queued = task_on_rq_queued(p);
- running = task_current(rq, p);
-
- if (queued) {
- /*
- * Because __kthread_bind() calls this on blocked tasks without
- * holding rq->lock.
- */
- lockdep_assert_rq_held(rq);
- dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
+ SCHED_CHANGE_BLOCK(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK) {
+ p->sched_class->set_cpus_allowed(p, ctx);
}
- if (running)
- put_prev_task(rq, p);
-
- p->sched_class->set_cpus_allowed(p, ctx);
-
- if (queued)
- enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
- if (running)
- set_next_task(rq, p);
}

/*
@@ -6922,7 +6974,7 @@ static inline int rt_effective_prio(struct task_struct *p, int prio)
*/
void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
{
- int prio, oldprio, queued, running, queue_flag =
+ int prio, oldprio, queue_flag =
DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
const struct sched_class *prev_class;
struct rq_flags rf;
@@ -6982,49 +7034,39 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
queue_flag &= ~DEQUEUE_MOVE;

prev_class = p->sched_class;
- queued = task_on_rq_queued(p);
- running = task_current(rq, p);
- if (queued)
- dequeue_task(rq, p, queue_flag);
- if (running)
- put_prev_task(rq, p);
-
- /*
- * Boosting condition are:
- * 1. -rt task is running and holds mutex A
- * --> -dl task blocks on mutex A
- *
- * 2. -dl task is running and holds mutex A
- * --> -dl task blocks on mutex A and could preempt the
- * running task
- */
- if (dl_prio(prio)) {
- if (!dl_prio(p->normal_prio) ||
- (pi_task && dl_prio(pi_task->prio) &&
- dl_entity_preempt(&pi_task->dl, &p->dl))) {
- p->dl.pi_se = pi_task->dl.pi_se;
- queue_flag |= ENQUEUE_REPLENISH;
+ SCHED_CHANGE_BLOCK(rq, p, queue_flag) {
+ /*
+ * Boosting condition are:
+ * 1. -rt task is running and holds mutex A
+ * --> -dl task blocks on mutex A
+ *
+ * 2. -dl task is running and holds mutex A
+ * --> -dl task blocks on mutex A and could preempt the
+ * running task
+ */
+ if (dl_prio(prio)) {
+ if (!dl_prio(p->normal_prio) ||
+ (pi_task && dl_prio(pi_task->prio) &&
+ dl_entity_preempt(&pi_task->dl, &p->dl))) {
+ p->dl.pi_se = pi_task->dl.pi_se;
+ queue_flag |= ENQUEUE_REPLENISH;
+ } else {
+ p->dl.pi_se = &p->dl;
+ }
+ } else if (rt_prio(prio)) {
+ if (dl_prio(oldprio))
+ p->dl.pi_se = &p->dl;
+ if (oldprio < prio)
+ queue_flag |= ENQUEUE_HEAD;
} else {
- p->dl.pi_se = &p->dl;
+ if (dl_prio(oldprio))
+ p->dl.pi_se = &p->dl;
+ if (rt_prio(oldprio))
+ p->rt.timeout = 0;
}
- } else if (rt_prio(prio)) {
- if (dl_prio(oldprio))
- p->dl.pi_se = &p->dl;
- if (oldprio < prio)
- queue_flag |= ENQUEUE_HEAD;
- } else {
- if (dl_prio(oldprio))
- p->dl.pi_se = &p->dl;
- if (rt_prio(oldprio))
- p->rt.timeout = 0;
- }
-
- __setscheduler_prio(p, prio);

- if (queued)
- enqueue_task(rq, p, queue_flag);
- if (running)
- set_next_task(rq, p);
+ __setscheduler_prio(p, prio);
+ }

check_class_changed(rq, p, prev_class, oldprio);
out_unlock:
@@ -7046,7 +7088,6 @@ static inline int rt_effective_prio(struct task_struct *p, int prio)

void set_user_nice(struct task_struct *p, long nice)
{
- bool queued, running;
int old_prio;
struct rq_flags rf;
struct rq *rq;
@@ -7070,22 +7111,13 @@ void set_user_nice(struct task_struct *p, long nice)
p->static_prio = NICE_TO_PRIO(nice);
goto out_unlock;
}
- queued = task_on_rq_queued(p);
- running = task_current(rq, p);
- if (queued)
- dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
- if (running)
- put_prev_task(rq, p);
-
- p->static_prio = NICE_TO_PRIO(nice);
- set_load_weight(p, true);
- old_prio = p->prio;
- p->prio = effective_prio(p);

- if (queued)
- enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
- if (running)
- set_next_task(rq, p);
+ SCHED_CHANGE_BLOCK(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK) {
+ p->static_prio = NICE_TO_PRIO(nice);
+ set_load_weight(p, true);
+ old_prio = p->prio;
+ p->prio = effective_prio(p);
+ }

/*
* If the task increased its priority or is running and
@@ -7469,7 +7501,7 @@ static int __sched_setscheduler(struct task_struct *p,
bool user, bool pi)
{
int oldpolicy = -1, policy = attr->sched_policy;
- int retval, oldprio, newprio, queued, running;
+ int retval, oldprio, newprio;
const struct sched_class *prev_class;
struct balance_callback *head;
struct rq_flags rf;
@@ -7634,33 +7666,22 @@ static int __sched_setscheduler(struct task_struct *p,
queue_flags &= ~DEQUEUE_MOVE;
}

- queued = task_on_rq_queued(p);
- running = task_current(rq, p);
- if (queued)
- dequeue_task(rq, p, queue_flags);
- if (running)
- put_prev_task(rq, p);
-
- prev_class = p->sched_class;
+ SCHED_CHANGE_BLOCK(rq, p, queue_flags) {
+ prev_class = p->sched_class;

- if (!(attr->sched_flags & SCHED_FLAG_KEEP_PARAMS)) {
- __setscheduler_params(p, attr);
- __setscheduler_prio(p, newprio);
- }
- __setscheduler_uclamp(p, attr);
+ if (!(attr->sched_flags & SCHED_FLAG_KEEP_PARAMS)) {
+ __setscheduler_params(p, attr);
+ __setscheduler_prio(p, newprio);
+ }
+ __setscheduler_uclamp(p, attr);

- if (queued) {
/*
* We enqueue to tail when the priority of a task is
* increased (user space view).
*/
if (oldprio < p->prio)
queue_flags |= ENQUEUE_HEAD;
-
- enqueue_task(rq, p, queue_flags);
}
- if (running)
- set_next_task(rq, p);

check_class_changed(rq, p, prev_class, oldprio);

@@ -9177,25 +9198,15 @@ int migrate_task_to(struct task_struct *p, int target_cpu)
*/
void sched_setnuma(struct task_struct *p, int nid)
{
- bool queued, running;
struct rq_flags rf;
struct rq *rq;

rq = task_rq_lock(p, &rf);
- queued = task_on_rq_queued(p);
- running = task_current(rq, p);

- if (queued)
- dequeue_task(rq, p, DEQUEUE_SAVE);
- if (running)
- put_prev_task(rq, p);
-
- p->numa_preferred_nid = nid;
+ SCHED_CHANGE_BLOCK(rq, p, DEQUEUE_SAVE) {
+ p->numa_preferred_nid = nid;
+ }

- if (queued)
- enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
- if (running)
- set_next_task(rq, p);
task_rq_unlock(rq, p, &rf);
}
#endif /* CONFIG_NUMA_BALANCING */
@@ -10287,35 +10298,24 @@ static void sched_change_group(struct task_struct *tsk)
*/
void sched_move_task(struct task_struct *tsk)
{
- int queued, running, queue_flags =
- DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
struct rq_flags rf;
struct rq *rq;

rq = task_rq_lock(tsk, &rf);
update_rq_clock(rq);

- running = task_current(rq, tsk);
- queued = task_on_rq_queued(tsk);
-
- if (queued)
- dequeue_task(rq, tsk, queue_flags);
- if (running)
- put_prev_task(rq, tsk);
-
- sched_change_group(tsk);
+ SCHED_CHANGE_BLOCK(rq, tsk,
+ DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK) {
+ sched_change_group(tsk);
+ }

- if (queued)
- enqueue_task(rq, tsk, queue_flags);
- if (running) {
- set_next_task(rq, tsk);
- /*
- * After changing group, the running task may have joined a
- * throttled one but it's still the running task. Trigger a
- * resched to make sure that task can still run.
- */
+ /*
+ * After changing group, the running task may have joined a throttled
+ * one but it's still the running task. Trigger a resched to make sure
+ * that task can still run.
+ */
+ if (task_current(rq, tsk))
resched_curr(rq);
- }

task_rq_unlock(rq, tsk, &rf);
}
--
2.39.1


2023-01-28 00:17:15

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 04/30] sched: Allow sched_cgroup_fork() to fail and introduce sched_cancel_fork()

A new BPF extensible sched_class will need more control over the forking
process. It wants to be able to fail from sched_cgroup_fork() after the new
task's sched_task_group is initialized so that the loaded BPF program can
prepare the task with its cgroup association is established and reject fork
if e.g. allocation fails.

Allow sched_cgroup_fork() to fail by making it return int instead of void
and adding sched_cancel_fork() to undo sched_fork() in the error path.

sched_cgroup_fork() doesn't fail yet and this patch shouldn't cause any
behavior changes.

v2: Patch description updated to detail the expected use.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
include/linux/sched/task.h | 3 ++-
kernel/fork.c | 15 ++++++++++-----
kernel/sched/core.c | 8 +++++++-
3 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 357e0068497c..dcff721170c3 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -58,7 +58,8 @@ extern asmlinkage void schedule_tail(struct task_struct *prev);
extern void init_idle(struct task_struct *idle, int cpu);

extern int sched_fork(unsigned long clone_flags, struct task_struct *p);
-extern void sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs);
+extern int sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs);
+extern void sched_cancel_fork(struct task_struct *p);
extern void sched_post_fork(struct task_struct *p);
extern void sched_dead(struct task_struct *p);

diff --git a/kernel/fork.c b/kernel/fork.c
index 9f7fe3541897..184c622b5513 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2239,7 +2239,7 @@ static __latent_entropy struct task_struct *copy_process(

retval = perf_event_init_task(p, clone_flags);
if (retval)
- goto bad_fork_cleanup_policy;
+ goto bad_fork_sched_cancel_fork;
retval = audit_alloc(p);
if (retval)
goto bad_fork_cleanup_perf;
@@ -2380,7 +2380,9 @@ static __latent_entropy struct task_struct *copy_process(
* cgroup specific, it unconditionally needs to place the task on a
* runqueue.
*/
- sched_cgroup_fork(p, args);
+ retval = sched_cgroup_fork(p, args);
+ if (retval)
+ goto bad_fork_cancel_cgroup;

/*
* From this point on we must avoid any synchronous user-space
@@ -2426,13 +2428,13 @@ static __latent_entropy struct task_struct *copy_process(
/* Don't start children in a dying pid namespace */
if (unlikely(!(ns_of_pid(pid)->pid_allocated & PIDNS_ADDING))) {
retval = -ENOMEM;
- goto bad_fork_cancel_cgroup;
+ goto bad_fork_core_free;
}

/* Let kill terminate clone/fork in the middle */
if (fatal_signal_pending(current)) {
retval = -EINTR;
- goto bad_fork_cancel_cgroup;
+ goto bad_fork_core_free;
}

/* No more failure paths after this point. */
@@ -2507,10 +2509,11 @@ static __latent_entropy struct task_struct *copy_process(

return p;

-bad_fork_cancel_cgroup:
+bad_fork_core_free:
sched_core_free(p);
spin_unlock(&current->sighand->siglock);
write_unlock_irq(&tasklist_lock);
+bad_fork_cancel_cgroup:
cgroup_cancel_fork(p, args);
bad_fork_put_pidfd:
if (clone_flags & CLONE_PIDFD) {
@@ -2549,6 +2552,8 @@ static __latent_entropy struct task_struct *copy_process(
audit_free(p);
bad_fork_cleanup_perf:
perf_event_free_task(p);
+bad_fork_sched_cancel_fork:
+ sched_cancel_fork(p);
bad_fork_cleanup_policy:
lockdep_free_task(p);
#ifdef CONFIG_NUMA
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 65f21f8bf738..49b3d8ce84ca 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4709,7 +4709,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
return 0;
}

-void sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs)
+int sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs)
{
unsigned long flags;

@@ -4736,6 +4736,12 @@ void sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs)
if (p->sched_class->task_fork)
p->sched_class->task_fork(p);
raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+
+ return 0;
+}
+
+void sched_cancel_fork(struct task_struct *p)
+{
}

void sched_post_fork(struct task_struct *p)
--
2.39.1


2023-01-28 00:17:12

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 03/30] sched: Restructure sched_class order sanity checks in sched_init()

Currently, sched_init() checks that the sched_class'es are in the expected
order by testing each adjacency which is a bit brittle and makes it
cumbersome to add optional sched_class'es. Instead, let's verify whether
they're in the expected order using sched_class_above() which is what
matters.

Signed-off-by: Tejun Heo <[email protected]>
Suggested-by: Peter Zijlstra <[email protected]>
Reviewed-by: David Vernet <[email protected]>
---
kernel/sched/core.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bfc3312f305a..65f21f8bf738 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9721,12 +9721,12 @@ void __init sched_init(void)
int i;

/* Make sure the linker didn't screw up */
- BUG_ON(&idle_sched_class != &fair_sched_class + 1 ||
- &fair_sched_class != &rt_sched_class + 1 ||
- &rt_sched_class != &dl_sched_class + 1);
#ifdef CONFIG_SMP
- BUG_ON(&dl_sched_class != &stop_sched_class + 1);
+ BUG_ON(!sched_class_above(&stop_sched_class, &dl_sched_class));
#endif
+ BUG_ON(!sched_class_above(&dl_sched_class, &rt_sched_class));
+ BUG_ON(!sched_class_above(&rt_sched_class, &fair_sched_class));
+ BUG_ON(!sched_class_above(&fair_sched_class, &idle_sched_class));

wait_bit_init();

--
2.39.1


2023-01-28 00:17:18

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 05/30] sched: Add sched_class->reweight_task()

Currently, during a task weight change, sched core directly calls
reweight_task() defined in fair.c if @p is on CFS. Let's make it a proper
sched_class operation instead. CFS's reweight_task() is renamed to
reweight_task_fair() and now called through sched_class.

While it turns a direct call into an indirect one, set_load_weight() isn't
called from a hot path and this change shouldn't cause any noticeable
difference. This will be used to implement reweight_task for a new BPF
extensible sched_class so that it can keep its cached task weight
up-to-date.

This will be used by a new sched_class to track weight changes.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
kernel/sched/core.c | 4 ++--
kernel/sched/fair.c | 3 ++-
kernel/sched/sched.h | 4 ++--
3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 49b3d8ce84ca..ce27ed857975 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1275,8 +1275,8 @@ static void set_load_weight(struct task_struct *p, bool update_load)
* SCHED_OTHER tasks have to update their load when changing their
* weight
*/
- if (update_load && p->sched_class == &fair_sched_class) {
- reweight_task(p, prio);
+ if (update_load && p->sched_class->reweight_task) {
+ p->sched_class->reweight_task(task_rq(p), p, prio);
} else {
load->weight = scale_load(sched_prio_to_weight[prio]);
load->inv_weight = sched_prio_to_wmult[prio];
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c36aa54ae071..fbedf99ed953 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3342,7 +3342,7 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,

}

-void reweight_task(struct task_struct *p, int prio)
+static void reweight_task_fair(struct rq *rq, struct task_struct *p, int prio)
{
struct sched_entity *se = &p->se;
struct cfs_rq *cfs_rq = cfs_rq_of(se);
@@ -12413,6 +12413,7 @@ DEFINE_SCHED_CLASS(fair) = {
.task_tick = task_tick_fair,
.task_fork = task_fork_fair,

+ .reweight_task = reweight_task_fair,
.prio_changed = prio_changed_fair,
.switched_from = switched_from_fair,
.switched_to = switched_to_fair,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 771f8ddb7053..070603f00470 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2208,6 +2208,8 @@ struct sched_class {
*/
void (*switched_from)(struct rq *this_rq, struct task_struct *task);
void (*switched_to) (struct rq *this_rq, struct task_struct *task);
+ void (*reweight_task)(struct rq *this_rq, struct task_struct *task,
+ int newprio);
void (*prio_changed) (struct rq *this_rq, struct task_struct *task,
int oldprio);

@@ -2360,8 +2362,6 @@ extern void init_sched_dl_class(void);
extern void init_sched_rt_class(void);
extern void init_sched_fair_class(void);

-extern void reweight_task(struct task_struct *p, int prio);
-
extern void resched_curr(struct rq *rq);
extern void resched_cpu(int cpu);

--
2.39.1


2023-01-28 00:17:21

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 06/30] sched: Add sched_class->switching_to() and expose check_class_changing/changed()

When a task switches to a new sched_class, the prev and new classes are
notified through ->switched_from() and ->switched_to(), respectively, after
the switching is done.

A new BPF extensible sched_class will have callbacks that allow the BPF
scheduler to keep track of relevant task states (like priority and cpumask).
Those callbacks aren't called while a task is on a different sched_class.
When a task comes back, we wanna tell the BPF progs the up-to-date state
before the task gets enqueued, so we need a hook which is called before the
switching is committed.

This patch adds ->switching_to() which is called during sched_class switch
through check_class_changing() before the task is restored. Also, this patch
exposes check_class_changing/changed() in kernel/sched/sched.h. They will be
used by the new BPF extensible sched_class to implement implicit sched_class
switching which is used e.g. when falling back to CFS when the BPF scheduler
fails or unloads.

This is a prep patch and doesn't cause any behavior changes. The new
operation and exposed functions aren't used yet.

v2: Improve patch description w/ details on planned use.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
kernel/sched/core.c | 20 +++++++++++++++++---
kernel/sched/sched.h | 7 +++++++
2 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ce27ed857975..e6d5374edf58 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2221,6 +2221,17 @@ inline int task_curr(const struct task_struct *p)
return cpu_curr(task_cpu(p)) == p;
}

+/*
+ * ->switching_to() is called with the pi_lock and rq_lock held and must not
+ * mess with locking.
+ */
+void check_class_changing(struct rq *rq, struct task_struct *p,
+ const struct sched_class *prev_class)
+{
+ if (prev_class != p->sched_class && p->sched_class->switching_to)
+ p->sched_class->switching_to(rq, p);
+}
+
/*
* switched_from, switched_to and prio_changed must _NOT_ drop rq->lock,
* use the balance_callback list if you want balancing.
@@ -2228,9 +2239,9 @@ inline int task_curr(const struct task_struct *p)
* this means any call to check_class_changed() must be followed by a call to
* balance_callback().
*/
-static inline void check_class_changed(struct rq *rq, struct task_struct *p,
- const struct sched_class *prev_class,
- int oldprio)
+void check_class_changed(struct rq *rq, struct task_struct *p,
+ const struct sched_class *prev_class,
+ int oldprio)
{
if (prev_class != p->sched_class) {
if (prev_class->switched_from)
@@ -7072,6 +7083,7 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
}

__setscheduler_prio(p, prio);
+ check_class_changing(rq, p, prev_class);
}

check_class_changed(rq, p, prev_class, oldprio);
@@ -7681,6 +7693,8 @@ static int __sched_setscheduler(struct task_struct *p,
}
__setscheduler_uclamp(p, attr);

+ check_class_changing(rq, p, prev_class);
+
/*
* We enqueue to tail when the priority of a task is
* increased (user space view).
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 070603f00470..c083395c5477 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2206,6 +2206,7 @@ struct sched_class {
* cannot assume the switched_from/switched_to pair is serialized by
* rq->lock. They are however serialized by p->pi_lock.
*/
+ void (*switching_to) (struct rq *this_rq, struct task_struct *task);
void (*switched_from)(struct rq *this_rq, struct task_struct *task);
void (*switched_to) (struct rq *this_rq, struct task_struct *task);
void (*reweight_task)(struct rq *this_rq, struct task_struct *task,
@@ -2442,6 +2443,12 @@ static inline void sub_nr_running(struct rq *rq, unsigned count)
extern void activate_task(struct rq *rq, struct task_struct *p, int flags);
extern void deactivate_task(struct rq *rq, struct task_struct *p, int flags);

+extern void check_class_changing(struct rq *rq, struct task_struct *p,
+ const struct sched_class *prev_class);
+extern void check_class_changed(struct rq *rq, struct task_struct *p,
+ const struct sched_class *prev_class,
+ int oldprio);
+
extern void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags);

#ifdef CONFIG_PREEMPT_RT
--
2.39.1


2023-01-28 00:17:41

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 07/30] sched: Factor out cgroup weight conversion functions

Factor out sched_weight_from/to_cgroup() which convert between scheduler
shares and cgroup weight. No functional change. The factored out functions
will be used by a new BPF extensible sched_class so that the weights can be
exposed to the BPF programs in a way which is consistent cgroup weights and
easier to interpret.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
kernel/sched/core.c | 28 +++++++++++++---------------
kernel/sched/sched.h | 18 ++++++++++++++++++
2 files changed, 31 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e6d5374edf58..70698725f6de 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11072,29 +11072,27 @@ static int cpu_extra_stat_show(struct seq_file *sf,
}

#ifdef CONFIG_FAIR_GROUP_SCHED
+
+static unsigned long tg_weight(struct task_group *tg)
+{
+ return scale_load_down(tg->shares);
+}
+
static u64 cpu_weight_read_u64(struct cgroup_subsys_state *css,
struct cftype *cft)
{
- struct task_group *tg = css_tg(css);
- u64 weight = scale_load_down(tg->shares);
-
- return DIV_ROUND_CLOSEST_ULL(weight * CGROUP_WEIGHT_DFL, 1024);
+ return sched_weight_to_cgroup(tg_weight(css_tg(css)));
}

static int cpu_weight_write_u64(struct cgroup_subsys_state *css,
- struct cftype *cft, u64 weight)
+ struct cftype *cft, u64 cgrp_weight)
{
- /*
- * cgroup weight knobs should use the common MIN, DFL and MAX
- * values which are 1, 100 and 10000 respectively. While it loses
- * a bit of range on both ends, it maps pretty well onto the shares
- * value used by scheduler and the round-trip conversions preserve
- * the original value over the entire range.
- */
- if (weight < CGROUP_WEIGHT_MIN || weight > CGROUP_WEIGHT_MAX)
+ unsigned long weight;
+
+ if (cgrp_weight < CGROUP_WEIGHT_MIN || cgrp_weight > CGROUP_WEIGHT_MAX)
return -ERANGE;

- weight = DIV_ROUND_CLOSEST_ULL(weight * 1024, CGROUP_WEIGHT_DFL);
+ weight = sched_weight_from_cgroup(cgrp_weight);

return sched_group_set_shares(css_tg(css), scale_load(weight));
}
@@ -11102,7 +11100,7 @@ static int cpu_weight_write_u64(struct cgroup_subsys_state *css,
static s64 cpu_weight_nice_read_s64(struct cgroup_subsys_state *css,
struct cftype *cft)
{
- unsigned long weight = scale_load_down(css_tg(css)->shares);
+ unsigned long weight = tg_weight(css_tg(css));
int last_delta = INT_MAX;
int prio, delta;

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c083395c5477..946fdb51b6e6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -435,6 +435,24 @@ struct task_group {
#define MAX_SHARES (1UL << 18)
#endif

+/*
+ * cgroup weight knobs should use the common MIN, DFL and MAX values which are
+ * 1, 100 and 10000 respectively. While it loses a bit of range on both ends, it
+ * maps pretty well onto the shares value used by scheduler and the round-trip
+ * conversions preserve the original value over the entire range.
+ */
+static inline unsigned long sched_weight_from_cgroup(unsigned long cgrp_weight)
+{
+ return DIV_ROUND_CLOSEST_ULL(cgrp_weight * 1024, CGROUP_WEIGHT_DFL);
+}
+
+static inline unsigned long sched_weight_to_cgroup(unsigned long weight)
+{
+ return clamp_t(unsigned long,
+ DIV_ROUND_CLOSEST_ULL(weight * CGROUP_WEIGHT_DFL, 1024),
+ CGROUP_WEIGHT_MIN, CGROUP_WEIGHT_MAX);
+}
+
typedef int (*tg_visitor)(struct task_group *, void *);

extern int walk_tg_tree_from(struct task_group *from,
--
2.39.1


2023-01-28 00:17:44

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 08/30] sched: Expose css_tg(), __setscheduler_prio() and SCHED_CHANGE_BLOCK()

These will be used by a new BPF extensible sched_class.

css_tg() will be used in the init and exit paths to visit all task_groups by
walking cgroups.

__setscheduler_prio() is used to pick the sched_class matching the current
prio of the task. For the new BPF extensible sched_class, the mapping from
the task configuration to sched_class isn't static and depends on a few
factors - e.g. whether the BPF progs implementing the scheduler are loaded
and in a serviceable state. That mapping logic will be added to
__setscheduler_prio().

When the BPF scheduler progs get loaded and unloaded, the mapping changes
and the new sched_class will walk the tasks applying the new mapping using
SCHED_CHANGE_BLOCK() and __setscheduler_prio().

v2: Expose SCHED_CHANGE_BLOCK() too and update the description.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
kernel/sched/core.c | 47 +++----------------------------------------
kernel/sched/sched.h | 48 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 51 insertions(+), 44 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 70698725f6de..1fd4e2cde35c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2094,15 +2094,7 @@ void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
dequeue_task(rq, p, flags);
}

-struct sched_change_guard {
- struct task_struct *p;
- struct rq *rq;
- bool queued;
- bool running;
- bool done;
-};
-
-static struct sched_change_guard
+struct sched_change_guard
sched_change_guard_init(struct rq *rq, struct task_struct *p, int flags)
{
struct sched_change_guard cg = {
@@ -2127,7 +2119,7 @@ sched_change_guard_init(struct rq *rq, struct task_struct *p, int flags)
return cg;
}

-static void sched_change_guard_fini(struct sched_change_guard *cg, int flags)
+void sched_change_guard_fini(struct sched_change_guard *cg, int flags)
{
if (cg->queued)
enqueue_task(cg->rq, cg->p, flags | ENQUEUE_NOCLOCK);
@@ -2136,34 +2128,6 @@ static void sched_change_guard_fini(struct sched_change_guard *cg, int flags)
cg->done = true;
}

-/**
- * SCHED_CHANGE_BLOCK - Nested block for task attribute updates
- * @__rq: Runqueue the target task belongs to
- * @__p: Target task
- * @__flags: DEQUEUE/ENQUEUE_* flags
- *
- * A task may need to be dequeued and put_prev_task'd for attribute updates and
- * set_next_task'd and re-enqueued afterwards. This helper defines a nested
- * block which automatically handles these preparation and cleanup operations.
- *
- * SCHED_CHANGE_BLOCK(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK) {
- * update_attribute(p);
- * ...
- * }
- *
- * If @__flags is a variable, the variable may be updated in the block body and
- * the updated value will be used when re-enqueueing @p.
- *
- * If %DEQUEUE_NOCLOCK is specified, the caller is responsible for calling
- * update_rq_clock() beforehand. Otherwise, the rq clock is automatically
- * updated iff the task needs to be dequeued and re-enqueued. Only the former
- * case guarantees that the rq clock is up-to-date inside and after the block.
- */
-#define SCHED_CHANGE_BLOCK(__rq, __p, __flags) \
- for (struct sched_change_guard __cg = \
- sched_change_guard_init(__rq, __p, __flags); \
- !__cg.done; sched_change_guard_fini(&__cg, __flags))
-
static inline int __normal_prio(int policy, int rt_prio, int nice)
{
int prio;
@@ -6949,7 +6913,7 @@ int default_wake_function(wait_queue_entry_t *curr, unsigned mode, int wake_flag
}
EXPORT_SYMBOL(default_wake_function);

-static void __setscheduler_prio(struct task_struct *p, int prio)
+void __setscheduler_prio(struct task_struct *p, int prio)
{
if (dl_prio(prio))
p->sched_class = &dl_sched_class;
@@ -10340,11 +10304,6 @@ void sched_move_task(struct task_struct *tsk)
task_rq_unlock(rq, tsk, &rf);
}

-static inline struct task_group *css_tg(struct cgroup_subsys_state *css)
-{
- return css ? container_of(css, struct task_group, css) : NULL;
-}
-
static struct cgroup_subsys_state *
cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
{
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 946fdb51b6e6..1927adc6c4bb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -469,6 +469,11 @@ static inline int walk_tg_tree(tg_visitor down, tg_visitor up, void *data)
return walk_tg_tree_from(&root_task_group, down, up, data);
}

+static inline struct task_group *css_tg(struct cgroup_subsys_state *css)
+{
+ return css ? container_of(css, struct task_group, css) : NULL;
+}
+
extern int tg_nop(struct task_group *tg, void *data);

extern void free_fair_sched_group(struct task_group *tg);
@@ -493,6 +498,8 @@ extern long sched_group_rt_runtime(struct task_group *tg);
extern long sched_group_rt_period(struct task_group *tg);
extern int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk);

+extern void __setscheduler_prio(struct task_struct *p, int prio);
+
extern struct task_group *sched_create_group(struct task_group *parent);
extern void sched_online_group(struct task_group *tg,
struct task_group *parent);
@@ -2461,6 +2468,47 @@ static inline void sub_nr_running(struct rq *rq, unsigned count)
extern void activate_task(struct rq *rq, struct task_struct *p, int flags);
extern void deactivate_task(struct rq *rq, struct task_struct *p, int flags);

+struct sched_change_guard {
+ struct task_struct *p;
+ struct rq *rq;
+ bool queued;
+ bool running;
+ bool done;
+};
+
+extern struct sched_change_guard
+sched_change_guard_init(struct rq *rq, struct task_struct *p, int flags);
+
+extern void sched_change_guard_fini(struct sched_change_guard *cg, int flags);
+
+/**
+ * SCHED_CHANGE_BLOCK - Nested block for task attribute updates
+ * @__rq: Runqueue the target task belongs to
+ * @__p: Target task
+ * @__flags: DEQUEUE/ENQUEUE_* flags
+ *
+ * A task may need to be dequeued and put_prev_task'd for attribute updates and
+ * set_next_task'd and re-enqueued afterwards. This helper defines a nested
+ * block which automatically handles these preparation and cleanup operations.
+ *
+ * SCHED_CHANGE_BLOCK(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK) {
+ * update_attribute(p);
+ * ...
+ * }
+ *
+ * If @__flags is a variable, the variable may be updated in the block body and
+ * the updated value will be used when re-enqueueing @p.
+ *
+ * If %DEQUEUE_NOCLOCK is specified, the caller is responsible for calling
+ * update_rq_clock() beforehand. Otherwise, the rq clock is automatically
+ * updated iff the task needs to be dequeued and re-enqueued. Only the former
+ * case guarantees that the rq clock is up-to-date inside and after the block.
+ */
+#define SCHED_CHANGE_BLOCK(__rq, __p, __flags) \
+ for (struct sched_change_guard __cg = \
+ sched_change_guard_init(__rq, __p, __flags); \
+ !__cg.done; sched_change_guard_fini(&__cg, __flags))
+
extern void check_class_changing(struct rq *rq, struct task_struct *p,
const struct sched_class *prev_class);
extern void check_class_changed(struct rq *rq, struct task_struct *p,
--
2.39.1


2023-01-28 00:17:58

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 09/30] sched: Enumerate CPU cgroup file types

Rename cpu[_legacy]_files to cpu[_legacy]_cftypes for clarity and add
cpu_cftype_id which enumerates every cgroup2 interface file type. This
doesn't make any functional difference now. The enums will be used to access
specific cftypes by a new BPF extensible sched_class to selectively show and
hide CPU controller interface files depending on the capability of the
currently loaded BPF scheduler progs.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
kernel/sched/core.c | 22 +++++++++++-----------
kernel/sched/sched.h | 21 +++++++++++++++++++++
2 files changed, 32 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1fd4e2cde35c..729de63ee6ec 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10941,7 +10941,7 @@ static int cpu_idle_write_s64(struct cgroup_subsys_state *css,
}
#endif

-static struct cftype cpu_legacy_files[] = {
+static struct cftype cpu_legacy_cftypes[] = {
#ifdef CONFIG_FAIR_GROUP_SCHED
{
.name = "shares",
@@ -11148,21 +11148,21 @@ static ssize_t cpu_max_write(struct kernfs_open_file *of,
}
#endif

-static struct cftype cpu_files[] = {
+struct cftype cpu_cftypes[CPU_CFTYPE_CNT + 1] = {
#ifdef CONFIG_FAIR_GROUP_SCHED
- {
+ [CPU_CFTYPE_WEIGHT] = {
.name = "weight",
.flags = CFTYPE_NOT_ON_ROOT,
.read_u64 = cpu_weight_read_u64,
.write_u64 = cpu_weight_write_u64,
},
- {
+ [CPU_CFTYPE_WEIGHT_NICE] = {
.name = "weight.nice",
.flags = CFTYPE_NOT_ON_ROOT,
.read_s64 = cpu_weight_nice_read_s64,
.write_s64 = cpu_weight_nice_write_s64,
},
- {
+ [CPU_CFTYPE_IDLE] = {
.name = "idle",
.flags = CFTYPE_NOT_ON_ROOT,
.read_s64 = cpu_idle_read_s64,
@@ -11170,13 +11170,13 @@ static struct cftype cpu_files[] = {
},
#endif
#ifdef CONFIG_CFS_BANDWIDTH
- {
+ [CPU_CFTYPE_MAX] = {
.name = "max",
.flags = CFTYPE_NOT_ON_ROOT,
.seq_show = cpu_max_show,
.write = cpu_max_write,
},
- {
+ [CPU_CFTYPE_MAX_BURST] = {
.name = "max.burst",
.flags = CFTYPE_NOT_ON_ROOT,
.read_u64 = cpu_cfs_burst_read_u64,
@@ -11184,13 +11184,13 @@ static struct cftype cpu_files[] = {
},
#endif
#ifdef CONFIG_UCLAMP_TASK_GROUP
- {
+ [CPU_CFTYPE_UCLAMP_MIN] = {
.name = "uclamp.min",
.flags = CFTYPE_NOT_ON_ROOT,
.seq_show = cpu_uclamp_min_show,
.write = cpu_uclamp_min_write,
},
- {
+ [CPU_CFTYPE_UCLAMP_MAX] = {
.name = "uclamp.max",
.flags = CFTYPE_NOT_ON_ROOT,
.seq_show = cpu_uclamp_max_show,
@@ -11210,8 +11210,8 @@ struct cgroup_subsys cpu_cgrp_subsys = {
.can_attach = cpu_cgroup_can_attach,
#endif
.attach = cpu_cgroup_attach,
- .legacy_cftypes = cpu_legacy_files,
- .dfl_cftypes = cpu_files,
+ .legacy_cftypes = cpu_legacy_cftypes,
+ .dfl_cftypes = cpu_cftypes,
.early_init = true,
.threaded = true,
};
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1927adc6c4bb..dd567eb7881a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3334,4 +3334,25 @@ static inline void update_current_exec_runtime(struct task_struct *curr,
cgroup_account_cputime(curr, delta_exec);
}

+#ifdef CONFIG_CGROUP_SCHED
+enum cpu_cftype_id {
+#ifdef CONFIG_FAIR_GROUP_SCHED
+ CPU_CFTYPE_WEIGHT,
+ CPU_CFTYPE_WEIGHT_NICE,
+ CPU_CFTYPE_IDLE,
+#endif
+#ifdef CONFIG_CFS_BANDWIDTH
+ CPU_CFTYPE_MAX,
+ CPU_CFTYPE_MAX_BURST,
+#endif
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ CPU_CFTYPE_UCLAMP_MIN,
+ CPU_CFTYPE_UCLAMP_MAX,
+#endif
+ CPU_CFTYPE_CNT,
+};
+
+extern struct cftype cpu_cftypes[CPU_CFTYPE_CNT + 1];
+#endif /* CONFIG_CGROUP_SCHED */
+
#endif /* _KERNEL_SCHED_SCHED_H */
--
2.39.1


2023-01-28 00:18:02

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 10/30] sched: Add @reason to sched_class->rq_{on|off}line()

->rq_{on|off}line are called either during CPU hotplug or cpuset partition
updates. A planned BPF extensible sched_class wants to tell the BPF
scheduler progs about CPU hotplug events in a way that's synchronized with
rq state changes.

As the BPF scheduler progs aren't necessarily affected by cpuset partition
updates, we need a way to distinguish the two types of events. Let's add an
argument to tell them apart.

v2: Patch description updated to detail the expected use.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
kernel/sched/core.c | 12 ++++++------
kernel/sched/deadline.c | 4 ++--
kernel/sched/fair.c | 4 ++--
kernel/sched/rt.c | 4 ++--
kernel/sched/sched.h | 13 +++++++++----
kernel/sched/topology.c | 4 ++--
6 files changed, 23 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 729de63ee6ec..6447f10ecd44 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9355,7 +9355,7 @@ static inline void balance_hotplug_wait(void)

#endif /* CONFIG_HOTPLUG_CPU */

-void set_rq_online(struct rq *rq)
+void set_rq_online(struct rq *rq, enum rq_onoff_reason reason)
{
if (!rq->online) {
const struct sched_class *class;
@@ -9365,19 +9365,19 @@ void set_rq_online(struct rq *rq)

for_each_class(class) {
if (class->rq_online)
- class->rq_online(rq);
+ class->rq_online(rq, reason);
}
}
}

-void set_rq_offline(struct rq *rq)
+void set_rq_offline(struct rq *rq, enum rq_onoff_reason reason)
{
if (rq->online) {
const struct sched_class *class;

for_each_class(class) {
if (class->rq_offline)
- class->rq_offline(rq);
+ class->rq_offline(rq, reason);
}

cpumask_clear_cpu(rq->cpu, rq->rd->online);
@@ -9473,7 +9473,7 @@ int sched_cpu_activate(unsigned int cpu)
rq_lock_irqsave(rq, &rf);
if (rq->rd) {
BUG_ON(!cpumask_test_cpu(cpu, rq->rd->span));
- set_rq_online(rq);
+ set_rq_online(rq, RQ_ONOFF_HOTPLUG);
}
rq_unlock_irqrestore(rq, &rf);

@@ -9518,7 +9518,7 @@ int sched_cpu_deactivate(unsigned int cpu)
if (rq->rd) {
update_rq_clock(rq);
BUG_ON(!cpumask_test_cpu(cpu, rq->rd->span));
- set_rq_offline(rq);
+ set_rq_offline(rq, RQ_ONOFF_HOTPLUG);
}
rq_unlock_irqrestore(rq, &rf);

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 0d97d54276cc..f264814e0513 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2518,7 +2518,7 @@ static void set_cpus_allowed_dl(struct task_struct *p,
}

/* Assumes rq->lock is held */
-static void rq_online_dl(struct rq *rq)
+static void rq_online_dl(struct rq *rq, enum rq_onoff_reason reason)
{
if (rq->dl.overloaded)
dl_set_overload(rq);
@@ -2529,7 +2529,7 @@ static void rq_online_dl(struct rq *rq)
}

/* Assumes rq->lock is held */
-static void rq_offline_dl(struct rq *rq)
+static void rq_offline_dl(struct rq *rq, enum rq_onoff_reason reason)
{
if (rq->dl.overloaded)
dl_clear_overload(rq);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fbedf99ed953..b4db8943ed8b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11673,14 +11673,14 @@ void trigger_load_balance(struct rq *rq)
nohz_balancer_kick(rq);
}

-static void rq_online_fair(struct rq *rq)
+static void rq_online_fair(struct rq *rq, enum rq_onoff_reason reason)
{
update_sysctl();

update_runtime_enabled(rq);
}

-static void rq_offline_fair(struct rq *rq)
+static void rq_offline_fair(struct rq *rq, enum rq_onoff_reason reason)
{
update_sysctl();

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index ed2a47e4ddae..0fb7ee087669 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2470,7 +2470,7 @@ static void task_woken_rt(struct rq *rq, struct task_struct *p)
}

/* Assumes rq->lock is held */
-static void rq_online_rt(struct rq *rq)
+static void rq_online_rt(struct rq *rq, enum rq_onoff_reason reason)
{
if (rq->rt.overloaded)
rt_set_overload(rq);
@@ -2481,7 +2481,7 @@ static void rq_online_rt(struct rq *rq)
}

/* Assumes rq->lock is held */
-static void rq_offline_rt(struct rq *rq)
+static void rq_offline_rt(struct rq *rq, enum rq_onoff_reason reason)
{
if (rq->rt.overloaded)
rt_clear_overload(rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index dd567eb7881a..cc1163b15aa0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2180,6 +2180,11 @@ extern const u32 sched_prio_to_wmult[40];

#define RETRY_TASK ((void *)-1UL)

+enum rq_onoff_reason {
+ RQ_ONOFF_HOTPLUG, /* CPU is going on/offline */
+ RQ_ONOFF_TOPOLOGY, /* sched domain topology update */
+};
+
struct affinity_context {
const struct cpumask *new_mask;
struct cpumask *user_mask;
@@ -2216,8 +2221,8 @@ struct sched_class {

void (*set_cpus_allowed)(struct task_struct *p, struct affinity_context *ctx);

- void (*rq_online)(struct rq *rq);
- void (*rq_offline)(struct rq *rq);
+ void (*rq_online)(struct rq *rq, enum rq_onoff_reason reason);
+ void (*rq_offline)(struct rq *rq, enum rq_onoff_reason reason);

struct rq *(*find_lock_rq)(struct task_struct *p, struct rq *rq);
#endif
@@ -2782,8 +2787,8 @@ static inline void double_rq_unlock(struct rq *rq1, struct rq *rq2)
raw_spin_rq_unlock(rq1);
}

-extern void set_rq_online (struct rq *rq);
-extern void set_rq_offline(struct rq *rq);
+extern void set_rq_online (struct rq *rq, enum rq_onoff_reason reason);
+extern void set_rq_offline(struct rq *rq, enum rq_onoff_reason reason);
extern bool sched_smp_initialized;

#else /* CONFIG_SMP */
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8739c2a5a54e..0e859bea1cb6 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -493,7 +493,7 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
old_rd = rq->rd;

if (cpumask_test_cpu(rq->cpu, old_rd->online))
- set_rq_offline(rq);
+ set_rq_offline(rq, RQ_ONOFF_TOPOLOGY);

cpumask_clear_cpu(rq->cpu, old_rd->span);

@@ -511,7 +511,7 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)

cpumask_set_cpu(rq->cpu, rd->span);
if (cpumask_test_cpu(rq->cpu, cpu_active_mask))
- set_rq_online(rq);
+ set_rq_online(rq, RQ_ONOFF_TOPOLOGY);

raw_spin_rq_unlock_irqrestore(rq, flags);

--
2.39.1


2023-01-28 00:18:07

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 11/30] sched: Add normal_policy()

A new BPF extensible sched_class will need to dynamically change how a task
picks its sched_class. For example, if the loaded BPF scheduler progs fail,
the tasks will be forced back on CFS even if the task's policy is set to the
new sched_class. To support such mapping, add normal_policy() which wraps
testing for %SCHED_NORMAL. This doesn't cause any behavior changes.

v2: Update the description with more details on the expected use.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
kernel/sched/fair.c | 2 +-
kernel/sched/sched.h | 8 +++++++-
2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b4db8943ed8b..6055903cc3de 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7617,7 +7617,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
* Batch and idle tasks do not preempt non-idle tasks (their preemption
* is driven by the tick):
*/
- if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
+ if (unlikely(!normal_policy(p->policy)) || !sched_feat(WAKEUP_PREEMPTION))
return;

find_matching_se(&se, &pse);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cc1163b15aa0..91b6fed6aa93 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -182,9 +182,15 @@ static inline int idle_policy(int policy)
{
return policy == SCHED_IDLE;
}
+
+static inline int normal_policy(int policy)
+{
+ return policy == SCHED_NORMAL;
+}
+
static inline int fair_policy(int policy)
{
- return policy == SCHED_NORMAL || policy == SCHED_BATCH;
+ return normal_policy(policy) || policy == SCHED_BATCH;
}

static inline int rt_policy(int policy)
--
2.39.1


2023-01-28 00:18:18

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 12/30] sched_ext: Add boilerplate for extensible scheduler class

This adds dummy implementations of sched_ext interfaces which interact with
the scheduler core and hook them in the correct places. As they're all
dummies, this doesn't cause any behavior changes. This is split out to help
reviewing.

v2: balance_scx_on_up() dropped. This will be handled in sched_ext proper.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
include/linux/sched/ext.h | 12 ++++++++++++
kernel/fork.c | 2 ++
kernel/sched/core.c | 32 ++++++++++++++++++++++++--------
kernel/sched/ext.h | 24 ++++++++++++++++++++++++
kernel/sched/idle.c | 2 ++
kernel/sched/sched.h | 2 ++
6 files changed, 66 insertions(+), 8 deletions(-)
create mode 100644 include/linux/sched/ext.h
create mode 100644 kernel/sched/ext.h

diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
new file mode 100644
index 000000000000..a05dfcf533b0
--- /dev/null
+++ b/include/linux/sched/ext.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_SCHED_EXT_H
+#define _LINUX_SCHED_EXT_H
+
+#ifdef CONFIG_SCHED_CLASS_EXT
+#error "NOT IMPLEMENTED YET"
+#else /* !CONFIG_SCHED_CLASS_EXT */
+
+static inline void sched_ext_free(struct task_struct *p) {}
+
+#endif /* CONFIG_SCHED_CLASS_EXT */
+#endif /* _LINUX_SCHED_EXT_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 184c622b5513..f5aad7fa6350 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -23,6 +23,7 @@
#include <linux/sched/task.h>
#include <linux/sched/task_stack.h>
#include <linux/sched/cputime.h>
+#include <linux/sched/ext.h>
#include <linux/seq_file.h>
#include <linux/rtmutex.h>
#include <linux/init.h>
@@ -846,6 +847,7 @@ void __put_task_struct(struct task_struct *tsk)
WARN_ON(refcount_read(&tsk->usage));
WARN_ON(tsk == current);

+ sched_ext_free(tsk);
io_uring_free(tsk);
cgroup_free(tsk);
task_numa_free(tsk, true);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6447f10ecd44..6058042dcc3d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4623,6 +4623,8 @@ late_initcall(sched_core_sysctl_init);
*/
int sched_fork(unsigned long clone_flags, struct task_struct *p)
{
+ int ret;
+
__sched_fork(clone_flags, p);
/*
* We mark the process as NEW here. This guarantees that
@@ -4659,12 +4661,16 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
p->sched_reset_on_fork = 0;
}

- if (dl_prio(p->prio))
- return -EAGAIN;
- else if (rt_prio(p->prio))
+ scx_pre_fork(p);
+
+ if (dl_prio(p->prio)) {
+ ret = -EAGAIN;
+ goto out_cancel;
+ } else if (rt_prio(p->prio)) {
p->sched_class = &rt_sched_class;
- else
+ } else {
p->sched_class = &fair_sched_class;
+ }

init_entity_runnable_average(&p->se);

@@ -4682,6 +4688,10 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
RB_CLEAR_NODE(&p->pushable_dl_tasks);
#endif
return 0;
+
+out_cancel:
+ scx_cancel_fork(p);
+ return ret;
}

int sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs)
@@ -4712,16 +4722,18 @@ int sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs)
p->sched_class->task_fork(p);
raw_spin_unlock_irqrestore(&p->pi_lock, flags);

- return 0;
+ return scx_fork(p);
}

void sched_cancel_fork(struct task_struct *p)
{
+ scx_cancel_fork(p);
}

void sched_post_fork(struct task_struct *p)
{
uclamp_post_fork(p);
+ scx_post_fork(p);
}

unsigned long to_ratio(u64 period, u64 runtime)
@@ -5868,7 +5880,7 @@ static void put_prev_task_balance(struct rq *rq, struct task_struct *prev,
* We can terminate the balance pass as soon as we know there is
* a runnable task of @class priority or higher.
*/
- for_class_range(class, prev->sched_class, &idle_sched_class) {
+ for_balance_class_range(class, prev->sched_class, &idle_sched_class) {
if (class->balance(rq, prev, rf))
break;
}
@@ -5886,6 +5898,9 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
const struct sched_class *class;
struct task_struct *p;

+ if (scx_enabled())
+ goto restart;
+
/*
* Optimization: we know that if all tasks are in the fair class we can
* call that function directly, but only if the @prev task wasn't of a
@@ -5911,7 +5926,7 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
restart:
put_prev_task_balance(rq, prev, rf);

- for_each_class(class) {
+ for_each_active_class(class) {
p = class->pick_next_task(rq);
if (p)
return p;
@@ -5944,7 +5959,7 @@ static inline struct task_struct *pick_task(struct rq *rq)
const struct sched_class *class;
struct task_struct *p;

- for_each_class(class) {
+ for_each_active_class(class) {
p = class->pick_task(rq);
if (p)
return p;
@@ -9880,6 +9895,7 @@ void __init sched_init(void)
balance_push_set(smp_processor_id(), false);
#endif
init_sched_fair_class();
+ init_sched_ext_class();

psi_init();

diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h
new file mode 100644
index 000000000000..6a93c4825339
--- /dev/null
+++ b/kernel/sched/ext.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifdef CONFIG_SCHED_CLASS_EXT
+#error "NOT IMPLEMENTED YET"
+#else /* CONFIG_SCHED_CLASS_EXT */
+
+#define scx_enabled() false
+
+static inline void scx_pre_fork(struct task_struct *p) {}
+static inline int scx_fork(struct task_struct *p) { return 0; }
+static inline void scx_post_fork(struct task_struct *p) {}
+static inline void scx_cancel_fork(struct task_struct *p) {}
+static inline void init_sched_ext_class(void) {}
+
+#define for_each_active_class for_each_class
+#define for_balance_class_range for_class_range
+
+#endif /* CONFIG_SCHED_CLASS_EXT */
+
+#if defined(CONFIG_SCHED_CLASS_EXT) && defined(CONFIG_SMP)
+#error "NOT IMPLEMENTED YET"
+#else
+static inline void scx_update_idle(struct rq *rq, bool idle) {}
+#endif
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index f26ab2675f7d..86bc5832bdc4 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -428,11 +428,13 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl

static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
{
+ scx_update_idle(rq, false);
}

static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool first)
{
update_idle_core(rq);
+ scx_update_idle(rq, true);
schedstat_inc(rq->sched_goidle);
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 91b6fed6aa93..d57d17d8ea6b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3366,4 +3366,6 @@ enum cpu_cftype_id {
extern struct cftype cpu_cftypes[CPU_CFTYPE_CNT + 1];
#endif /* CONFIG_CGROUP_SCHED */

+#include "ext.h"
+
#endif /* _KERNEL_SCHED_SCHED_H */
--
2.39.1


2023-01-28 00:18:32

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 15/30] sched_ext: Add sysrq-S which disables the BPF scheduler

This enables the admin to abort the BPF scheduler and revert to CFS anytime.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
drivers/tty/sysrq.c | 1 +
include/linux/sched/ext.h | 1 +
kernel/sched/build_policy.c | 1 +
kernel/sched/ext.c | 20 ++++++++++++++++++++
4 files changed, 23 insertions(+)

diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index b6e70c5cfa17..ddfcdb6aecd7 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -520,6 +520,7 @@ static const struct sysrq_key_op *sysrq_key_table[62] = {
NULL, /* P */
NULL, /* Q */
NULL, /* R */
+ /* S: May be registered by sched_ext for resetting */
NULL, /* S */
NULL, /* T */
NULL, /* U */
diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index 8894b6a9977d..988d1e30e26c 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -55,6 +55,7 @@ enum scx_exit_type {
SCX_EXIT_DONE,

SCX_EXIT_UNREG = 64, /* BPF unregistration */
+ SCX_EXIT_SYSRQ, /* requested by 'S' sysrq */

SCX_EXIT_ERROR = 1024, /* runtime error, error msg contains details */
SCX_EXIT_ERROR_BPF, /* ERROR but triggered through scx_bpf_error() */
diff --git a/kernel/sched/build_policy.c b/kernel/sched/build_policy.c
index 4c658b21f603..005025f55bea 100644
--- a/kernel/sched/build_policy.c
+++ b/kernel/sched/build_policy.c
@@ -28,6 +28,7 @@
#include <linux/suspend.h>
#include <linux/tsacct_kern.h>
#include <linux/vtime.h>
+#include <linux/sysrq.h>
#include <linux/percpu-rwsem.h>

#include <uapi/linux/sched/types.h>
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 38786d36a356..ebbf3f3c1cf7 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1881,6 +1881,9 @@ static void scx_ops_disable_workfn(struct kthread_work *work)
case SCX_EXIT_UNREG:
reason = "BPF scheduler unregistered";
break;
+ case SCX_EXIT_SYSRQ:
+ reason = "disabled by sysrq-S";
+ break;
case SCX_EXIT_ERROR:
reason = "runtime error";
break;
@@ -2490,6 +2493,21 @@ struct bpf_struct_ops bpf_sched_ext_ops = {
.name = "sched_ext_ops",
};

+static void sysrq_handle_sched_ext_reset(int key)
+{
+ if (scx_ops_helper)
+ scx_ops_disable(SCX_EXIT_SYSRQ);
+ else
+ pr_info("sched_ext: BPF scheduler not yet used\n");
+}
+
+static const struct sysrq_key_op sysrq_sched_ext_reset_op = {
+ .handler = sysrq_handle_sched_ext_reset,
+ .help_msg = "reset-sched-ext(S)",
+ .action_msg = "Disable sched_ext and revert all tasks to CFS",
+ .enable_mask = SYSRQ_ENABLE_RTNICE,
+};
+
void __init init_sched_ext_class(void)
{
int cpu;
@@ -2513,6 +2531,8 @@ void __init init_sched_ext_class(void)

init_dsq(&rq->scx.local_dsq, SCX_DSQ_LOCAL);
}
+
+ register_sysrq_key('S', &sysrq_sched_ext_reset_op);
}


--
2.39.1


2023-01-28 00:18:41

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 13/30] sched_ext: Implement BPF extensible scheduler class

Implement a new scheduler class sched_ext (SCX), which allows scheduling
policies to be implemented as BPF programs to achieve the following:

1. Ease of experimentation and exploration: Enabling rapid iteration of new
scheduling policies.

2. Customization: Building application-specific schedulers which implement
policies that are not applicable to general-purpose schedulers.

3. Rapid scheduler deployments: Non-disruptive swap outs of scheduling
policies in production environments.

sched_ext leverages BPF’s struct_ops feature to define a structure which
exports function callbacks and flags to BPF programs that wish to implement
scheduling policies. The struct_ops structure exported by sched_ext is
struct sched_ext_ops, and is conceptually similar to struct sched_class. The
role of sched_ext is to map the complex sched_class callbacks to the more
simple and ergonomic struct sched_ext_ops callbacks.

For more detailed discussion on the motivations and overview, please refer
to the cover letter.

Later patches will also add several example schedulers and documentation.

This patch implements the minimum core framework to enable implementation of
BPF schedulers. Subsequent patches will gradually add functionalities
including safety guarantee mechanisms, nohz and cgroup support.

include/linux/sched/ext.h defines struct sched_ext_ops. With the comment on
top, each operation should be self-explanatory. The followings are worth
noting:

* Both "sched_ext" and its shorthand "scx" are used. If the identifier
already has "sched" in it, "ext" is used; otherwise, "scx".

* In sched_ext_ops, only .name is mandatory. Every operation is optional and
if omitted a simple but functional default behavior is provided.

* A new policy constant SCHED_EXT is added and a task can select sched_ext
by invoking sched_setscheduler(2) with the new policy constant. However,
if the BPF scheduler is not loaded, SCHED_EXT is the same as SCHED_NORMAL
and the task is scheduled by CFS. When the BPF scheduler is loaded, all
tasks which have the SCHED_EXT policy are switched to sched_ext.

* To bridge the workflow imbalance between the scheduler core and
sched_ext_ops callbacks, sched_ext uses simple FIFOs called dispatch
queues (dsq's). By default, there is one global dsq (SCX_DSQ_GLOBAL), and
one local per-CPU dsq (SCX_DSQ_LOCAL). SCX_DSQ_GLOBAL is provided for
convenience and need not be used by a scheduler that doesn't require it.
SCX_DSQ_LOCAL is the per-CPU FIFO that sched_ext pulls from when putting
the next task on the CPU. The BPF scheduler can manage an arbitrary number
of dsq's using scx_bpf_create_dsq() and scx_bpf_destroy_dsq().

* sched_ext guarantees system integrity no matter what the BPF scheduler
does. To enable this, each task's ownership is tracked through
p->scx.ops_state and all tasks are put on scx_tasks list. The disable path
can always recover and revert all tasks back to CFS. See p->scx.ops_state
and scx_tasks.

* A task is not tied to its rq while enqueued. This decouples CPU selection
from queueing and allows sharing a scheduling queue across an arbitrary
subset of CPUs. This adds some complexities as a task may need to be
bounced between rq's right before it starts executing. See
dispatch_to_local_dsq() and move_task_to_local_dsq().

* One complication that arises from the above weak association between task
and rq is that synchronizing with dequeue() gets complicated as dequeue()
may happen anytime while the task is enqueued and the dispatch path might
need to release the rq lock to transfer the task. Solving this requires a
bit of complexity. See the logic around p->scx.sticky_cpu and
p->scx.ops_qseq.

* Both enable and disable paths are a bit complicated. The enable path
switches all tasks without blocking to avoid issues which can arise from
partially switched states (e.g. the switching task itself being starved).
The disable path can't trust the BPF scheduler at all, so it also has to
guarantee forward progress without blocking. See scx_ops_enable() and
scx_ops_disable_workfn().

* When sched_ext is disabled, static_branches are used to shut down the
entry points from hot paths.

v2: * balance_scx_on_up() is dropped. Instead, on UP, balance_scx() is
called from put_prev_taks_scx() and pick_next_task_scx() as necessary.
To determine whether balance_scx() should be called from
put_prev_task_scx(), SCX_TASK_DEQD_FOR_SLEEP flag is added. See the
comment in put_prev_task_scx() for details.

* sched_deq_and_put_task() / sched_enq_and_set_task() sequences replaced
with SCHED_CHANGE_BLOCK().

* Unused all_dsqs list removed. This was a left-over from previous
iterations.

* p->scx.kf_mask is added to track and enforce which kfunc helpers are
allowed. Also, init/exit sequences are updated to make some kfuncs
always safe to call regardless of the current BPF scheduler state.
Combined, this should make all the kfuncs safe.

* BPF now supports sleepable struct_ops operations. Hacky workaround
removed and operations and kfunc helpers are tagged appropriately.

* BPF now supports bitmask / cpumask helpers. scx_bpf_get_idle_cpumask()
and friends are added so that BPF schedulers can use the idle masks
with the generic helpers. This replaces the hacky kfunc helpers added
by a separate patch in V1.

* CONFIG_SCHED_CLASS_EXT can no longer be enabled if SCHED_CORE is
enabled. This restriction will be removed by a later patch which adds
core-sched support.

* Add MAINTAINERS entries and other misc changes.

Signed-off-by: Tejun Heo <[email protected]>
Co-authored-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
MAINTAINERS | 3 +
include/asm-generic/vmlinux.lds.h | 1 +
include/linux/sched.h | 5 +
include/linux/sched/ext.h | 389 +++-
include/uapi/linux/sched.h | 1 +
init/init_task.c | 10 +
kernel/Kconfig.preempt | 22 +-
kernel/bpf/bpf_struct_ops_types.h | 4 +
kernel/sched/build_policy.c | 4 +
kernel/sched/core.c | 26 +
kernel/sched/debug.c | 6 +
kernel/sched/ext.c | 2979 +++++++++++++++++++++++++++++
kernel/sched/ext.h | 97 +-
kernel/sched/sched.h | 15 +
14 files changed, 3558 insertions(+), 4 deletions(-)
create mode 100644 kernel/sched/ext.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 0f61b5d64716..678e511bc3ba 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -18588,6 +18588,8 @@ R: Ben Segall <[email protected]> (CONFIG_CFS_BANDWIDTH)
R: Mel Gorman <[email protected]> (CONFIG_NUMA_BALANCING)
R: Daniel Bristot de Oliveira <[email protected]> (SCHED_DEADLINE)
R: Valentin Schneider <[email protected]> (TOPOLOGY)
+R: Tejun Heo <[email protected]> (SCHED_EXT)
+R: David Vernet <[email protected]> (SCHED_EXT)
L: [email protected]
S: Maintained
T: git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core
@@ -18596,6 +18598,7 @@ F: include/linux/sched.h
F: include/linux/wait.h
F: include/uapi/linux/sched.h
F: kernel/sched/
+F: tools/sched_ext/

SCR24X CHIP CARD INTERFACE DRIVER
M: Lubomir Rintel <[email protected]>
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 659bf3b31c91..b9974e6eadd8 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -131,6 +131,7 @@
*(__dl_sched_class) \
*(__rt_sched_class) \
*(__fair_sched_class) \
+ *(__ext_sched_class) \
*(__idle_sched_class) \
__sched_class_lowest = .;

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 853d08f7562b..9c0362e3be71 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -70,6 +70,8 @@ struct signal_struct;
struct task_delay_info;
struct task_group;

+#include <linux/sched/ext.h>
+
/*
* Task state bitmask. NOTE! These bits are also
* encoded in fs/proc/array.c: get_task_state().
@@ -788,6 +790,9 @@ struct task_struct {
struct sched_entity se;
struct sched_rt_entity rt;
struct sched_dl_entity dl;
+#ifdef CONFIG_SCHED_CLASS_EXT
+ struct sched_ext_entity scx;
+#endif
const struct sched_class *sched_class;

#ifdef CONFIG_SCHED_CORE
diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index a05dfcf533b0..8894b6a9977d 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -1,9 +1,396 @@
/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2022 Tejun Heo <[email protected]>
+ * Copyright (c) 2022 David Vernet <[email protected]>
+ */
#ifndef _LINUX_SCHED_EXT_H
#define _LINUX_SCHED_EXT_H

#ifdef CONFIG_SCHED_CLASS_EXT
-#error "NOT IMPLEMENTED YET"
+
+#include <linux/rhashtable.h>
+#include <linux/llist.h>
+
+enum scx_consts {
+ SCX_OPS_NAME_LEN = 128,
+ SCX_EXIT_REASON_LEN = 128,
+ SCX_EXIT_BT_LEN = 64,
+ SCX_EXIT_MSG_LEN = 1024,
+
+ SCX_SLICE_DFL = 20 * NSEC_PER_MSEC,
+};
+
+/*
+ * DSQ (dispatch queue) IDs are 64bit of the format:
+ *
+ * Bits: [63] [62 .. 0]
+ * [ B] [ ID ]
+ *
+ * B: 1 for IDs for built-in DSQs, 0 for ops-created user DSQs
+ * ID: 63 bit ID
+ *
+ * Built-in IDs:
+ *
+ * Bits: [63] [62] [61..32] [31 .. 0]
+ * [ 1] [ L] [ R ] [ V ]
+ *
+ * 1: 1 for built-in DSQs.
+ * L: 1 for LOCAL_ON DSQ IDs, 0 for others
+ * V: For LOCAL_ON DSQ IDs, a CPU number. For others, a pre-defined value.
+ */
+enum scx_dsq_id_flags {
+ SCX_DSQ_FLAG_BUILTIN = 1LLU << 63,
+ SCX_DSQ_FLAG_LOCAL_ON = 1LLU << 62,
+
+ SCX_DSQ_INVALID = SCX_DSQ_FLAG_BUILTIN | 0,
+ SCX_DSQ_GLOBAL = SCX_DSQ_FLAG_BUILTIN | 1,
+ SCX_DSQ_LOCAL = SCX_DSQ_FLAG_BUILTIN | 2,
+ SCX_DSQ_LOCAL_ON = SCX_DSQ_FLAG_BUILTIN | SCX_DSQ_FLAG_LOCAL_ON,
+ SCX_DSQ_LOCAL_CPU_MASK = 0xffffffffLLU,
+};
+
+enum scx_exit_type {
+ SCX_EXIT_NONE,
+ SCX_EXIT_DONE,
+
+ SCX_EXIT_UNREG = 64, /* BPF unregistration */
+
+ SCX_EXIT_ERROR = 1024, /* runtime error, error msg contains details */
+ SCX_EXIT_ERROR_BPF, /* ERROR but triggered through scx_bpf_error() */
+};
+
+/*
+ * scx_exit_info is passed to ops.exit() to describe why the BPF scheduler is
+ * being disabled.
+ */
+struct scx_exit_info {
+ /* %SCX_EXIT_* - broad category of the exit reason */
+ enum scx_exit_type type;
+ /* textual representation of the above */
+ char reason[SCX_EXIT_REASON_LEN];
+ /* number of entries in the backtrace */
+ u32 bt_len;
+ /* backtrace if exiting due to an error */
+ unsigned long bt[SCX_EXIT_BT_LEN];
+ /* extra message */
+ char msg[SCX_EXIT_MSG_LEN];
+};
+
+/* sched_ext_ops.flags */
+enum scx_ops_flags {
+ /*
+ * Keep built-in idle tracking even if ops.update_idle() is implemented.
+ */
+ SCX_OPS_KEEP_BUILTIN_IDLE = 1LLU << 0,
+
+ /*
+ * By default, if there are no other task to run on the CPU, ext core
+ * keeps running the current task even after its slice expires. If this
+ * flag is specified, such tasks are passed to ops.enqueue() with
+ * %SCX_ENQ_LAST. See the comment above %SCX_ENQ_LAST for more info.
+ */
+ SCX_OPS_ENQ_LAST = 1LLU << 1,
+
+ /*
+ * An exiting task may schedule after PF_EXITING is set. In such cases,
+ * scx_bpf_find_task_by_pid() may not be able to find the task and if
+ * the BPF scheduler depends on pid lookup for dispatching, the task
+ * will be lost leading to various issues including RCU grace period
+ * stalls.
+ *
+ * To mask this problem, by default, unhashed tasks are automatically
+ * dispatched to the local DSQ on enqueue. If the BPF scheduler doesn't
+ * depend on pid lookups and wants to handle these tasks directly, the
+ * following flag can be used.
+ */
+ SCX_OPS_ENQ_EXITING = 1LLU << 2,
+
+ SCX_OPS_ALL_FLAGS = SCX_OPS_KEEP_BUILTIN_IDLE |
+ SCX_OPS_ENQ_LAST |
+ SCX_OPS_ENQ_EXITING,
+};
+
+/* argument container for ops.enable() and friends */
+struct scx_enable_args {
+ /* empty for now */
+};
+
+/**
+ * struct sched_ext_ops - Operation table for BPF scheduler implementation
+ *
+ * Userland can implement an arbitrary scheduling policy by implementing and
+ * loading operations in this table.
+ */
+struct sched_ext_ops {
+ /**
+ * select_cpu - Pick the target CPU for a task which is being woken up
+ * @p: task being woken up
+ * @prev_cpu: the cpu @p was on before sleeping
+ * @wake_flags: SCX_WAKE_*
+ *
+ * Decision made here isn't final. @p may be moved to any CPU while it
+ * is getting dispatched for execution later. However, as @p is not on
+ * the rq at this point, getting the eventual execution CPU right here
+ * saves a small bit of overhead down the line.
+ *
+ * If an idle CPU is returned, the CPU is kicked and will try to
+ * dispatch. While an explicit custom mechanism can be added,
+ * select_cpu() serves as the default way to wake up idle CPUs.
+ */
+ s32 (*select_cpu)(struct task_struct *p, s32 prev_cpu, u64 wake_flags);
+
+ /**
+ * enqueue - Enqueue a task on the BPF scheduler
+ * @p: task being enqueued
+ * @enq_flags: %SCX_ENQ_*
+ *
+ * @p is ready to run. Dispatch directly by calling scx_bpf_dispatch()
+ * or enqueue on the BPF scheduler. If not directly dispatched, the bpf
+ * scheduler owns @p and if it fails to dispatch @p, the task will
+ * stall.
+ */
+ void (*enqueue)(struct task_struct *p, u64 enq_flags);
+
+ /**
+ * dequeue - Remove a task from the BPF scheduler
+ * @p: task being dequeued
+ * @deq_flags: %SCX_DEQ_*
+ *
+ * Remove @p from the BPF scheduler. This is usually called to isolate
+ * the task while updating its scheduling properties (e.g. priority).
+ *
+ * The ext core keeps track of whether the BPF side owns a given task or
+ * not and can gracefully ignore spurious dispatches from BPF side,
+ * which makes it safe to not implement this method. However, depending
+ * on the scheduling logic, this can lead to confusing behaviors - e.g.
+ * scheduling position not being updated across a priority change.
+ */
+ void (*dequeue)(struct task_struct *p, u64 deq_flags);
+
+ /**
+ * dispatch - Dispatch tasks from the BPF scheduler and/or consume DSQs
+ * @cpu: CPU to dispatch tasks for
+ * @prev: previous task being switched out
+ *
+ * Called when a CPU's local dsq is empty. The operation should dispatch
+ * one or more tasks from the BPF scheduler into the DSQs using
+ * scx_bpf_dispatch() and/or consume user DSQs into the local DSQ using
+ * scx_bpf_consume().
+ *
+ * The maximum number of times scx_bpf_dispatch() can be called without
+ * an intervening scx_bpf_consume() is specified by
+ * ops.dispatch_max_batch. See the comments on top of the two functions
+ * for more details.
+ *
+ * When not %NULL, @prev is an SCX task with its slice depleted. If
+ * @prev is still runnable as indicated by set %SCX_TASK_QUEUED in
+ * @prev->scx.flags, it is not enqueued yet and will be enqueued after
+ * ops.dispatch() returns. To keep executing @prev, return without
+ * dispatching or consuming any tasks. Also see %SCX_OPS_ENQ_LAST.
+ */
+ void (*dispatch)(s32 cpu, struct task_struct *prev);
+
+ /**
+ * yield - Yield CPU
+ * @from: yielding task
+ * @to: optional yield target task
+ *
+ * If @to is NULL, @from is yielding the CPU to other runnable tasks.
+ * The BPF scheduler should ensure that other available tasks are
+ * dispatched before the yielding task. Return value is ignored in this
+ * case.
+ *
+ * If @to is not-NULL, @from wants to yield the CPU to @to. If the bpf
+ * scheduler can implement the request, return %true; otherwise, %false.
+ */
+ bool (*yield)(struct task_struct *from, struct task_struct *to);
+
+ /**
+ * set_cpumask - Set CPU affinity
+ * @p: task to set CPU affinity for
+ * @cpumask: cpumask of cpus that @p can run on
+ *
+ * Update @p's CPU affinity to @cpumask.
+ */
+ void (*set_cpumask)(struct task_struct *p, struct cpumask *cpumask);
+
+ /**
+ * update_idle - Update the idle state of a CPU
+ * @cpu: CPU to udpate the idle state for
+ * @idle: whether entering or exiting the idle state
+ *
+ * This operation is called when @rq's CPU goes or leaves the idle
+ * state. By default, implementing this operation disables the built-in
+ * idle CPU tracking and the following helpers become unavailable:
+ *
+ * - scx_bpf_select_cpu_dfl()
+ * - scx_bpf_test_and_clear_cpu_idle()
+ * - scx_bpf_pick_idle_cpu()
+ * - scx_bpf_any_idle_cpu()
+ *
+ * The user also must implement ops.select_cpu() as the default
+ * implementation relies on scx_bpf_select_cpu_dfl().
+ *
+ * If you keep the built-in idle tracking, specify the
+ * %SCX_OPS_KEEP_BUILTIN_IDLE flag.
+ */
+ void (*update_idle)(s32 cpu, bool idle);
+
+ /**
+ * prep_enable - Prepare to enable BPF scheduling for a task
+ * @p: task to prepare BPF scheduling for
+ * @args: enable arguments, see the struct definition
+ *
+ * Either we're loading a BPF scheduler or a new task is being forked.
+ * Prepare BPF scheduling for @p. This operation may block and can be
+ * used for allocations.
+ *
+ * Return 0 for success, -errno for failure. An error return while
+ * loading will abort loading of the BPF scheduler. During a fork, will
+ * abort the specific fork.
+ */
+ s32 (*prep_enable)(struct task_struct *p, struct scx_enable_args *args);
+
+ /**
+ * enable - Enable BPF scheduling for a task
+ * @p: task to enable BPF scheduling for
+ * @args: enable arguments, see the struct definition
+ *
+ * Enable @p for BPF scheduling. @p will start running soon.
+ */
+ void (*enable)(struct task_struct *p, struct scx_enable_args *args);
+
+ /**
+ * cancel_enable - Cancel prep_enable()
+ * @p: task being canceled
+ * @args: enable arguments, see the struct definition
+ *
+ * @p was prep_enable()'d but failed before reaching enable(). Undo the
+ * preparation.
+ */
+ void (*cancel_enable)(struct task_struct *p,
+ struct scx_enable_args *args);
+
+ /**
+ * disable - Disable BPF scheduling for a task
+ * @p: task to disable BPF scheduling for
+ *
+ * @p is exiting, leaving SCX or the BPF scheduler is being unloaded.
+ * Disable BPF scheduling for @p.
+ */
+ void (*disable)(struct task_struct *p);
+
+ /*
+ * All online ops must come before ops.init().
+ */
+
+ /**
+ * init - Initialize the BPF scheduler
+ */
+ s32 (*init)(void);
+
+ /**
+ * exit - Clean up after the BPF scheduler
+ * @info: Exit info
+ */
+ void (*exit)(struct scx_exit_info *info);
+
+ /**
+ * dispatch_max_batch - Max nr of tasks that dispatch() can dispatch
+ */
+ u32 dispatch_max_batch;
+
+ /**
+ * flags - %SCX_OPS_* flags
+ */
+ u64 flags;
+
+ /**
+ * name - BPF scheduler's name
+ *
+ * Must be a non-zero valid BPF object name including only isalnum(),
+ * '_' and '.' chars. Shows up in kernel.sched_ext_ops sysctl while the
+ * BPF scheduler is enabled.
+ */
+ char name[SCX_OPS_NAME_LEN];
+};
+
+/*
+ * Dispatch queue (dsq) is a simple FIFO which is used to buffer between the
+ * scheduler core and the BPF scheduler. See the documentation for more details.
+ */
+struct scx_dispatch_q {
+ raw_spinlock_t lock;
+ struct list_head fifo;
+ u64 id;
+ u32 nr;
+ struct rhash_head hash_node;
+ struct llist_node free_node;
+ struct rcu_head rcu;
+};
+
+/* scx_entity.flags */
+enum scx_ent_flags {
+ SCX_TASK_QUEUED = 1 << 0, /* on ext runqueue */
+ SCX_TASK_BAL_KEEP = 1 << 1, /* balance decided to keep current */
+ SCX_TASK_ENQ_LOCAL = 1 << 2, /* used by scx_select_cpu_dfl() to set SCX_ENQ_LOCAL */
+
+ SCX_TASK_OPS_PREPPED = 1 << 3, /* prepared for BPF scheduler enable */
+ SCX_TASK_OPS_ENABLED = 1 << 4, /* task has BPF scheduler enabled */
+
+ SCX_TASK_DEQD_FOR_SLEEP = 1 << 6, /* last dequeue was for SLEEP */
+
+ SCX_TASK_CURSOR = 1 << 7, /* iteration cursor, not a task */
+};
+
+/*
+ * Mask bits for scx_entity.kf_mask. Not all kfuncs can be called from
+ * everywhere and the following bits track which kfunc sets are currently
+ * allowed for %current. This simple per-task tracking works because SCX ops
+ * nest in a limited way. BPF will likely implement a way to allow and disallow
+ * kfuncs depending on the calling context which will replace this manual
+ * mechanism. See scx_kf_allow().
+ */
+enum scx_kf_mask {
+ /* others may be nested inside INIT and SLEEPABLE */
+ SCX_KF_INIT = 1 << 0, /* allowed from ops.init() */
+ SCX_KF_SLEEPABLE = 1 << 1, /* from sleepable init operations */
+
+ SCX_KF_ENQUEUE_DISPATCH = 1 << 3, /* from ops.enqueue() or .dispatch() */
+ SCX_KF_DISPATCH = 1 << 4, /* from ops.dispatch() */
+};
+
+/*
+ * The following is embedded in task_struct and contains all fields necessary
+ * for a task to be scheduled by SCX.
+ */
+struct sched_ext_entity {
+ struct scx_dispatch_q *dsq;
+ struct list_head dsq_node;
+ u32 flags; /* protected by rq lock */
+ u32 weight;
+ s32 sticky_cpu;
+ s32 holding_cpu;
+ u32 kf_mask; /* see scx_kf_mask above */
+ atomic64_t ops_state;
+
+ /* BPF scheduler modifiable fields */
+
+ /*
+ * Runtime budget in nsecs. This is usually set through
+ * scx_bpf_dispatch() but can also be modified directly by the BPF
+ * scheduler. Automatically decreased by SCX as the task executes. On
+ * depletion, a scheduling event is triggered.
+ */
+ u64 slice;
+
+ /* cold fields */
+ struct list_head tasks_node;
+};
+
+void sched_ext_free(struct task_struct *p);
+
#else /* !CONFIG_SCHED_CLASS_EXT */

static inline void sched_ext_free(struct task_struct *p) {}
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 3bac0a8ceab2..359a14cc76a4 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -118,6 +118,7 @@ struct clone_args {
/* SCHED_ISO: reserved but not implemented yet */
#define SCHED_IDLE 5
#define SCHED_DEADLINE 6
+#define SCHED_EXT 7

/* Can be ORed in to make sure the process is reverted back to SCHED_NORMAL on fork */
#define SCHED_RESET_ON_FORK 0x40000000
diff --git a/init/init_task.c b/init/init_task.c
index ff6c4b9bfe6b..bdbc663107bf 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -6,6 +6,7 @@
#include <linux/sched/sysctl.h>
#include <linux/sched/rt.h>
#include <linux/sched/task.h>
+#include <linux/sched/ext.h>
#include <linux/init.h>
#include <linux/fs.h>
#include <linux/mm.h>
@@ -101,6 +102,15 @@ struct task_struct init_task
#endif
#ifdef CONFIG_CGROUP_SCHED
.sched_task_group = &root_task_group,
+#endif
+#ifdef CONFIG_SCHED_CLASS_EXT
+ .scx = {
+ .dsq_node = LIST_HEAD_INIT(init_task.scx.dsq_node),
+ .sticky_cpu = -1,
+ .holding_cpu = -1,
+ .ops_state = ATOMIC_INIT(0),
+ .slice = SCX_SLICE_DFL,
+ },
#endif
.ptraced = LIST_HEAD_INIT(init_task.ptraced),
.ptrace_entry = LIST_HEAD_INIT(init_task.ptrace_entry),
diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index c2f1fd95a821..0afcda19bc50 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -133,4 +133,24 @@ config SCHED_CORE
which is the likely usage by Linux distributions, there should
be no measurable impact on performance.

-
+config SCHED_CLASS_EXT
+ bool "Extensible Scheduling Class"
+ depends on BPF_SYSCALL && BPF_JIT && !SCHED_CORE
+ help
+ This option enables a new scheduler class sched_ext (SCX), which
+ allows scheduling policies to be implemented as BPF programs to
+ achieve the following:
+
+ - Ease of experimentation and exploration: Enabling rapid
+ iteration of new scheduling policies.
+ - Customization: Building application-specific schedulers which
+ implement policies that are not applicable to general-purpose
+ schedulers.
+ - Rapid scheduler deployments: Non-disruptive swap outs of
+ scheduling policies in production environments.
+
+ sched_ext leverages BPF’s struct_ops feature to define a structure
+ which exports function callbacks and flags to BPF programs that
+ wish to implement scheduling policies. The struct_ops structure
+ exported by sched_ext is struct sched_ext_ops, and is conceptually
+ similar to struct sched_class.
diff --git a/kernel/bpf/bpf_struct_ops_types.h b/kernel/bpf/bpf_struct_ops_types.h
index 5678a9ddf817..3618769d853d 100644
--- a/kernel/bpf/bpf_struct_ops_types.h
+++ b/kernel/bpf/bpf_struct_ops_types.h
@@ -9,4 +9,8 @@ BPF_STRUCT_OPS_TYPE(bpf_dummy_ops)
#include <net/tcp.h>
BPF_STRUCT_OPS_TYPE(tcp_congestion_ops)
#endif
+#ifdef CONFIG_SCHED_CLASS_EXT
+#include <linux/sched/ext.h>
+BPF_STRUCT_OPS_TYPE(sched_ext_ops)
+#endif
#endif
diff --git a/kernel/sched/build_policy.c b/kernel/sched/build_policy.c
index d9dc9ab3773f..4c658b21f603 100644
--- a/kernel/sched/build_policy.c
+++ b/kernel/sched/build_policy.c
@@ -28,6 +28,7 @@
#include <linux/suspend.h>
#include <linux/tsacct_kern.h>
#include <linux/vtime.h>
+#include <linux/percpu-rwsem.h>

#include <uapi/linux/sched/types.h>

@@ -52,3 +53,6 @@
#include "cputime.c"
#include "deadline.c"

+#ifdef CONFIG_SCHED_CLASS_EXT
+# include "ext.c"
+#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6058042dcc3d..804aa291b837 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4421,6 +4421,18 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->rt.on_rq = 0;
p->rt.on_list = 0;

+#ifdef CONFIG_SCHED_CLASS_EXT
+ p->scx.dsq = NULL;
+ INIT_LIST_HEAD(&p->scx.dsq_node);
+ p->scx.flags = 0;
+ p->scx.weight = 0;
+ p->scx.sticky_cpu = -1;
+ p->scx.holding_cpu = -1;
+ p->scx.kf_mask = 0;
+ atomic64_set(&p->scx.ops_state, 0);
+ p->scx.slice = SCX_SLICE_DFL;
+#endif
+
#ifdef CONFIG_PREEMPT_NOTIFIERS
INIT_HLIST_HEAD(&p->preempt_notifiers);
#endif
@@ -4668,6 +4680,10 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
goto out_cancel;
} else if (rt_prio(p->prio)) {
p->sched_class = &rt_sched_class;
+#ifdef CONFIG_SCHED_CLASS_EXT
+ } else if (task_on_scx(p)) {
+ p->sched_class = &ext_sched_class;
+#endif
} else {
p->sched_class = &fair_sched_class;
}
@@ -6934,6 +6950,10 @@ void __setscheduler_prio(struct task_struct *p, int prio)
p->sched_class = &dl_sched_class;
else if (rt_prio(prio))
p->sched_class = &rt_sched_class;
+#ifdef CONFIG_SCHED_CLASS_EXT
+ else if (task_on_scx(p))
+ p->sched_class = &ext_sched_class;
+#endif
else
p->sched_class = &fair_sched_class;

@@ -8854,6 +8874,7 @@ SYSCALL_DEFINE1(sched_get_priority_max, int, policy)
case SCHED_NORMAL:
case SCHED_BATCH:
case SCHED_IDLE:
+ case SCHED_EXT:
ret = 0;
break;
}
@@ -8881,6 +8902,7 @@ SYSCALL_DEFINE1(sched_get_priority_min, int, policy)
case SCHED_NORMAL:
case SCHED_BATCH:
case SCHED_IDLE:
+ case SCHED_EXT:
ret = 0;
}
return ret;
@@ -9726,6 +9748,10 @@ void __init sched_init(void)
BUG_ON(!sched_class_above(&dl_sched_class, &rt_sched_class));
BUG_ON(!sched_class_above(&rt_sched_class, &fair_sched_class));
BUG_ON(!sched_class_above(&fair_sched_class, &idle_sched_class));
+#ifdef CONFIG_SCHED_CLASS_EXT
+ BUG_ON(!sched_class_above(&fair_sched_class, &ext_sched_class));
+ BUG_ON(!sched_class_above(&ext_sched_class, &idle_sched_class));
+#endif

wait_bit_init();

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 1637b65ba07a..814ed80b8ff6 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -338,6 +338,9 @@ static __init int sched_init_debug(void)

debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);

+#ifdef CONFIG_SCHED_CLASS_EXT
+ debugfs_create_file("ext", 0444, debugfs_sched, NULL, &sched_ext_fops);
+#endif
return 0;
}
late_initcall(sched_init_debug);
@@ -1047,6 +1050,9 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
P(dl.runtime);
P(dl.deadline);
}
+#ifdef CONFIG_SCHED_CLASS_EXT
+ __PS("ext.enabled", p->sched_class == &ext_sched_class);
+#endif
#undef PN_SCHEDSTAT
#undef P_SCHEDSTAT

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
new file mode 100644
index 000000000000..38786d36a356
--- /dev/null
+++ b/kernel/sched/ext.c
@@ -0,0 +1,2979 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2022 Tejun Heo <[email protected]>
+ * Copyright (c) 2022 David Vernet <[email protected]>
+ */
+#define SCX_OP_IDX(op) (offsetof(struct sched_ext_ops, op) / sizeof(void (*)(void)))
+
+enum scx_internal_consts {
+ SCX_NR_ONLINE_OPS = SCX_OP_IDX(init),
+ SCX_DSP_DFL_MAX_BATCH = 32,
+};
+
+enum scx_ops_enable_state {
+ SCX_OPS_PREPPING,
+ SCX_OPS_ENABLING,
+ SCX_OPS_ENABLED,
+ SCX_OPS_DISABLING,
+ SCX_OPS_DISABLED,
+};
+
+/*
+ * sched_ext_entity->ops_state
+ *
+ * Used to track the task ownership between the SCX core and the BPF scheduler.
+ * State transitions look as follows:
+ *
+ * NONE -> QUEUEING -> QUEUED -> DISPATCHING
+ * ^ | |
+ * | v v
+ * \-------------------------------/
+ *
+ * QUEUEING and DISPATCHING states can be waited upon. See wait_ops_state() call
+ * sites for explanations on the conditions being waited upon and why they are
+ * safe. Transitions out of them into NONE or QUEUED must store_release and the
+ * waiters should load_acquire.
+ *
+ * Tracking scx_ops_state enables sched_ext core to reliably determine whether
+ * any given task can be dispatched by the BPF scheduler at all times and thus
+ * relaxes the requirements on the BPF scheduler. This allows the BPF scheduler
+ * to try to dispatch any task anytime regardless of its state as the SCX core
+ * can safely reject invalid dispatches.
+ */
+enum scx_ops_state {
+ SCX_OPSS_NONE, /* owned by the SCX core */
+ SCX_OPSS_QUEUEING, /* in transit to the BPF scheduler */
+ SCX_OPSS_QUEUED, /* owned by the BPF scheduler */
+ SCX_OPSS_DISPATCHING, /* in transit back to the SCX core */
+
+ /*
+ * QSEQ brands each QUEUED instance so that, when dispatch races
+ * dequeue/requeue, the dispatcher can tell whether it still has a claim
+ * on the task being dispatched.
+ */
+ SCX_OPSS_QSEQ_SHIFT = 2,
+ SCX_OPSS_STATE_MASK = (1LLU << SCX_OPSS_QSEQ_SHIFT) - 1,
+ SCX_OPSS_QSEQ_MASK = ~SCX_OPSS_STATE_MASK,
+};
+
+/*
+ * During exit, a task may schedule after losing its PIDs. When disabling the
+ * BPF scheduler, we need to be able to iterate tasks in every state to
+ * guarantee system safety. Maintain a dedicated task list which contains every
+ * task between its fork and eventual free.
+ */
+static DEFINE_SPINLOCK(scx_tasks_lock);
+static LIST_HEAD(scx_tasks);
+
+/* ops enable/disable */
+static struct kthread_worker *scx_ops_helper;
+static DEFINE_MUTEX(scx_ops_enable_mutex);
+DEFINE_STATIC_KEY_FALSE(__scx_ops_enabled);
+DEFINE_STATIC_PERCPU_RWSEM(scx_fork_rwsem);
+static atomic_t scx_ops_enable_state_var = ATOMIC_INIT(SCX_OPS_DISABLED);
+static struct sched_ext_ops scx_ops;
+static bool scx_warned_zero_slice;
+
+static DEFINE_STATIC_KEY_FALSE(scx_ops_enq_last);
+static DEFINE_STATIC_KEY_FALSE(scx_ops_enq_exiting);
+static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_enabled);
+
+struct static_key_false scx_has_op[SCX_NR_ONLINE_OPS] =
+ { [0 ... SCX_NR_ONLINE_OPS-1] = STATIC_KEY_FALSE_INIT };
+
+static atomic_t scx_exit_type = ATOMIC_INIT(SCX_EXIT_DONE);
+static struct scx_exit_info scx_exit_info;
+
+static atomic64_t scx_nr_rejected = ATOMIC64_INIT(0);
+
+/* idle tracking */
+#ifdef CONFIG_SMP
+#ifdef CONFIG_CPUMASK_OFFSTACK
+#define CL_ALIGNED_IF_ONSTACK
+#else
+#define CL_ALIGNED_IF_ONSTACK __cacheline_aligned_in_smp
+#endif
+
+static struct {
+ cpumask_var_t cpu;
+ cpumask_var_t smt;
+} idle_masks CL_ALIGNED_IF_ONSTACK;
+
+static bool __cacheline_aligned_in_smp scx_has_idle_cpus;
+#endif /* CONFIG_SMP */
+
+/*
+ * Direct dispatch marker.
+ *
+ * Non-NULL values are used for direct dispatch from enqueue path. A valid
+ * pointer points to the task currently being enqueued. An ERR_PTR value is used
+ * to indicate that direct dispatch has already happened.
+ */
+static DEFINE_PER_CPU(struct task_struct *, direct_dispatch_task);
+
+/* dispatch queues */
+static struct scx_dispatch_q __cacheline_aligned_in_smp scx_dsq_global;
+
+static const struct rhashtable_params dsq_hash_params = {
+ .key_len = 8,
+ .key_offset = offsetof(struct scx_dispatch_q, id),
+ .head_offset = offsetof(struct scx_dispatch_q, hash_node),
+};
+
+static struct rhashtable dsq_hash;
+static LLIST_HEAD(dsqs_to_free);
+
+/* dispatch buf */
+struct scx_dsp_buf_ent {
+ struct task_struct *task;
+ u64 qseq;
+ u64 dsq_id;
+ u64 enq_flags;
+};
+
+static u32 scx_dsp_max_batch;
+static struct scx_dsp_buf_ent __percpu *scx_dsp_buf;
+
+struct scx_dsp_ctx {
+ struct rq *rq;
+ struct rq_flags *rf;
+ u32 buf_cursor;
+ u32 nr_tasks;
+};
+
+static DEFINE_PER_CPU(struct scx_dsp_ctx, scx_dsp_ctx);
+
+void scx_bpf_dispatch(struct task_struct *p, u64 dsq_id, u64 slice,
+ u64 enq_flags);
+__printf(2, 3) static void scx_ops_error_type(enum scx_exit_type type,
+ const char *fmt, ...);
+#define scx_ops_error(fmt, args...) \
+ scx_ops_error_type(SCX_EXIT_ERROR, fmt, ##args)
+
+struct scx_task_iter {
+ struct sched_ext_entity cursor;
+ struct task_struct *locked;
+ struct rq *rq;
+ struct rq_flags rf;
+};
+
+#define SCX_HAS_OP(op) static_branch_likely(&scx_has_op[SCX_OP_IDX(op)])
+
+/*
+ * scx_kf_mask enforcement. Some kfuncs can only be called from specific SCX
+ * ops. When invoking SCX ops, SCX_CALL_OP[_RET]() should be used to indicate
+ * the allowed kfuncs and those kfuncs should use scx_kf_allowed() to check
+ * whether it's running from an allowed context.
+ */
+static void scx_kf_allow(u32 mask)
+{
+ u32 allowed_nesters = 0;
+
+ /* INIT|SLEEPABLE can nest others but not themselves */
+ if (!(mask & (SCX_KF_INIT | SCX_KF_SLEEPABLE)))
+ allowed_nesters |= SCX_KF_INIT | SCX_KF_SLEEPABLE;
+
+ WARN_ONCE(current->scx.kf_mask & ~allowed_nesters,
+ "invalid nesting current->scx.kf_mask=0x%x mask=0x%x allowed_nesters=0x%x\n",
+ current->scx.kf_mask, mask, allowed_nesters);
+
+ current->scx.kf_mask |= mask;
+}
+
+static void scx_kf_disallow(u32 mask)
+{
+ current->scx.kf_mask &= ~mask;
+}
+
+#define SCX_CALL_OP(mask, op, args...) \
+do { \
+ scx_kf_allow(mask); \
+ scx_ops.op(args); \
+ scx_kf_disallow(mask); \
+} while (0)
+
+#define SCX_CALL_OP_RET(mask, op, args...) \
+({ \
+ __typeof__(scx_ops.op(args)) __ret; \
+ scx_kf_allow(mask); \
+ __ret = scx_ops.op(args); \
+ scx_kf_disallow(mask); \
+ __ret; \
+})
+
+static bool scx_kf_allowed(u32 mask)
+{
+ if (unlikely(!(current->scx.kf_mask & mask))) {
+ scx_ops_error("kfunc with mask 0x%x called from an operation only allowing 0x%x",
+ mask, current->scx.kf_mask);
+ return false;
+ }
+
+ if (unlikely((mask & SCX_KF_SLEEPABLE) && in_interrupt())) {
+ scx_ops_error("sleepable kfunc called from non-sleepable context");
+ return false;
+ }
+
+ return true;
+}
+
+/**
+ * scx_task_iter_init - Initialize a task iterator
+ * @iter: iterator to init
+ *
+ * Initialize @iter. Must be called with scx_tasks_lock held. Once initialized,
+ * @iter must eventually be exited with scx_task_iter_exit().
+ *
+ * scx_tasks_lock may be released between this and the first next() call or
+ * between any two next() calls. If scx_tasks_lock is released between two
+ * next() calls, the caller is responsible for ensuring that the task being
+ * iterated remains accessible either through RCU read lock or obtaining a
+ * reference count.
+ *
+ * All tasks which existed when the iteration started are guaranteed to be
+ * visited as long as they still exist.
+ */
+static void scx_task_iter_init(struct scx_task_iter *iter)
+{
+ lockdep_assert_held(&scx_tasks_lock);
+
+ iter->cursor = (struct sched_ext_entity){ .flags = SCX_TASK_CURSOR };
+ list_add(&iter->cursor.tasks_node, &scx_tasks);
+ iter->locked = NULL;
+}
+
+/**
+ * scx_task_iter_exit - Exit a task iterator
+ * @iter: iterator to exit
+ *
+ * Exit a previously initialized @iter. Must be called with scx_tasks_lock held.
+ * If the iterator holds a task's rq lock, that rq lock is released. See
+ * scx_task_iter_init() for details.
+ */
+static void scx_task_iter_exit(struct scx_task_iter *iter)
+{
+ struct list_head *cursor = &iter->cursor.tasks_node;
+
+ lockdep_assert_held(&scx_tasks_lock);
+
+ if (iter->locked) {
+ task_rq_unlock(iter->rq, iter->locked, &iter->rf);
+ iter->locked = NULL;
+ }
+
+ if (list_empty(cursor))
+ return;
+
+ list_del_init(cursor);
+}
+
+/**
+ * scx_task_iter_next - Next task
+ * @iter: iterator to walk
+ *
+ * Visit the next task. See scx_task_iter_init() for details.
+ */
+static struct task_struct *scx_task_iter_next(struct scx_task_iter *iter)
+{
+ struct list_head *cursor = &iter->cursor.tasks_node;
+ struct sched_ext_entity *pos;
+
+ lockdep_assert_held(&scx_tasks_lock);
+
+ list_for_each_entry(pos, cursor, tasks_node) {
+ if (&pos->tasks_node == &scx_tasks)
+ return NULL;
+ if (!(pos->flags & SCX_TASK_CURSOR)) {
+ list_move(cursor, &pos->tasks_node);
+ return container_of(pos, struct task_struct, scx);
+ }
+ }
+
+ /* can't happen, should always terminate at scx_tasks above */
+ BUG();
+}
+
+/**
+ * scx_task_iter_next_filtered - Next non-idle task
+ * @iter: iterator to walk
+ *
+ * Visit the next non-idle task. See scx_task_iter_init() for details.
+ */
+static struct task_struct *
+scx_task_iter_next_filtered(struct scx_task_iter *iter)
+{
+ struct task_struct *p;
+
+ while ((p = scx_task_iter_next(iter))) {
+ if (!is_idle_task(p))
+ return p;
+ }
+ return NULL;
+}
+
+/**
+ * scx_task_iter_next_filtered_locked - Next non-idle task with its rq locked
+ * @iter: iterator to walk
+ *
+ * Visit the next non-idle task with its rq lock held. See scx_task_iter_init()
+ * for details.
+ */
+static struct task_struct *
+scx_task_iter_next_filtered_locked(struct scx_task_iter *iter)
+{
+ struct task_struct *p;
+
+ if (iter->locked) {
+ task_rq_unlock(iter->rq, iter->locked, &iter->rf);
+ iter->locked = NULL;
+ }
+
+ p = scx_task_iter_next_filtered(iter);
+ if (!p)
+ return NULL;
+
+ iter->rq = task_rq_lock(p, &iter->rf);
+ iter->locked = p;
+ return p;
+}
+
+static enum scx_ops_enable_state scx_ops_enable_state(void)
+{
+ return atomic_read(&scx_ops_enable_state_var);
+}
+
+static enum scx_ops_enable_state
+scx_ops_set_enable_state(enum scx_ops_enable_state to)
+{
+ return atomic_xchg(&scx_ops_enable_state_var, to);
+}
+
+static bool scx_ops_tryset_enable_state(enum scx_ops_enable_state to,
+ enum scx_ops_enable_state from)
+{
+ int from_v = from;
+
+ return atomic_try_cmpxchg(&scx_ops_enable_state_var, &from_v, to);
+}
+
+static bool scx_ops_disabling(void)
+{
+ return unlikely(scx_ops_enable_state() == SCX_OPS_DISABLING);
+}
+
+/**
+ * wait_ops_state - Busy-wait the specified ops state to end
+ * @p: target task
+ * @opss: state to wait the end of
+ *
+ * Busy-wait for @p to transition out of @opss. This can only be used when the
+ * state part of @opss is %SCX_QUEUEING or %SCX_DISPATCHING. This function also
+ * has load_acquire semantics to ensure that the caller can see the updates made
+ * in the enqueueing and dispatching paths.
+ */
+static void wait_ops_state(struct task_struct *p, u64 opss)
+{
+ do {
+ cpu_relax();
+ } while (atomic64_read_acquire(&p->scx.ops_state) == opss);
+}
+
+/**
+ * ops_cpu_valid - Verify a cpu number
+ * @cpu: cpu number which came from a BPF ops
+ *
+ * @cpu is a cpu number which came from the BPF scheduler and can be any value.
+ * Verify that it is in range and one of the possible cpus.
+ */
+static bool ops_cpu_valid(s32 cpu)
+{
+ return likely(cpu >= 0 && cpu < nr_cpu_ids && cpu_possible(cpu));
+}
+
+/**
+ * ops_sanitize_err - Sanitize a -errno value
+ * @ops_name: operation to blame on failure
+ * @err: -errno value to sanitize
+ *
+ * Verify @err is a valid -errno. If not, trigger scx_ops_error() and return
+ * -%EPROTO. This is necessary because returning a rogue -errno up the chain can
+ * cause misbehaviors. For an example, a large negative return from
+ * ops.prep_enable() triggers an oops when passed up the call chain because the
+ * value fails IS_ERR() test after being encoded with ERR_PTR() and then is
+ * handled as a pointer.
+ */
+static int ops_sanitize_err(const char *ops_name, s32 err)
+{
+ if (err < 0 && err >= -MAX_ERRNO)
+ return err;
+
+ scx_ops_error("ops.%s() returned an invalid errno %d", ops_name, err);
+ return -EPROTO;
+}
+
+static void update_curr_scx(struct rq *rq)
+{
+ struct task_struct *curr = rq->curr;
+ u64 now = rq_clock_task(rq);
+ u64 delta_exec;
+
+ if (time_before_eq64(now, curr->se.exec_start))
+ return;
+
+ delta_exec = now - curr->se.exec_start;
+ curr->se.exec_start = now;
+ curr->se.sum_exec_runtime += delta_exec;
+ account_group_exec_runtime(curr, delta_exec);
+ cgroup_account_cputime(curr, delta_exec);
+
+ curr->scx.slice -= min(curr->scx.slice, delta_exec);
+}
+
+static void dispatch_enqueue(struct scx_dispatch_q *dsq, struct task_struct *p,
+ u64 enq_flags)
+{
+ bool is_local = dsq->id == SCX_DSQ_LOCAL;
+
+ WARN_ON_ONCE(p->scx.dsq || !list_empty(&p->scx.dsq_node));
+
+ if (!is_local) {
+ raw_spin_lock(&dsq->lock);
+ if (unlikely(dsq->id == SCX_DSQ_INVALID)) {
+ scx_ops_error("attempting to dispatch to a destroyed dsq");
+ /* fall back to the global dsq */
+ raw_spin_unlock(&dsq->lock);
+ dsq = &scx_dsq_global;
+ raw_spin_lock(&dsq->lock);
+ }
+ }
+
+ if (enq_flags & SCX_ENQ_HEAD)
+ list_add(&p->scx.dsq_node, &dsq->fifo);
+ else
+ list_add_tail(&p->scx.dsq_node, &dsq->fifo);
+ dsq->nr++;
+ p->scx.dsq = dsq;
+
+ /*
+ * We're transitioning out of QUEUEING or DISPATCHING. store_release to
+ * match waiters' load_acquire.
+ */
+ if (enq_flags & SCX_ENQ_CLEAR_OPSS)
+ atomic64_set_release(&p->scx.ops_state, SCX_OPSS_NONE);
+
+ if (is_local) {
+ struct rq *rq = container_of(dsq, struct rq, scx.local_dsq);
+
+ if (sched_class_above(&ext_sched_class, rq->curr->sched_class))
+ resched_curr(rq);
+ } else {
+ raw_spin_unlock(&dsq->lock);
+ }
+}
+
+static void dispatch_dequeue(struct scx_rq *scx_rq, struct task_struct *p)
+{
+ struct scx_dispatch_q *dsq = p->scx.dsq;
+ bool is_local = dsq == &scx_rq->local_dsq;
+
+ if (!dsq) {
+ WARN_ON_ONCE(!list_empty(&p->scx.dsq_node));
+ /*
+ * When dispatching directly from the BPF scheduler to a local
+ * DSQ, the task isn't associated with any DSQ but
+ * @p->scx.holding_cpu may be set under the protection of
+ * %SCX_OPSS_DISPATCHING.
+ */
+ if (p->scx.holding_cpu >= 0)
+ p->scx.holding_cpu = -1;
+ return;
+ }
+
+ if (!is_local)
+ raw_spin_lock(&dsq->lock);
+
+ /*
+ * Now that we hold @dsq->lock, @p->holding_cpu and @p->scx.dsq_node
+ * can't change underneath us.
+ */
+ if (p->scx.holding_cpu < 0) {
+ /* @p must still be on @dsq, dequeue */
+ WARN_ON_ONCE(list_empty(&p->scx.dsq_node));
+ list_del_init(&p->scx.dsq_node);
+ dsq->nr--;
+ } else {
+ /*
+ * We're racing against dispatch_to_local_dsq() which already
+ * removed @p from @dsq and set @p->scx.holding_cpu. Clear the
+ * holding_cpu which tells dispatch_to_local_dsq() that it lost
+ * the race.
+ */
+ WARN_ON_ONCE(!list_empty(&p->scx.dsq_node));
+ p->scx.holding_cpu = -1;
+ }
+ p->scx.dsq = NULL;
+
+ if (!is_local)
+ raw_spin_unlock(&dsq->lock);
+}
+
+static struct scx_dispatch_q *find_non_local_dsq(u64 dsq_id)
+{
+ lockdep_assert(rcu_read_lock_any_held());
+
+ if (dsq_id == SCX_DSQ_GLOBAL)
+ return &scx_dsq_global;
+ else
+ return rhashtable_lookup_fast(&dsq_hash, &dsq_id,
+ dsq_hash_params);
+}
+
+static struct scx_dispatch_q *find_dsq_for_dispatch(struct rq *rq, u64 dsq_id,
+ struct task_struct *p)
+{
+ struct scx_dispatch_q *dsq;
+
+ if (dsq_id == SCX_DSQ_LOCAL)
+ return &rq->scx.local_dsq;
+
+ dsq = find_non_local_dsq(dsq_id);
+ if (unlikely(!dsq)) {
+ scx_ops_error("non-existent DSQ 0x%llx for %s[%d]",
+ dsq_id, p->comm, p->pid);
+ return &scx_dsq_global;
+ }
+
+ return dsq;
+}
+
+static void direct_dispatch(struct task_struct *ddsp_task, struct task_struct *p,
+ u64 dsq_id, u64 enq_flags)
+{
+ struct scx_dispatch_q *dsq;
+
+ /* @p must match the task which is being enqueued */
+ if (unlikely(p != ddsp_task)) {
+ if (IS_ERR(ddsp_task))
+ scx_ops_error("%s[%d] already direct-dispatched",
+ p->comm, p->pid);
+ else
+ scx_ops_error("enqueueing %s[%d] but trying to direct-dispatch %s[%d]",
+ ddsp_task->comm, ddsp_task->pid,
+ p->comm, p->pid);
+ return;
+ }
+
+ /*
+ * %SCX_DSQ_LOCAL_ON is not supported during direct dispatch because
+ * dispatching to the local DSQ of a different CPU requires unlocking
+ * the current rq which isn't allowed in the enqueue path. Use
+ * ops.select_cpu() to be on the target CPU and then %SCX_DSQ_LOCAL.
+ */
+ if (unlikely((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON)) {
+ scx_ops_error("SCX_DSQ_LOCAL_ON can't be used for direct-dispatch");
+ return;
+ }
+
+ dsq = find_dsq_for_dispatch(task_rq(p), dsq_id, p);
+ dispatch_enqueue(dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
+
+ /*
+ * Mark that dispatch already happened by spoiling direct_dispatch_task
+ * with a non-NULL value which can never match a valid task pointer.
+ */
+ __this_cpu_write(direct_dispatch_task, ERR_PTR(-ESRCH));
+}
+
+static bool test_rq_online(struct rq *rq)
+{
+#ifdef CONFIG_SMP
+ return rq->online;
+#else
+ return true;
+#endif
+}
+
+static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
+ int sticky_cpu)
+{
+ struct task_struct **ddsp_taskp;
+ u64 qseq;
+
+ WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_QUEUED));
+
+ if (p->scx.flags & SCX_TASK_ENQ_LOCAL) {
+ enq_flags |= SCX_ENQ_LOCAL;
+ p->scx.flags &= ~SCX_TASK_ENQ_LOCAL;
+ }
+
+ /* rq migration */
+ if (sticky_cpu == cpu_of(rq))
+ goto local_norefill;
+
+ /*
+ * If !rq->online, we already told the BPF scheduler that the CPU is
+ * offline. We're just trying to on/offline the CPU. Don't bother the
+ * BPF scheduler.
+ */
+ if (unlikely(!test_rq_online(rq)))
+ goto local;
+
+ /* see %SCX_OPS_ENQ_EXITING */
+ if (!static_branch_unlikely(&scx_ops_enq_exiting) &&
+ unlikely(p->flags & PF_EXITING))
+ goto local;
+
+ /* see %SCX_OPS_ENQ_LAST */
+ if (!static_branch_unlikely(&scx_ops_enq_last) &&
+ (enq_flags & SCX_ENQ_LAST))
+ goto local;
+
+ if (!SCX_HAS_OP(enqueue)) {
+ if (enq_flags & SCX_ENQ_LOCAL)
+ goto local;
+ else
+ goto global;
+ }
+
+ /* DSQ bypass didn't trigger, enqueue on the BPF scheduler */
+ qseq = rq->scx.ops_qseq++ << SCX_OPSS_QSEQ_SHIFT;
+
+ WARN_ON_ONCE(atomic64_read(&p->scx.ops_state) != SCX_OPSS_NONE);
+ atomic64_set(&p->scx.ops_state, SCX_OPSS_QUEUEING | qseq);
+
+ ddsp_taskp = this_cpu_ptr(&direct_dispatch_task);
+ WARN_ON_ONCE(*ddsp_taskp);
+ *ddsp_taskp = p;
+
+ SCX_CALL_OP(SCX_KF_ENQUEUE_DISPATCH, enqueue, p, enq_flags);
+
+ /*
+ * If not directly dispatched, QUEUEING isn't clear yet and dispatch or
+ * dequeue may be waiting. The store_release matches their load_acquire.
+ */
+ if (*ddsp_taskp == p)
+ atomic64_set_release(&p->scx.ops_state, SCX_OPSS_QUEUED | qseq);
+ *ddsp_taskp = NULL;
+ return;
+
+local:
+ p->scx.slice = SCX_SLICE_DFL;
+local_norefill:
+ dispatch_enqueue(&rq->scx.local_dsq, p, enq_flags);
+ return;
+
+global:
+ p->scx.slice = SCX_SLICE_DFL;
+ dispatch_enqueue(&scx_dsq_global, p, enq_flags);
+}
+
+static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags)
+{
+ int sticky_cpu = p->scx.sticky_cpu;
+
+ if (sticky_cpu >= 0)
+ p->scx.sticky_cpu = -1;
+
+ /*
+ * Restoring a running task will be immediately followed by
+ * set_next_task_scx() which expects the task to not be on the BPF
+ * scheduler as tasks can only start running through local DSQs. Force
+ * direct-dispatch into the local DSQ by setting the sticky_cpu.
+ */
+ if (unlikely(enq_flags & ENQUEUE_RESTORE) && task_current(rq, p))
+ sticky_cpu = cpu_of(rq);
+
+ if (p->scx.flags & SCX_TASK_QUEUED)
+ return;
+
+ p->scx.flags |= SCX_TASK_QUEUED;
+ rq->scx.nr_running++;
+ add_nr_running(rq, 1);
+
+ do_enqueue_task(rq, p, enq_flags, sticky_cpu);
+}
+
+static void ops_dequeue(struct task_struct *p, u64 deq_flags)
+{
+ u64 opss;
+
+ /* acquire ensures that we see the preceding updates on QUEUED */
+ opss = atomic64_read_acquire(&p->scx.ops_state);
+
+ switch (opss & SCX_OPSS_STATE_MASK) {
+ case SCX_OPSS_NONE:
+ break;
+ case SCX_OPSS_QUEUEING:
+ /*
+ * QUEUEING is started and finished while holding @p's rq lock.
+ * As we're holding the rq lock now, we shouldn't see QUEUEING.
+ */
+ BUG();
+ case SCX_OPSS_QUEUED:
+ if (SCX_HAS_OP(dequeue))
+ scx_ops.dequeue(p, deq_flags);
+
+ if (atomic64_try_cmpxchg(&p->scx.ops_state, &opss,
+ SCX_OPSS_NONE))
+ break;
+ fallthrough;
+ case SCX_OPSS_DISPATCHING:
+ /*
+ * If @p is being dispatched from the BPF scheduler to a DSQ,
+ * wait for the transfer to complete so that @p doesn't get
+ * added to its DSQ after dequeueing is complete.
+ *
+ * As we're waiting on DISPATCHING with the rq locked, the
+ * dispatching side shouldn't try to lock the rq while
+ * DISPATCHING is set. See dispatch_to_local_dsq().
+ *
+ * DISPATCHING shouldn't have qseq set and control can reach
+ * here with NONE @opss from the above QUEUED case block.
+ * Explicitly wait on %SCX_OPSS_DISPATCHING instead of @opss.
+ */
+ wait_ops_state(p, SCX_OPSS_DISPATCHING);
+ BUG_ON(atomic64_read(&p->scx.ops_state) != SCX_OPSS_NONE);
+ break;
+ }
+}
+
+static void dequeue_task_scx(struct rq *rq, struct task_struct *p, int deq_flags)
+{
+ struct scx_rq *scx_rq = &rq->scx;
+
+ if (!(p->scx.flags & SCX_TASK_QUEUED))
+ return;
+
+ ops_dequeue(p, deq_flags);
+
+ if (deq_flags & SCX_DEQ_SLEEP)
+ p->scx.flags |= SCX_TASK_DEQD_FOR_SLEEP;
+ else
+ p->scx.flags &= ~SCX_TASK_DEQD_FOR_SLEEP;
+
+ p->scx.flags &= ~SCX_TASK_QUEUED;
+ scx_rq->nr_running--;
+ sub_nr_running(rq, 1);
+
+ dispatch_dequeue(scx_rq, p);
+}
+
+static void yield_task_scx(struct rq *rq)
+{
+ struct task_struct *p = rq->curr;
+
+ if (SCX_HAS_OP(yield))
+ scx_ops.yield(p, NULL);
+ else
+ p->scx.slice = 0;
+}
+
+static bool yield_to_task_scx(struct rq *rq, struct task_struct *to)
+{
+ struct task_struct *from = rq->curr;
+
+ if (SCX_HAS_OP(yield))
+ return scx_ops.yield(from, to);
+ else
+ return false;
+}
+
+#ifdef CONFIG_SMP
+/**
+ * move_task_to_local_dsq - Move a task from a different rq to a local DSQ
+ * @rq: rq to move the task into, currently locked
+ * @p: task to move
+ *
+ * Move @p which is currently on a different rq to @rq's local DSQ. The caller
+ * must:
+ *
+ * 1. Start with exclusive access to @p either through its DSQ lock or
+ * %SCX_OPSS_DISPATCHING flag.
+ *
+ * 2. Set @p->scx.holding_cpu to raw_smp_processor_id().
+ *
+ * 3. Remember task_rq(@p). Release the exclusive access so that we don't
+ * deadlock with dequeue.
+ *
+ * 4. Lock @rq and the task_rq from #3.
+ *
+ * 5. Call this function.
+ *
+ * Returns %true if @p was successfully moved. %false after racing dequeue and
+ * losing.
+ */
+static bool move_task_to_local_dsq(struct rq *rq, struct task_struct *p)
+{
+ struct rq *task_rq;
+
+ lockdep_assert_rq_held(rq);
+
+ /*
+ * If dequeue got to @p while we were trying to lock both rq's, it'd
+ * have cleared @p->scx.holding_cpu to -1. While other cpus may have
+ * updated it to different values afterwards, as this operation can't be
+ * preempted or recurse, @p->scx.holding_cpu can never become
+ * raw_smp_processor_id() again before we're done. Thus, we can tell
+ * whether we lost to dequeue by testing whether @p->scx.holding_cpu is
+ * still raw_smp_processor_id().
+ *
+ * See dispatch_dequeue() for the counterpart.
+ */
+ if (unlikely(p->scx.holding_cpu != raw_smp_processor_id()))
+ return false;
+
+ /* @p->rq couldn't have changed if we're still the holding cpu */
+ task_rq = task_rq(p);
+ lockdep_assert_rq_held(task_rq);
+
+ WARN_ON_ONCE(!cpumask_test_cpu(cpu_of(rq), p->cpus_ptr));
+ deactivate_task(task_rq, p, 0);
+ set_task_cpu(p, cpu_of(rq));
+ p->scx.sticky_cpu = cpu_of(rq);
+ activate_task(rq, p, 0);
+
+ return true;
+}
+
+/**
+ * dispatch_to_local_dsq_lock - Ensure source and desitnation rq's are locked
+ * @rq: current rq which is locked
+ * @rf: rq_flags to use when unlocking @rq
+ * @src_rq: rq to move task from
+ * @dst_rq: rq to move task to
+ *
+ * We're holding @rq lock and trying to dispatch a task from @src_rq to
+ * @dst_rq's local DSQ and thus need to lock both @src_rq and @dst_rq. Whether
+ * @rq stays locked isn't important as long as the state is restored after
+ * dispatch_to_local_dsq_unlock().
+ */
+static void dispatch_to_local_dsq_lock(struct rq *rq, struct rq_flags *rf,
+ struct rq *src_rq, struct rq *dst_rq)
+{
+ rq_unpin_lock(rq, rf);
+
+ if (src_rq == dst_rq) {
+ raw_spin_rq_unlock(rq);
+ raw_spin_rq_lock(dst_rq);
+ } else if (rq == src_rq) {
+ double_lock_balance(rq, dst_rq);
+ rq_repin_lock(rq, rf);
+ } else if (rq == dst_rq) {
+ double_lock_balance(rq, src_rq);
+ rq_repin_lock(rq, rf);
+ } else {
+ raw_spin_rq_unlock(rq);
+ double_rq_lock(src_rq, dst_rq);
+ }
+}
+
+/**
+ * dispatch_to_local_dsq_unlock - Undo dispatch_to_local_dsq_lock()
+ * @rq: current rq which is locked
+ * @rf: rq_flags to use when unlocking @rq
+ * @src_rq: rq to move task from
+ * @dst_rq: rq to move task to
+ *
+ * Unlock @src_rq and @dst_rq and ensure that @rq is locked on return.
+ */
+static void dispatch_to_local_dsq_unlock(struct rq *rq, struct rq_flags *rf,
+ struct rq *src_rq, struct rq *dst_rq)
+{
+ if (src_rq == dst_rq) {
+ raw_spin_rq_unlock(dst_rq);
+ raw_spin_rq_lock(rq);
+ rq_repin_lock(rq, rf);
+ } else if (rq == src_rq) {
+ double_unlock_balance(rq, dst_rq);
+ } else if (rq == dst_rq) {
+ double_unlock_balance(rq, src_rq);
+ } else {
+ double_rq_unlock(src_rq, dst_rq);
+ raw_spin_rq_lock(rq);
+ rq_repin_lock(rq, rf);
+ }
+}
+#endif /* CONFIG_SMP */
+
+
+static bool consume_dispatch_q(struct rq *rq, struct rq_flags *rf,
+ struct scx_dispatch_q *dsq)
+{
+ struct scx_rq *scx_rq = &rq->scx;
+ struct task_struct *p;
+ struct rq *task_rq;
+ bool moved = false;
+retry:
+ if (list_empty(&dsq->fifo))
+ return false;
+
+ raw_spin_lock(&dsq->lock);
+ list_for_each_entry(p, &dsq->fifo, scx.dsq_node) {
+ task_rq = task_rq(p);
+ if (rq == task_rq)
+ goto this_rq;
+ if (likely(test_rq_online(rq)) && !is_migration_disabled(p) &&
+ cpumask_test_cpu(cpu_of(rq), p->cpus_ptr))
+ goto remote_rq;
+ }
+ raw_spin_unlock(&dsq->lock);
+ return false;
+
+this_rq:
+ /* @dsq is locked and @p is on this rq */
+ WARN_ON_ONCE(p->scx.holding_cpu >= 0);
+ list_move_tail(&p->scx.dsq_node, &scx_rq->local_dsq.fifo);
+ dsq->nr--;
+ scx_rq->local_dsq.nr++;
+ p->scx.dsq = &scx_rq->local_dsq;
+ raw_spin_unlock(&dsq->lock);
+ return true;
+
+remote_rq:
+#ifdef CONFIG_SMP
+ /*
+ * @dsq is locked and @p is on a remote rq. @p is currently protected by
+ * @dsq->lock. We want to pull @p to @rq but may deadlock if we grab
+ * @task_rq while holding @dsq and @rq locks. As dequeue can't drop the
+ * rq lock or fail, do a little dancing from our side. See
+ * move_task_to_local_dsq().
+ */
+ WARN_ON_ONCE(p->scx.holding_cpu >= 0);
+ list_del_init(&p->scx.dsq_node);
+ dsq->nr--;
+ p->scx.holding_cpu = raw_smp_processor_id();
+ raw_spin_unlock(&dsq->lock);
+
+ rq_unpin_lock(rq, rf);
+ double_lock_balance(rq, task_rq);
+ rq_repin_lock(rq, rf);
+
+ moved = move_task_to_local_dsq(rq, p);
+
+ double_unlock_balance(rq, task_rq);
+#endif /* CONFIG_SMP */
+ if (likely(moved))
+ return true;
+ goto retry;
+}
+
+enum dispatch_to_local_dsq_ret {
+ DTL_DISPATCHED, /* successfully dispatched */
+ DTL_LOST, /* lost race to dequeue */
+ DTL_NOT_LOCAL, /* destination is not a local DSQ */
+ DTL_INVALID, /* invalid local dsq_id */
+};
+
+/**
+ * dispatch_to_local_dsq - Dispatch a task to a local dsq
+ * @rq: current rq which is locked
+ * @rf: rq_flags to use when unlocking @rq
+ * @dsq_id: destination dsq ID
+ * @p: task to dispatch
+ * @enq_flags: %SCX_ENQ_*
+ *
+ * We're holding @rq lock and want to dispatch @p to the local DSQ identified by
+ * @dsq_id. This function performs all the synchronization dancing needed
+ * because local DSQs are protected with rq locks.
+ *
+ * The caller must have exclusive ownership of @p (e.g. through
+ * %SCX_OPSS_DISPATCHING).
+ */
+static enum dispatch_to_local_dsq_ret
+dispatch_to_local_dsq(struct rq *rq, struct rq_flags *rf, u64 dsq_id,
+ struct task_struct *p, u64 enq_flags)
+{
+ struct rq *src_rq = task_rq(p);
+ struct rq *dst_rq;
+
+ /*
+ * We're synchronized against dequeue through DISPATCHING. As @p can't
+ * be dequeued, its task_rq and cpus_allowed are stable too.
+ */
+ if (dsq_id == SCX_DSQ_LOCAL) {
+ dst_rq = rq;
+ } else if ((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON) {
+ s32 cpu = dsq_id & SCX_DSQ_LOCAL_CPU_MASK;
+
+ if (!ops_cpu_valid(cpu)) {
+ scx_ops_error("invalid cpu %d in SCX_DSQ_LOCAL_ON verdict for %s[%d]",
+ cpu, p->comm, p->pid);
+ return DTL_INVALID;
+ }
+ dst_rq = cpu_rq(cpu);
+ } else {
+ return DTL_NOT_LOCAL;
+ }
+
+ /* if dispatching to @rq that @p is already on, no lock dancing needed */
+ if (rq == src_rq && rq == dst_rq) {
+ dispatch_enqueue(&dst_rq->scx.local_dsq, p,
+ enq_flags | SCX_ENQ_CLEAR_OPSS);
+ return DTL_DISPATCHED;
+ }
+
+#ifdef CONFIG_SMP
+ if (cpumask_test_cpu(cpu_of(dst_rq), p->cpus_ptr)) {
+ struct rq *locked_dst_rq = dst_rq;
+ bool dsp;
+
+ /*
+ * @p is on a possibly remote @src_rq which we need to lock to
+ * move the task. If dequeue is in progress, it'd be locking
+ * @src_rq and waiting on DISPATCHING, so we can't grab @src_rq
+ * lock while holding DISPATCHING.
+ *
+ * As DISPATCHING guarantees that @p is wholly ours, we can
+ * pretend that we're moving from a DSQ and use the same
+ * mechanism - mark the task under transfer with holding_cpu,
+ * release DISPATCHING and then follow the same protocol.
+ */
+ p->scx.holding_cpu = raw_smp_processor_id();
+
+ /* store_release ensures that dequeue sees the above */
+ atomic64_set_release(&p->scx.ops_state, SCX_OPSS_NONE);
+
+ dispatch_to_local_dsq_lock(rq, rf, src_rq, locked_dst_rq);
+
+ /*
+ * We don't require the BPF scheduler to avoid dispatching to
+ * offline CPUs mostly for convenience but also because CPUs can
+ * go offline between scx_bpf_dispatch() calls and here. If @p
+ * is destined to an offline CPU, queue it on its current CPU
+ * instead, which should always be safe. As this is an allowed
+ * behavior, don't trigger an ops error.
+ */
+ if (unlikely(!test_rq_online(dst_rq)))
+ dst_rq = src_rq;
+
+ if (src_rq == dst_rq) {
+ /*
+ * As @p is staying on the same rq, there's no need to
+ * go through the full deactivate/activate cycle.
+ * Optimize by abbreviating the operations in
+ * move_task_to_local_dsq().
+ */
+ dsp = p->scx.holding_cpu == raw_smp_processor_id();
+ if (likely(dsp)) {
+ p->scx.holding_cpu = -1;
+ dispatch_enqueue(&dst_rq->scx.local_dsq, p,
+ enq_flags);
+ }
+ } else {
+ dsp = move_task_to_local_dsq(dst_rq, p);
+ }
+
+ /* if the destination CPU is idle, wake it up */
+ if (dsp && p->sched_class > dst_rq->curr->sched_class)
+ resched_curr(dst_rq);
+
+ dispatch_to_local_dsq_unlock(rq, rf, src_rq, locked_dst_rq);
+
+ return dsp ? DTL_DISPATCHED : DTL_LOST;
+ }
+#endif /* CONFIG_SMP */
+
+ scx_ops_error("SCX_DSQ_LOCAL[_ON] verdict target cpu %d not allowed for %s[%d]",
+ cpu_of(dst_rq), p->comm, p->pid);
+ return DTL_INVALID;
+}
+
+/**
+ * finish_dispatch - Asynchronously finish dispatching a task
+ * @rq: current rq which is locked
+ * @rf: rq_flags to use when unlocking @rq
+ * @p: task to finish dispatching
+ * @qseq_at_dispatch: qseq when @p started getting dispatched
+ * @dsq_id: destination DSQ ID
+ * @enq_flags: %SCX_ENQ_*
+ *
+ * Dispatching to local DSQs may need to wait for queueing to complete or
+ * require rq lock dancing. As we don't wanna do either while inside
+ * ops.dispatch() to avoid locking order inversion, we split dispatching into
+ * two parts. scx_bpf_dispatch() which is called by ops.dispatch() records the
+ * task and its qseq. Once ops.dispatch() returns, this function is called to
+ * finish up.
+ *
+ * There is no guarantee that @p is still valid for dispatching or even that it
+ * was valid in the first place. Make sure that the task is still owned by the
+ * BPF scheduler and claim the ownership before dispatching.
+ */
+static void finish_dispatch(struct rq *rq, struct rq_flags *rf,
+ struct task_struct *p, u64 qseq_at_dispatch,
+ u64 dsq_id, u64 enq_flags)
+{
+ struct scx_dispatch_q *dsq;
+ u64 opss;
+
+retry:
+ /*
+ * No need for _acquire here. @p is accessed only after a successful
+ * try_cmpxchg to DISPATCHING.
+ */
+ opss = atomic64_read(&p->scx.ops_state);
+
+ switch (opss & SCX_OPSS_STATE_MASK) {
+ case SCX_OPSS_DISPATCHING:
+ case SCX_OPSS_NONE:
+ /* someone else already got to it */
+ return;
+ case SCX_OPSS_QUEUED:
+ /*
+ * If qseq doesn't match, @p has gone through at least one
+ * dispatch/dequeue and re-enqueue cycle between
+ * scx_bpf_dispatch() and here and we have no claim on it.
+ */
+ if ((opss & SCX_OPSS_QSEQ_MASK) != qseq_at_dispatch)
+ return;
+
+ /*
+ * While we know @p is accessible, we don't yet have a claim on
+ * it - the BPF scheduler is allowed to dispatch tasks
+ * spuriously and there can be a racing dequeue attempt. Let's
+ * claim @p by atomically transitioning it from QUEUED to
+ * DISPATCHING.
+ */
+ if (likely(atomic64_try_cmpxchg(&p->scx.ops_state, &opss,
+ SCX_OPSS_DISPATCHING)))
+ break;
+ goto retry;
+ case SCX_OPSS_QUEUEING:
+ /*
+ * do_enqueue_task() is in the process of transferring the task
+ * to the BPF scheduler while holding @p's rq lock. As we aren't
+ * holding any kernel or BPF resource that the enqueue path may
+ * depend upon, it's safe to wait.
+ */
+ wait_ops_state(p, opss);
+ goto retry;
+ }
+
+ BUG_ON(!(p->scx.flags & SCX_TASK_QUEUED));
+
+ switch (dispatch_to_local_dsq(rq, rf, dsq_id, p, enq_flags)) {
+ case DTL_DISPATCHED:
+ break;
+ case DTL_LOST:
+ break;
+ case DTL_INVALID:
+ dsq_id = SCX_DSQ_GLOBAL;
+ fallthrough;
+ case DTL_NOT_LOCAL:
+ dsq = find_dsq_for_dispatch(cpu_rq(raw_smp_processor_id()),
+ dsq_id, p);
+ dispatch_enqueue(dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
+ break;
+ }
+}
+
+static void flush_dispatch_buf(struct rq *rq, struct rq_flags *rf)
+{
+ struct scx_dsp_ctx *dspc = this_cpu_ptr(&scx_dsp_ctx);
+ u32 u;
+
+ for (u = 0; u < dspc->buf_cursor; u++) {
+ struct scx_dsp_buf_ent *ent = &this_cpu_ptr(scx_dsp_buf)[u];
+
+ finish_dispatch(rq, rf, ent->task, ent->qseq, ent->dsq_id,
+ ent->enq_flags);
+ }
+
+ dspc->nr_tasks += dspc->buf_cursor;
+ dspc->buf_cursor = 0;
+}
+
+static int balance_scx(struct rq *rq, struct task_struct *prev,
+ struct rq_flags *rf)
+{
+ struct scx_rq *scx_rq = &rq->scx;
+ struct scx_dsp_ctx *dspc = this_cpu_ptr(&scx_dsp_ctx);
+ bool prev_on_scx = prev->sched_class == &ext_sched_class;
+
+ lockdep_assert_rq_held(rq);
+
+ if (prev_on_scx) {
+ WARN_ON_ONCE(prev->scx.flags & SCX_TASK_BAL_KEEP);
+ update_curr_scx(rq);
+
+ /*
+ * If @prev is runnable & has slice left, it has priority and
+ * fetching more just increases latency for the fetched tasks.
+ * Tell put_prev_task_scx() to put @prev on local_dsq.
+ *
+ * See scx_ops_disable_workfn() for the explanation on the
+ * disabling() test.
+ */
+ if ((prev->scx.flags & SCX_TASK_QUEUED) &&
+ prev->scx.slice && !scx_ops_disabling()) {
+ prev->scx.flags |= SCX_TASK_BAL_KEEP;
+ return 1;
+ }
+ }
+
+ /* if there already are tasks to run, nothing to do */
+ if (scx_rq->local_dsq.nr)
+ return 1;
+
+ if (consume_dispatch_q(rq, rf, &scx_dsq_global))
+ return 1;
+
+ if (!SCX_HAS_OP(dispatch))
+ return 0;
+
+ dspc->rq = rq;
+ dspc->rf = rf;
+
+ /*
+ * The dispatch loop. Because flush_dispatch_buf() may drop the rq lock,
+ * the local DSQ might still end up empty after a successful
+ * ops.dispatch(). If the local DSQ is empty even after ops.dispatch()
+ * produced some tasks, retry. The BPF scheduler may depend on this
+ * looping behavior to simplify its implementation.
+ */
+ do {
+ dspc->nr_tasks = 0;
+
+ SCX_CALL_OP(SCX_KF_ENQUEUE_DISPATCH | SCX_KF_DISPATCH,
+ dispatch, cpu_of(rq), prev_on_scx ? prev : NULL);
+
+ flush_dispatch_buf(rq, rf);
+
+ if (scx_rq->local_dsq.nr)
+ return 1;
+ if (consume_dispatch_q(rq, rf, &scx_dsq_global))
+ return 1;
+ } while (dspc->nr_tasks);
+
+ return 0;
+}
+
+static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first)
+{
+ if (p->scx.flags & SCX_TASK_QUEUED) {
+ WARN_ON_ONCE(atomic64_read(&p->scx.ops_state) != SCX_OPSS_NONE);
+ dispatch_dequeue(&rq->scx, p);
+ }
+
+ p->se.exec_start = rq_clock_task(rq);
+}
+
+static void put_prev_task_scx(struct rq *rq, struct task_struct *p)
+{
+#ifndef CONFIG_SMP
+ /*
+ * UP workaround.
+ *
+ * Because SCX may transfer tasks across CPUs during dispatch, dispatch
+ * is performed from its balance operation which isn't called in UP.
+ * Let's work around by calling it from the operations which come right
+ * after.
+ *
+ * 1. If the prev task is on SCX, pick_next_task() calls
+ * .put_prev_task() right after. As .put_prev_task() is also called
+ * from other places, we need to distinguish the calls which can be
+ * done by looking at the previous task's state - if still queued or
+ * dequeued with %SCX_DEQ_SLEEP, the caller must be pick_next_task().
+ * This case is handled here.
+ *
+ * 2. If the prev task is not on SCX, the first following call into SCX
+ * will be .pick_next_task(), which is covered by calling
+ * balance_scx() from pick_next_task_scx().
+ *
+ * Note that we can't merge the first case into the second as
+ * balance_scx() must be called before the previous SCX task goes
+ * through put_prev_task_scx().
+ *
+ * As UP doesn't transfer tasks around, balance_scx() doesn't need @rf.
+ * Pass in %NULL.
+ */
+ if (p->scx.flags & (SCX_TASK_QUEUED | SCX_TASK_DEQD_FOR_SLEEP))
+ balance_scx(rq, p, NULL);
+#endif
+
+ update_curr_scx(rq);
+
+ /*
+ * If we're being called from put_prev_task_balance(), balance_scx() may
+ * have decided that @p should keep running.
+ */
+ if (p->scx.flags & SCX_TASK_BAL_KEEP) {
+ p->scx.flags &= ~SCX_TASK_BAL_KEEP;
+ dispatch_enqueue(&rq->scx.local_dsq, p, SCX_ENQ_HEAD);
+ return;
+ }
+
+ if (p->scx.flags & SCX_TASK_QUEUED) {
+ /*
+ * If @p has slice left and balance_scx() didn't tag it for
+ * keeping, @p is getting preempted by a higher priority
+ * scheduler class. Leave it at the head of the local DSQ.
+ */
+ if (p->scx.slice && !scx_ops_disabling()) {
+ dispatch_enqueue(&rq->scx.local_dsq, p, SCX_ENQ_HEAD);
+ return;
+ }
+
+ /*
+ * If we're in the pick_next_task path, balance_scx() should
+ * have already populated the local DSQ if there are any other
+ * available tasks. If empty, tell ops.enqueue() that @p is the
+ * only one available for this cpu. ops.enqueue() should put it
+ * on the local DSQ so that the subsequent pick_next_task_scx()
+ * can find the task unless it wants to trigger a separate
+ * follow-up scheduling event.
+ */
+ if (list_empty(&rq->scx.local_dsq.fifo))
+ do_enqueue_task(rq, p, SCX_ENQ_LAST | SCX_ENQ_LOCAL, -1);
+ else
+ do_enqueue_task(rq, p, 0, -1);
+ }
+}
+
+static struct task_struct *first_local_task(struct rq *rq)
+{
+ return list_first_entry_or_null(&rq->scx.local_dsq.fifo,
+ struct task_struct, scx.dsq_node);
+}
+
+static struct task_struct *pick_next_task_scx(struct rq *rq)
+{
+ struct task_struct *p;
+
+#ifndef CONFIG_SMP
+ /* UP workaround - see the comment at the head of put_prev_task_scx() */
+ if (unlikely(rq->curr->sched_class != &ext_sched_class))
+ balance_scx(rq, rq->curr, NULL);
+#endif
+
+ p = first_local_task(rq);
+ if (!p)
+ return NULL;
+
+ if (unlikely(!p->scx.slice)) {
+ if (!scx_ops_disabling() && !scx_warned_zero_slice) {
+ printk_deferred(KERN_WARNING "sched_ext: %s[%d] has zero slice in pick_next_task_scx()\n",
+ p->comm, p->pid);
+ scx_warned_zero_slice = true;
+ }
+ p->scx.slice = SCX_SLICE_DFL;
+ }
+
+ set_next_task_scx(rq, p, true);
+
+ return p;
+}
+
+#ifdef CONFIG_SMP
+
+static bool test_and_clear_cpu_idle(int cpu)
+{
+ if (cpumask_test_and_clear_cpu(cpu, idle_masks.cpu)) {
+ if (cpumask_empty(idle_masks.cpu))
+ scx_has_idle_cpus = false;
+ return true;
+ } else {
+ return false;
+ }
+}
+
+static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed)
+{
+ int cpu;
+
+ do {
+ cpu = cpumask_any_and_distribute(idle_masks.smt, cpus_allowed);
+ if (cpu < nr_cpu_ids) {
+ const struct cpumask *sbm = topology_sibling_cpumask(cpu);
+
+ /*
+ * If offline, @cpu is not its own sibling and we can
+ * get caught in an infinite loop as @cpu is never
+ * cleared from idle_masks.smt. Clear @cpu directly in
+ * such cases.
+ */
+ if (likely(cpumask_test_cpu(cpu, sbm)))
+ cpumask_andnot(idle_masks.smt, idle_masks.smt, sbm);
+ else
+ cpumask_andnot(idle_masks.smt, idle_masks.smt, cpumask_of(cpu));
+ } else {
+ cpu = cpumask_any_and_distribute(idle_masks.cpu, cpus_allowed);
+ if (cpu >= nr_cpu_ids)
+ return -EBUSY;
+ }
+ } while (!test_and_clear_cpu_idle(cpu));
+
+ return cpu;
+}
+
+static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags)
+{
+ s32 cpu;
+
+ if (!static_branch_likely(&scx_builtin_idle_enabled)) {
+ scx_ops_error("built-in idle tracking is disabled");
+ return prev_cpu;
+ }
+
+ /*
+ * If WAKE_SYNC and the machine isn't fully saturated, wake up @p to the
+ * local DSQ of the waker.
+ */
+ if ((wake_flags & SCX_WAKE_SYNC) && p->nr_cpus_allowed > 1 &&
+ scx_has_idle_cpus && !(current->flags & PF_EXITING)) {
+ cpu = smp_processor_id();
+ if (cpumask_test_cpu(cpu, p->cpus_ptr)) {
+ p->scx.flags |= SCX_TASK_ENQ_LOCAL;
+ return cpu;
+ }
+ }
+
+ /* if the previous CPU is idle, dispatch directly to it */
+ if (test_and_clear_cpu_idle(prev_cpu)) {
+ p->scx.flags |= SCX_TASK_ENQ_LOCAL;
+ return prev_cpu;
+ }
+
+ if (p->nr_cpus_allowed == 1)
+ return prev_cpu;
+
+ cpu = scx_pick_idle_cpu(p->cpus_ptr);
+ if (cpu >= 0) {
+ p->scx.flags |= SCX_TASK_ENQ_LOCAL;
+ return cpu;
+ }
+
+ return prev_cpu;
+}
+
+static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flags)
+{
+ if (SCX_HAS_OP(select_cpu)) {
+ s32 cpu;
+
+ cpu = scx_ops.select_cpu(p, prev_cpu, wake_flags);
+ if (ops_cpu_valid(cpu)) {
+ return cpu;
+ } else {
+ scx_ops_error("select_cpu returned invalid cpu %d", cpu);
+ return prev_cpu;
+ }
+ } else {
+ return scx_select_cpu_dfl(p, prev_cpu, wake_flags);
+ }
+}
+
+static void set_cpus_allowed_scx(struct task_struct *p,
+ struct affinity_context *ac)
+{
+ set_cpus_allowed_common(p, ac);
+
+ /*
+ * The effective cpumask is stored in @p->cpus_ptr which may temporarily
+ * differ from the configured one in @p->cpus_mask. Always tell the bpf
+ * scheduler the effective one.
+ *
+ * Fine-grained memory write control is enforced by BPF making the const
+ * designation pointless. Cast it away when calling the operation.
+ */
+ if (SCX_HAS_OP(set_cpumask))
+ scx_ops.set_cpumask(p, (struct cpumask *)p->cpus_ptr);
+}
+
+static void reset_idle_masks(void)
+{
+ /* consider all cpus idle, should converge to the actual state quickly */
+ cpumask_setall(idle_masks.cpu);
+ cpumask_setall(idle_masks.smt);
+ scx_has_idle_cpus = true;
+}
+
+void __scx_update_idle(struct rq *rq, bool idle)
+{
+ int cpu = cpu_of(rq);
+ struct cpumask *sib_mask = topology_sibling_cpumask(cpu);
+
+ if (SCX_HAS_OP(update_idle)) {
+ scx_ops.update_idle(cpu_of(rq), idle);
+ if (!static_branch_unlikely(&scx_builtin_idle_enabled))
+ return;
+ }
+
+ if (idle) {
+ cpumask_set_cpu(cpu, idle_masks.cpu);
+ if (!scx_has_idle_cpus)
+ scx_has_idle_cpus = true;
+
+ /*
+ * idle_masks.smt handling is racy but that's fine as it's only
+ * for optimization and self-correcting.
+ */
+ for_each_cpu(cpu, sib_mask) {
+ if (!cpumask_test_cpu(cpu, idle_masks.cpu))
+ return;
+ }
+ cpumask_or(idle_masks.smt, idle_masks.smt, sib_mask);
+ } else {
+ cpumask_clear_cpu(cpu, idle_masks.cpu);
+ if (scx_has_idle_cpus && cpumask_empty(idle_masks.cpu))
+ scx_has_idle_cpus = false;
+
+ cpumask_andnot(idle_masks.smt, idle_masks.smt, sib_mask);
+ }
+}
+
+#else /* !CONFIG_SMP */
+
+static bool test_and_clear_cpu_idle(int cpu) { return false; }
+static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed) { return -EBUSY; }
+static void reset_idle_masks(void) {}
+
+#endif /* CONFIG_SMP */
+
+static void task_tick_scx(struct rq *rq, struct task_struct *curr, int queued)
+{
+ update_curr_scx(rq);
+
+ /*
+ * While disabling, always resched as we can't trust the slice
+ * management.
+ */
+ if (scx_ops_disabling())
+ curr->scx.slice = 0;
+
+ if (!curr->scx.slice)
+ resched_curr(rq);
+}
+
+static int scx_ops_prepare_task(struct task_struct *p, struct task_group *tg)
+{
+ int ret;
+
+ WARN_ON_ONCE(p->scx.flags & SCX_TASK_OPS_PREPPED);
+
+ if (SCX_HAS_OP(prep_enable)) {
+ struct scx_enable_args args = { };
+
+ ret = SCX_CALL_OP_RET(SCX_KF_SLEEPABLE, prep_enable, p, &args);
+ if (unlikely(ret)) {
+ ret = ops_sanitize_err("prep_enable", ret);
+ return ret;
+ }
+ }
+
+ p->scx.flags |= SCX_TASK_OPS_PREPPED;
+ return 0;
+}
+
+static void scx_ops_enable_task(struct task_struct *p)
+{
+ lockdep_assert_rq_held(task_rq(p));
+ WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_OPS_PREPPED));
+
+ if (SCX_HAS_OP(enable)) {
+ struct scx_enable_args args = { };
+ scx_ops.enable(p, &args);
+ }
+ p->scx.flags &= ~SCX_TASK_OPS_PREPPED;
+ p->scx.flags |= SCX_TASK_OPS_ENABLED;
+}
+
+static void scx_ops_disable_task(struct task_struct *p)
+{
+ lockdep_assert_rq_held(task_rq(p));
+
+ if (p->scx.flags & SCX_TASK_OPS_PREPPED) {
+ if (SCX_HAS_OP(cancel_enable)) {
+ struct scx_enable_args args = { };
+ scx_ops.cancel_enable(p, &args);
+ }
+ p->scx.flags &= ~SCX_TASK_OPS_PREPPED;
+ } else if (p->scx.flags & SCX_TASK_OPS_ENABLED) {
+ if (SCX_HAS_OP(disable))
+ scx_ops.disable(p);
+ p->scx.flags &= ~SCX_TASK_OPS_ENABLED;
+ }
+}
+
+/**
+ * refresh_scx_weight - Refresh a task's ext weight
+ * @p: task to refresh ext weight for
+ *
+ * @p->scx.weight carries the task's static priority in cgroup weight scale to
+ * enable easy access from the BPF scheduler. To keep it synchronized with the
+ * current task priority, this function should be called when a new task is
+ * created, priority is changed for a task on sched_ext, and a task is switched
+ * to sched_ext from other classes.
+ */
+static void refresh_scx_weight(struct task_struct *p)
+{
+ u32 weight = sched_prio_to_weight[p->static_prio - MAX_RT_PRIO];
+
+ p->scx.weight = sched_weight_to_cgroup(weight);
+}
+
+void scx_pre_fork(struct task_struct *p)
+{
+ /*
+ * BPF scheduler enable/disable paths want to be able to iterate and
+ * update all tasks which can become complex when racing forks. As
+ * enable/disable are very cold paths, let's use a percpu_rwsem to
+ * exclude forks.
+ */
+ percpu_down_read(&scx_fork_rwsem);
+}
+
+int scx_fork(struct task_struct *p)
+{
+ percpu_rwsem_assert_held(&scx_fork_rwsem);
+
+ if (scx_enabled())
+ return scx_ops_prepare_task(p, task_group(p));
+ else
+ return 0;
+}
+
+void scx_post_fork(struct task_struct *p)
+{
+ refresh_scx_weight(p);
+
+ if (scx_enabled()) {
+ struct rq_flags rf;
+ struct rq *rq;
+
+ rq = task_rq_lock(p, &rf);
+ scx_ops_enable_task(p);
+ task_rq_unlock(rq, p, &rf);
+ }
+
+ spin_lock_irq(&scx_tasks_lock);
+ list_add_tail(&p->scx.tasks_node, &scx_tasks);
+ spin_unlock_irq(&scx_tasks_lock);
+
+ percpu_up_read(&scx_fork_rwsem);
+}
+
+void scx_cancel_fork(struct task_struct *p)
+{
+ if (scx_enabled())
+ scx_ops_disable_task(p);
+ percpu_up_read(&scx_fork_rwsem);
+}
+
+void sched_ext_free(struct task_struct *p)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&scx_tasks_lock, flags);
+ list_del_init(&p->scx.tasks_node);
+ spin_unlock_irqrestore(&scx_tasks_lock, flags);
+
+ /*
+ * @p is off scx_tasks and wholly ours. scx_ops_enable()'s PREPPED ->
+ * ENABLED transitions can't race us. Disable ops for @p.
+ */
+ if (p->scx.flags & (SCX_TASK_OPS_PREPPED | SCX_TASK_OPS_ENABLED)) {
+ struct rq_flags rf;
+ struct rq *rq;
+
+ rq = task_rq_lock(p, &rf);
+ scx_ops_disable_task(p);
+ task_rq_unlock(rq, p, &rf);
+ }
+}
+
+static void reweight_task_scx(struct rq *rq, struct task_struct *p, int newprio)
+{
+ refresh_scx_weight(p);
+}
+
+static void prio_changed_scx(struct rq *rq, struct task_struct *p, int oldprio)
+{
+}
+
+static void switching_to_scx(struct rq *rq, struct task_struct *p)
+{
+ refresh_scx_weight(p);
+
+ /*
+ * set_cpus_allowed_scx() is not called while @p is associated with a
+ * different scheduler class. Keep the BPF scheduler up-to-date.
+ */
+ if (SCX_HAS_OP(set_cpumask))
+ scx_ops.set_cpumask(p, (struct cpumask *)p->cpus_ptr);
+}
+
+static void check_preempt_curr_scx(struct rq *rq, struct task_struct *p,int wake_flags) {}
+static void switched_to_scx(struct rq *rq, struct task_struct *p) {}
+
+/*
+ * Omitted operations:
+ *
+ * - check_preempt_curr: NOOP as it isn't useful in the wakeup path because the
+ * task isn't tied to the CPU at that point.
+ *
+ * - migrate_task_rq: Unncessary as task to cpu mapping is transient.
+ *
+ * - task_fork/dead: We need fork/dead notifications for all tasks regardless of
+ * their current sched_class. Call them directly from sched core instead.
+ *
+ * - task_woken, switched_from: Unnecessary.
+ */
+DEFINE_SCHED_CLASS(ext) = {
+ .enqueue_task = enqueue_task_scx,
+ .dequeue_task = dequeue_task_scx,
+ .yield_task = yield_task_scx,
+ .yield_to_task = yield_to_task_scx,
+
+ .check_preempt_curr = check_preempt_curr_scx,
+
+ .pick_next_task = pick_next_task_scx,
+
+ .put_prev_task = put_prev_task_scx,
+ .set_next_task = set_next_task_scx,
+
+#ifdef CONFIG_SMP
+ .balance = balance_scx,
+ .select_task_rq = select_task_rq_scx,
+ .set_cpus_allowed = set_cpus_allowed_scx,
+#endif
+
+ .task_tick = task_tick_scx,
+
+ .switching_to = switching_to_scx,
+ .switched_to = switched_to_scx,
+ .reweight_task = reweight_task_scx,
+ .prio_changed = prio_changed_scx,
+
+ .update_curr = update_curr_scx,
+
+#ifdef CONFIG_UCLAMP_TASK
+ .uclamp_enabled = 0,
+#endif
+};
+
+static void init_dsq(struct scx_dispatch_q *dsq, u64 dsq_id)
+{
+ memset(dsq, 0, sizeof(*dsq));
+
+ raw_spin_lock_init(&dsq->lock);
+ INIT_LIST_HEAD(&dsq->fifo);
+ dsq->id = dsq_id;
+}
+
+static struct scx_dispatch_q *create_dsq(u64 dsq_id, int node)
+{
+ struct scx_dispatch_q *dsq;
+ int ret;
+
+ if (dsq_id & SCX_DSQ_FLAG_BUILTIN)
+ return ERR_PTR(-EINVAL);
+
+ dsq = kmalloc_node(sizeof(*dsq), GFP_KERNEL, node);
+ if (!dsq)
+ return ERR_PTR(-ENOMEM);
+
+ init_dsq(dsq, dsq_id);
+
+ ret = rhashtable_insert_fast(&dsq_hash, &dsq->hash_node,
+ dsq_hash_params);
+ if (ret) {
+ kfree(dsq);
+ return ERR_PTR(ret);
+ }
+ return dsq;
+}
+
+static void free_dsq_irq_workfn(struct irq_work *irq_work)
+{
+ struct llist_node *to_free = llist_del_all(&dsqs_to_free);
+ struct scx_dispatch_q *dsq, *tmp_dsq;
+
+ llist_for_each_entry_safe(dsq, tmp_dsq, to_free, free_node)
+ kfree_rcu(dsq);
+}
+
+static DEFINE_IRQ_WORK(free_dsq_irq_work, free_dsq_irq_workfn);
+
+static void destroy_dsq(u64 dsq_id)
+{
+ struct scx_dispatch_q *dsq;
+ unsigned long flags;
+
+ rcu_read_lock();
+
+ dsq = rhashtable_lookup_fast(&dsq_hash, &dsq_id, dsq_hash_params);
+ if (!dsq)
+ goto out_unlock_rcu;
+
+ raw_spin_lock_irqsave(&dsq->lock, flags);
+
+ if (dsq->nr) {
+ scx_ops_error("attempting to destroy in-use dsq 0x%016llx (nr=%u)",
+ dsq->id, dsq->nr);
+ goto out_unlock_dsq;
+ }
+
+ if (rhashtable_remove_fast(&dsq_hash, &dsq->hash_node, dsq_hash_params))
+ goto out_unlock_dsq;
+
+ /*
+ * Mark dead by invalidating ->id to prevent dispatch_enqueue() from
+ * queueing more tasks. As this function can be called from anywhere,
+ * freeing is bounced through an irq work to avoid nesting RCU
+ * operations inside scheduler locks.
+ */
+ dsq->id = SCX_DSQ_INVALID;
+ llist_add(&dsq->free_node, &dsqs_to_free);
+ irq_work_queue(&free_dsq_irq_work);
+
+out_unlock_dsq:
+ raw_spin_unlock_irqrestore(&dsq->lock, flags);
+out_unlock_rcu:
+ rcu_read_unlock();
+}
+
+/*
+ * Used by sched_fork() and __setscheduler_prio() to pick the matching
+ * sched_class. dl/rt are already handled.
+ */
+bool task_on_scx(struct task_struct *p)
+{
+ if (!scx_enabled() || scx_ops_disabling())
+ return false;
+ return p->policy == SCHED_EXT;
+}
+
+static void scx_ops_fallback_enqueue(struct task_struct *p, u64 enq_flags)
+{
+ if (enq_flags & SCX_ENQ_LAST)
+ scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, enq_flags);
+ else
+ scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
+}
+
+static void scx_ops_fallback_dispatch(s32 cpu, struct task_struct *prev) {}
+
+static void scx_ops_disable_workfn(struct kthread_work *work)
+{
+ struct scx_exit_info *ei = &scx_exit_info;
+ struct scx_task_iter sti;
+ struct task_struct *p;
+ struct rhashtable_iter rht_iter;
+ struct scx_dispatch_q *dsq;
+ const char *reason;
+ int i, type;
+
+ type = atomic_read(&scx_exit_type);
+ while (true) {
+ /*
+ * NONE indicates that a new scx_ops has been registered since
+ * disable was scheduled - don't kill the new ops. DONE
+ * indicates that the ops has already been disabled.
+ */
+ if (type == SCX_EXIT_NONE || type == SCX_EXIT_DONE)
+ return;
+ if (atomic_try_cmpxchg(&scx_exit_type, &type, SCX_EXIT_DONE))
+ break;
+ }
+
+ switch (type) {
+ case SCX_EXIT_UNREG:
+ reason = "BPF scheduler unregistered";
+ break;
+ case SCX_EXIT_ERROR:
+ reason = "runtime error";
+ break;
+ case SCX_EXIT_ERROR_BPF:
+ reason = "scx_bpf_error";
+ break;
+ default:
+ reason = "<UNKNOWN>";
+ }
+
+ ei->type = type;
+ strlcpy(ei->reason, reason, sizeof(ei->reason));
+
+ switch (scx_ops_set_enable_state(SCX_OPS_DISABLING)) {
+ case SCX_OPS_DISABLED:
+ pr_warn("sched_ext: ops error detected without ops (%s)\n",
+ scx_exit_info.msg);
+ WARN_ON_ONCE(scx_ops_set_enable_state(SCX_OPS_DISABLED) !=
+ SCX_OPS_DISABLING);
+ return;
+ case SCX_OPS_PREPPING:
+ goto forward_progress_guaranteed;
+ case SCX_OPS_DISABLING:
+ /* shouldn't happen but handle it like ENABLING if it does */
+ WARN_ONCE(true, "sched_ext: duplicate disabling instance?");
+ fallthrough;
+ case SCX_OPS_ENABLING:
+ case SCX_OPS_ENABLED:
+ break;
+ }
+
+ /*
+ * DISABLING is set and ops was either ENABLING or ENABLED indicating
+ * that the ops and static branches are set.
+ *
+ * We must guarantee that all runnable tasks make forward progress
+ * without trusting the BPF scheduler. We can't grab any mutexes or
+ * rwsems as they might be held by tasks that the BPF scheduler is
+ * forgetting to run, which unfortunately also excludes toggling the
+ * static branches.
+ *
+ * Let's work around by overriding a couple ops and modifying behaviors
+ * based on the DISABLING state and then cycling the tasks through
+ * dequeue/enqueue to force global FIFO scheduling.
+ *
+ * a. ops.enqueue() and .dispatch() are overridden for simple global
+ * FIFO scheduling.
+ *
+ * b. balance_scx() never sets %SCX_TASK_BAL_KEEP as the slice value
+ * can't be trusted. Whenever a tick triggers, the running task is
+ * rotated to the tail of the queue.
+ *
+ * c. pick_next_task() suppresses zero slice warning.
+ */
+ scx_ops.enqueue = scx_ops_fallback_enqueue;
+ scx_ops.dispatch = scx_ops_fallback_dispatch;
+
+ spin_lock_irq(&scx_tasks_lock);
+ scx_task_iter_init(&sti);
+ while ((p = scx_task_iter_next_filtered_locked(&sti))) {
+ if (READ_ONCE(p->__state) != TASK_DEAD) {
+ SCHED_CHANGE_BLOCK(task_rq(p), p,
+ DEQUEUE_SAVE | DEQUEUE_MOVE) {
+ /* cycling deq/enq is enough, see above */
+ }
+ }
+ }
+ scx_task_iter_exit(&sti);
+ spin_unlock_irq(&scx_tasks_lock);
+
+forward_progress_guaranteed:
+ /*
+ * Here, every runnable task is guaranteed to make forward progress and
+ * we can safely use blocking synchronization constructs. Actually
+ * disable ops.
+ */
+ mutex_lock(&scx_ops_enable_mutex);
+
+ /* avoid racing against fork */
+ cpus_read_lock();
+ percpu_down_write(&scx_fork_rwsem);
+
+ spin_lock_irq(&scx_tasks_lock);
+ scx_task_iter_init(&sti);
+ while ((p = scx_task_iter_next_filtered_locked(&sti))) {
+ const struct sched_class *old_class = p->sched_class;
+ struct rq *rq = task_rq(p);
+ bool alive = READ_ONCE(p->__state) != TASK_DEAD;
+
+ update_rq_clock(rq);
+
+ SCHED_CHANGE_BLOCK(rq, p, DEQUEUE_SAVE | DEQUEUE_MOVE |
+ DEQUEUE_NOCLOCK) {
+ p->scx.slice = min_t(u64, p->scx.slice, SCX_SLICE_DFL);
+
+ __setscheduler_prio(p, p->prio);
+ if (alive)
+ check_class_changing(task_rq(p), p, old_class);
+ }
+
+ if (alive)
+ check_class_changed(task_rq(p), p, old_class, p->prio);
+
+ scx_ops_disable_task(p);
+ }
+ scx_task_iter_exit(&sti);
+ spin_unlock_irq(&scx_tasks_lock);
+
+ /* no task is on scx, turn off all the switches and flush in-progress calls */
+ static_branch_disable_cpuslocked(&__scx_ops_enabled);
+ for (i = 0; i < SCX_NR_ONLINE_OPS; i++)
+ static_branch_disable_cpuslocked(&scx_has_op[i]);
+ static_branch_disable_cpuslocked(&scx_ops_enq_last);
+ static_branch_disable_cpuslocked(&scx_ops_enq_exiting);
+ static_branch_disable_cpuslocked(&scx_builtin_idle_enabled);
+ synchronize_rcu();
+
+ percpu_up_write(&scx_fork_rwsem);
+ cpus_read_unlock();
+
+ if (ei->type >= SCX_EXIT_ERROR) {
+ printk(KERN_ERR "sched_ext: BPF scheduler \"%s\" errored, disabling\n", scx_ops.name);
+
+ if (ei->msg[0] == '\0')
+ printk(KERN_ERR "sched_ext: %s\n", ei->reason);
+ else
+ printk(KERN_ERR "sched_ext: %s (%s)\n", ei->reason, ei->msg);
+
+ stack_trace_print(ei->bt, ei->bt_len, 2);
+ }
+
+ if (scx_ops.exit)
+ scx_ops.exit(ei);
+
+ memset(&scx_ops, 0, sizeof(scx_ops));
+
+ rhashtable_walk_enter(&dsq_hash, &rht_iter);
+ do {
+ rhashtable_walk_start(&rht_iter);
+
+ while ((dsq = rhashtable_walk_next(&rht_iter)) && !IS_ERR(dsq))
+ destroy_dsq(dsq->id);
+
+ rhashtable_walk_stop(&rht_iter);
+ } while (dsq == ERR_PTR(-EAGAIN));
+ rhashtable_walk_exit(&rht_iter);
+
+ free_percpu(scx_dsp_buf);
+ scx_dsp_buf = NULL;
+ scx_dsp_max_batch = 0;
+
+ mutex_unlock(&scx_ops_enable_mutex);
+
+ WARN_ON_ONCE(scx_ops_set_enable_state(SCX_OPS_DISABLED) !=
+ SCX_OPS_DISABLING);
+}
+
+static DEFINE_KTHREAD_WORK(scx_ops_disable_work, scx_ops_disable_workfn);
+
+static void schedule_scx_ops_disable_work(void)
+{
+ struct kthread_worker *helper = READ_ONCE(scx_ops_helper);
+
+ /*
+ * We may be called spuriously before the first bpf_sched_ext_reg(). If
+ * scx_ops_helper isn't set up yet, there's nothing to do.
+ */
+ if (helper)
+ kthread_queue_work(helper, &scx_ops_disable_work);
+}
+
+static void scx_ops_disable(enum scx_exit_type type)
+{
+ int none = SCX_EXIT_NONE;
+
+ if (WARN_ON_ONCE(type == SCX_EXIT_NONE || type == SCX_EXIT_DONE))
+ type = SCX_EXIT_ERROR;
+
+ atomic_try_cmpxchg(&scx_exit_type, &none, type);
+
+ schedule_scx_ops_disable_work();
+}
+
+static void scx_ops_error_irq_workfn(struct irq_work *irq_work)
+{
+ schedule_scx_ops_disable_work();
+}
+
+static DEFINE_IRQ_WORK(scx_ops_error_irq_work, scx_ops_error_irq_workfn);
+
+__printf(2, 3) static void scx_ops_error_type(enum scx_exit_type type,
+ const char *fmt, ...)
+{
+ struct scx_exit_info *ei = &scx_exit_info;
+ int none = SCX_EXIT_NONE;
+ va_list args;
+
+ if (!atomic_try_cmpxchg(&scx_exit_type, &none, type))
+ return;
+
+ ei->bt_len = stack_trace_save(ei->bt, ARRAY_SIZE(ei->bt), 1);
+
+ va_start(args, fmt);
+ vscnprintf(ei->msg, ARRAY_SIZE(ei->msg), fmt, args);
+ va_end(args);
+
+ irq_work_queue(&scx_ops_error_irq_work);
+}
+
+static struct kthread_worker *scx_create_rt_helper(const char *name)
+{
+ struct kthread_worker *helper;
+
+ helper = kthread_create_worker(0, name);
+ if (helper)
+ sched_set_fifo(helper->task);
+ return helper;
+}
+
+static int scx_ops_enable(struct sched_ext_ops *ops)
+{
+ struct scx_task_iter sti;
+ struct task_struct *p;
+ int i, ret;
+
+ mutex_lock(&scx_ops_enable_mutex);
+
+ if (!scx_ops_helper) {
+ WRITE_ONCE(scx_ops_helper,
+ scx_create_rt_helper("sched_ext_ops_helper"));
+ if (!scx_ops_helper) {
+ ret = -ENOMEM;
+ goto err_unlock;
+ }
+ }
+
+ if (scx_ops_enable_state() != SCX_OPS_DISABLED) {
+ ret = -EBUSY;
+ goto err_unlock;
+ }
+
+ /*
+ * Set scx_ops, transition to PREPPING and clear exit info to arm the
+ * disable path. Failure triggers full disabling from here on.
+ */
+ scx_ops = *ops;
+
+ WARN_ON_ONCE(scx_ops_set_enable_state(SCX_OPS_PREPPING) !=
+ SCX_OPS_DISABLED);
+
+ memset(&scx_exit_info, 0, sizeof(scx_exit_info));
+ atomic_set(&scx_exit_type, SCX_EXIT_NONE);
+ scx_warned_zero_slice = false;
+
+ atomic64_set(&scx_nr_rejected, 0);
+
+ /*
+ * Keep CPUs stable during enable so that the BPF scheduler can track
+ * online CPUs by watching ->on/offline_cpu() after ->init().
+ */
+ cpus_read_lock();
+
+ if (scx_ops.init) {
+ ret = SCX_CALL_OP_RET(SCX_KF_INIT | SCX_KF_SLEEPABLE, init);
+ if (ret) {
+ ret = ops_sanitize_err("init", ret);
+ goto err_disable;
+ }
+
+ /*
+ * Exit early if ops.init() triggered scx_bpf_error(). Not
+ * strictly necessary as we'll fail transitioning into ENABLING
+ * later but that'd be after calling ops.prep_enable() on all
+ * tasks and with -EBUSY which isn't very intuitive. Let's exit
+ * early with success so that the condition is notified through
+ * ops.exit() like other scx_bpf_error() invocations.
+ */
+ if (atomic_read(&scx_exit_type) != SCX_EXIT_NONE)
+ goto err_disable;
+ }
+
+ WARN_ON_ONCE(scx_dsp_buf);
+ scx_dsp_max_batch = ops->dispatch_max_batch ?: SCX_DSP_DFL_MAX_BATCH;
+ scx_dsp_buf = __alloc_percpu(sizeof(scx_dsp_buf[0]) * scx_dsp_max_batch,
+ __alignof__(scx_dsp_buf[0]));
+ if (!scx_dsp_buf) {
+ ret = -ENOMEM;
+ goto err_disable;
+ }
+
+ /*
+ * Lock out forks before opening the floodgate so that they don't wander
+ * into the operations prematurely.
+ */
+ percpu_down_write(&scx_fork_rwsem);
+
+ for (i = 0; i < SCX_NR_ONLINE_OPS; i++)
+ if (((void (**)(void))ops)[i])
+ static_branch_enable_cpuslocked(&scx_has_op[i]);
+
+ if (ops->flags & SCX_OPS_ENQ_LAST)
+ static_branch_enable_cpuslocked(&scx_ops_enq_last);
+
+ if (ops->flags & SCX_OPS_ENQ_EXITING)
+ static_branch_enable_cpuslocked(&scx_ops_enq_exiting);
+
+ if (!ops->update_idle || (ops->flags & SCX_OPS_KEEP_BUILTIN_IDLE)) {
+ reset_idle_masks();
+ static_branch_enable_cpuslocked(&scx_builtin_idle_enabled);
+ } else {
+ static_branch_disable_cpuslocked(&scx_builtin_idle_enabled);
+ }
+
+ static_branch_enable_cpuslocked(&__scx_ops_enabled);
+
+ /*
+ * Enable ops for every task. Fork is excluded by scx_fork_rwsem
+ * preventing new tasks from being added. No need to exclude tasks
+ * leaving as sched_ext_free() can handle both prepped and enabled
+ * tasks. Prep all tasks first and then enable them with preemption
+ * disabled.
+ */
+ spin_lock_irq(&scx_tasks_lock);
+
+ scx_task_iter_init(&sti);
+ while ((p = scx_task_iter_next_filtered(&sti))) {
+ get_task_struct(p);
+ spin_unlock_irq(&scx_tasks_lock);
+
+ ret = scx_ops_prepare_task(p, task_group(p));
+ if (ret) {
+ put_task_struct(p);
+ spin_lock_irq(&scx_tasks_lock);
+ scx_task_iter_exit(&sti);
+ spin_unlock_irq(&scx_tasks_lock);
+ pr_err("sched_ext: ops.prep_enable() failed (%d) for %s[%d] while loading\n",
+ ret, p->comm, p->pid);
+ goto err_disable_unlock;
+ }
+
+ put_task_struct(p);
+ spin_lock_irq(&scx_tasks_lock);
+ }
+ scx_task_iter_exit(&sti);
+
+ /*
+ * All tasks are prepped but are still ops-disabled. Ensure that
+ * %current can't be scheduled out and switch everyone.
+ * preempt_disable() is necessary because we can't guarantee that
+ * %current won't be starved if scheduled out while switching.
+ */
+ preempt_disable();
+
+ /*
+ * From here on, the disable path must assume that tasks have ops
+ * enabled and need to be recovered.
+ */
+ if (!scx_ops_tryset_enable_state(SCX_OPS_ENABLING, SCX_OPS_PREPPING)) {
+ preempt_enable();
+ spin_unlock_irq(&scx_tasks_lock);
+ ret = -EBUSY;
+ goto err_disable_unlock;
+ }
+
+ /*
+ * We're fully committed and can't fail. The PREPPED -> ENABLED
+ * transitions here are synchronized against sched_ext_free() through
+ * scx_tasks_lock.
+ */
+ scx_task_iter_init(&sti);
+ while ((p = scx_task_iter_next_filtered_locked(&sti))) {
+ if (READ_ONCE(p->__state) != TASK_DEAD) {
+ const struct sched_class *old_class = p->sched_class;
+ struct rq *rq = task_rq(p);
+
+ update_rq_clock(rq);
+
+ SCHED_CHANGE_BLOCK(rq, p, DEQUEUE_SAVE | DEQUEUE_MOVE |
+ DEQUEUE_NOCLOCK) {
+ scx_ops_enable_task(p);
+ __setscheduler_prio(p, p->prio);
+ check_class_changing(task_rq(p), p, old_class);
+ }
+
+ check_class_changed(task_rq(p), p, old_class, p->prio);
+ } else {
+ scx_ops_disable_task(p);
+ }
+ }
+ scx_task_iter_exit(&sti);
+
+ spin_unlock_irq(&scx_tasks_lock);
+ preempt_enable();
+ percpu_up_write(&scx_fork_rwsem);
+
+ if (!scx_ops_tryset_enable_state(SCX_OPS_ENABLED, SCX_OPS_ENABLING)) {
+ ret = -EBUSY;
+ goto err_disable_unlock;
+ }
+
+ cpus_read_unlock();
+ mutex_unlock(&scx_ops_enable_mutex);
+
+ return 0;
+
+err_unlock:
+ mutex_unlock(&scx_ops_enable_mutex);
+ return ret;
+
+err_disable_unlock:
+ percpu_up_write(&scx_fork_rwsem);
+err_disable:
+ cpus_read_unlock();
+ mutex_unlock(&scx_ops_enable_mutex);
+ /* must be fully disabled before returning */
+ scx_ops_disable(SCX_EXIT_ERROR);
+ kthread_flush_work(&scx_ops_disable_work);
+ return ret;
+}
+
+#ifdef CONFIG_SCHED_DEBUG
+static const char *scx_ops_enable_state_str[] = {
+ [SCX_OPS_PREPPING] = "prepping",
+ [SCX_OPS_ENABLING] = "enabling",
+ [SCX_OPS_ENABLED] = "enabled",
+ [SCX_OPS_DISABLING] = "disabling",
+ [SCX_OPS_DISABLED] = "disabled",
+};
+
+static int scx_debug_show(struct seq_file *m, void *v)
+{
+ mutex_lock(&scx_ops_enable_mutex);
+ seq_printf(m, "%-30s: %s\n", "ops", scx_ops.name);
+ seq_printf(m, "%-30s: %ld\n", "enabled", scx_enabled());
+ seq_printf(m, "%-30s: %s\n", "enable_state",
+ scx_ops_enable_state_str[scx_ops_enable_state()]);
+ seq_printf(m, "%-30s: %llu\n", "nr_rejected",
+ atomic64_read(&scx_nr_rejected));
+ mutex_unlock(&scx_ops_enable_mutex);
+ return 0;
+}
+
+static int scx_debug_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, scx_debug_show, NULL);
+}
+
+const struct file_operations sched_ext_fops = {
+ .open = scx_debug_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+#endif
+
+/********************************************************************************
+ * bpf_struct_ops plumbing.
+ */
+#include <linux/bpf_verifier.h>
+#include <linux/bpf.h>
+#include <linux/btf.h>
+
+extern struct btf *btf_vmlinux;
+static const struct btf_type *task_struct_type;
+
+static bool bpf_scx_is_valid_access(int off, int size,
+ enum bpf_access_type type,
+ const struct bpf_prog *prog,
+ struct bpf_insn_access_aux *info)
+{
+ if (off < 0 || off >= sizeof(__u64) * MAX_BPF_FUNC_ARGS)
+ return false;
+ if (type != BPF_READ)
+ return false;
+ if (off % size != 0)
+ return false;
+
+ return btf_ctx_access(off, size, type, prog, info);
+}
+
+static int bpf_scx_btf_struct_access(struct bpf_verifier_log *log,
+ const struct bpf_reg_state *reg, int off,
+ int size, enum bpf_access_type atype,
+ u32 *next_btf_id, enum bpf_type_flag *flag)
+{
+ const struct btf_type *t;
+
+ t = btf_type_by_id(reg->btf, reg->btf_id);
+ if (t == task_struct_type) {
+ if (off >= offsetof(struct task_struct, scx.slice) &&
+ off + size <= offsetofend(struct task_struct, scx.slice))
+ return SCALAR_VALUE;
+ }
+
+ if (atype == BPF_READ)
+ return btf_struct_access(log, reg, off, size, atype,
+ next_btf_id, flag);
+
+ bpf_log(log, "only read is supported\n");
+ return -EACCES;
+}
+
+static const struct bpf_func_proto *
+bpf_scx_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+ switch (func_id) {
+ case BPF_FUNC_task_storage_get:
+ return &bpf_task_storage_get_proto;
+ case BPF_FUNC_task_storage_delete:
+ return &bpf_task_storage_delete_proto;
+ default:
+ return bpf_base_func_proto(func_id);
+ }
+}
+
+const struct bpf_verifier_ops bpf_scx_verifier_ops = {
+ .get_func_proto = bpf_scx_get_func_proto,
+ .is_valid_access = bpf_scx_is_valid_access,
+ .btf_struct_access = bpf_scx_btf_struct_access,
+};
+
+static int bpf_scx_init_member(const struct btf_type *t,
+ const struct btf_member *member,
+ void *kdata, const void *udata)
+{
+ const struct sched_ext_ops *uops = udata;
+ struct sched_ext_ops *ops = kdata;
+ u32 moff = __btf_member_bit_offset(t, member) / 8;
+ int ret;
+
+ switch (moff) {
+ case offsetof(struct sched_ext_ops, dispatch_max_batch):
+ if (*(u32 *)(udata + moff) > INT_MAX)
+ return -E2BIG;
+ ops->dispatch_max_batch = *(u32 *)(udata + moff);
+ return 1;
+ case offsetof(struct sched_ext_ops, flags):
+ if (*(u64 *)(udata + moff) & ~SCX_OPS_ALL_FLAGS)
+ return -EINVAL;
+ ops->flags = *(u64 *)(udata + moff);
+ return 1;
+ case offsetof(struct sched_ext_ops, name):
+ ret = bpf_obj_name_cpy(ops->name, uops->name,
+ sizeof(ops->name));
+ if (ret < 0)
+ return ret;
+ if (ret == 0)
+ return -EINVAL;
+ return 1;
+ }
+
+ return 0;
+}
+
+static int bpf_scx_check_member(const struct btf_type *t,
+ const struct btf_member *member,
+ const struct bpf_prog *prog)
+{
+ u32 moff = __btf_member_bit_offset(t, member) / 8;
+
+ switch (moff) {
+ case offsetof(struct sched_ext_ops, prep_enable):
+ case offsetof(struct sched_ext_ops, init):
+ case offsetof(struct sched_ext_ops, exit):
+ break;
+ default:
+ if (prog->aux->sleepable)
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static int bpf_scx_reg(void *kdata)
+{
+ return scx_ops_enable(kdata);
+}
+
+static void bpf_scx_unreg(void *kdata)
+{
+ scx_ops_disable(SCX_EXIT_UNREG);
+ kthread_flush_work(&scx_ops_disable_work);
+}
+
+static int bpf_scx_init(struct btf *btf)
+{
+ u32 type_id;
+
+ type_id = btf_find_by_name_kind(btf, "task_struct", BTF_KIND_STRUCT);
+ if (type_id < 0)
+ return -EINVAL;
+ task_struct_type = btf_type_by_id(btf, type_id);
+
+ return 0;
+}
+
+/* "extern" to avoid sparse warning, only used in this file */
+extern struct bpf_struct_ops bpf_sched_ext_ops;
+
+struct bpf_struct_ops bpf_sched_ext_ops = {
+ .verifier_ops = &bpf_scx_verifier_ops,
+ .reg = bpf_scx_reg,
+ .unreg = bpf_scx_unreg,
+ .check_member = bpf_scx_check_member,
+ .init_member = bpf_scx_init_member,
+ .init = bpf_scx_init,
+ .name = "sched_ext_ops",
+};
+
+void __init init_sched_ext_class(void)
+{
+ int cpu;
+ u32 v;
+
+ /*
+ * The following is to prevent the compiler from optimizing out the enum
+ * definitions so that BPF scheduler implementations can use them
+ * through the generated vmlinux.h.
+ */
+ WRITE_ONCE(v, SCX_WAKE_EXEC | SCX_ENQ_WAKEUP | SCX_DEQ_SLEEP);
+
+ BUG_ON(rhashtable_init(&dsq_hash, &dsq_hash_params));
+ init_dsq(&scx_dsq_global, SCX_DSQ_GLOBAL);
+#ifdef CONFIG_SMP
+ BUG_ON(!alloc_cpumask_var(&idle_masks.cpu, GFP_KERNEL));
+ BUG_ON(!alloc_cpumask_var(&idle_masks.smt, GFP_KERNEL));
+#endif
+ for_each_possible_cpu(cpu) {
+ struct rq *rq = cpu_rq(cpu);
+
+ init_dsq(&rq->scx.local_dsq, SCX_DSQ_LOCAL);
+ }
+}
+
+
+/********************************************************************************
+ * Helpers that can be called from the BPF scheduler.
+ */
+#include <linux/btf_ids.h>
+
+/* Disables missing prototype warnings for kfuncs */
+__diag_push();
+__diag_ignore_all("-Wmissing-prototypes",
+ "Global functions as their definitions will be in vmlinux BTF");
+
+/**
+ * scx_bpf_create_dsq - Create a custom DSQ
+ * @dsq_id: DSQ to create
+ * @node: NUMA node to allocate from
+ *
+ * Create a custom DSQ identified by @dsq_id. Can be called from ops.init() and
+ * ops.prep_enable().
+ */
+s32 scx_bpf_create_dsq(u64 dsq_id, s32 node)
+{
+ if (!scx_kf_allowed(SCX_KF_SLEEPABLE))
+ return -EINVAL;
+
+ if (unlikely(node >= (int)nr_node_ids ||
+ (node < 0 && node != NUMA_NO_NODE)))
+ return -EINVAL;
+ return PTR_ERR_OR_ZERO(create_dsq(dsq_id, node));
+}
+
+BTF_SET8_START(scx_kfunc_ids_sleepable)
+BTF_ID_FLAGS(func, scx_bpf_create_dsq, KF_SLEEPABLE)
+BTF_SET8_END(scx_kfunc_ids_sleepable)
+
+static const struct btf_kfunc_id_set scx_kfunc_set_sleepable = {
+ .owner = THIS_MODULE,
+ .set = &scx_kfunc_ids_sleepable,
+};
+
+/**
+ * scx_bpf_dispatch - Dispatch a task to a DSQ
+ * @p: task_struct to dispatch
+ * @dsq_id: DSQ to dispatch to
+ * @slice: duration @p can run for in nsecs
+ * @enq_flags: SCX_ENQ_*
+ *
+ * Dispatch @p to the DSQ identified by @dsq_id. It is safe to call this
+ * function spuriously. Can be called from ops.enqueue() and ops.dispatch().
+ *
+ * When called from ops.enqueue(), it's for direct dispatch and @p must match
+ * the task being enqueued. Also, %SCX_DSQ_LOCAL_ON can't be used to target the
+ * local DSQ of a CPU other than the enqueueing one. Use ops.select_cpu() to be
+ * on the target CPU in the first place.
+ *
+ * When called from ops.dispatch(), there are no restrictions on @p or @dsq_id
+ * and this function can be called upto ops.dispatch_max_batch times to dispatch
+ * multiple tasks. scx_bpf_dispatch_nr_slots() returns the number of the
+ * remaining slots. scx_bpf_consume() flushes the batch and resets the counter.
+ *
+ * This function doesn't have any locking restrictions and may be called under
+ * BPF locks (in the future when BPF introduces more flexible locking).
+ *
+ * @p is allowed to run for @slice. The scheduling path is triggered on slice
+ * exhaustion. If zero, the current residual slice is maintained. If
+ * %SCX_SLICE_INF, @p never expires and the BPF scheduler must kick the CPU with
+ * scx_bpf_kick_cpu() to trigger scheduling.
+ */
+void scx_bpf_dispatch(struct task_struct *p, u64 dsq_id, u64 slice,
+ u64 enq_flags)
+{
+ struct task_struct *ddsp_task;
+ int idx;
+
+ if (!scx_kf_allowed(SCX_KF_ENQUEUE_DISPATCH))
+ return;
+
+ lockdep_assert_irqs_disabled();
+
+ if (unlikely(!p)) {
+ scx_ops_error("called with NULL task");
+ return;
+ }
+
+ if (unlikely(enq_flags & __SCX_ENQ_INTERNAL_MASK)) {
+ scx_ops_error("invalid enq_flags 0x%llx", enq_flags);
+ return;
+ }
+
+ if (slice)
+ p->scx.slice = slice;
+ else
+ p->scx.slice = p->scx.slice ?: 1;
+
+ ddsp_task = __this_cpu_read(direct_dispatch_task);
+ if (ddsp_task) {
+ direct_dispatch(ddsp_task, p, dsq_id, enq_flags);
+ return;
+ }
+
+ idx = __this_cpu_read(scx_dsp_ctx.buf_cursor);
+ if (unlikely(idx >= scx_dsp_max_batch)) {
+ scx_ops_error("dispatch buffer overflow");
+ return;
+ }
+
+ this_cpu_ptr(scx_dsp_buf)[idx] = (struct scx_dsp_buf_ent){
+ .task = p,
+ .qseq = atomic64_read(&p->scx.ops_state) & SCX_OPSS_QSEQ_MASK,
+ .dsq_id = dsq_id,
+ .enq_flags = enq_flags,
+ };
+ __this_cpu_inc(scx_dsp_ctx.buf_cursor);
+}
+
+BTF_SET8_START(scx_kfunc_ids_enqueue_dispatch)
+BTF_ID_FLAGS(func, scx_bpf_dispatch)
+BTF_SET8_END(scx_kfunc_ids_enqueue_dispatch)
+
+static const struct btf_kfunc_id_set scx_kfunc_set_enqueue_dispatch = {
+ .owner = THIS_MODULE,
+ .set = &scx_kfunc_ids_enqueue_dispatch,
+};
+
+/**
+ * scx_bpf_dispatch_nr_slots - Return the number of remaining dispatch slots
+ *
+ * Can only be called from ops.dispatch().
+ */
+u32 scx_bpf_dispatch_nr_slots(void)
+{
+ if (!scx_kf_allowed(SCX_KF_DISPATCH))
+ return 0;
+
+ return scx_dsp_max_batch - __this_cpu_read(scx_dsp_ctx.buf_cursor);
+}
+
+/**
+ * scx_bpf_consume - Transfer a task from a DSQ to the current CPU's local DSQ
+ * @dsq_id: DSQ to consume
+ *
+ * Consume a task from the non-local DSQ identified by @dsq_id and transfer it
+ * to the current CPU's local DSQ for execution. Can only be called from
+ * ops.dispatch().
+ *
+ * This function flushes the in-flight dispatches from scx_bpf_dispatch() before
+ * trying to consume the specified DSQ. It may also grab rq locks and thus can't
+ * be called under any BPF locks.
+ *
+ * Returns %true if a task has been consumed, %false if there isn't any task to
+ * consume.
+ */
+bool scx_bpf_consume(u64 dsq_id)
+{
+ struct scx_dsp_ctx *dspc = this_cpu_ptr(&scx_dsp_ctx);
+ struct scx_dispatch_q *dsq;
+
+ if (!scx_kf_allowed(SCX_KF_DISPATCH))
+ return false;
+
+ flush_dispatch_buf(dspc->rq, dspc->rf);
+
+ dsq = find_non_local_dsq(dsq_id);
+ if (unlikely(!dsq)) {
+ scx_ops_error("invalid DSQ ID 0x%016llx", dsq_id);
+ return false;
+ }
+
+ if (consume_dispatch_q(dspc->rq, dspc->rf, dsq)) {
+ /*
+ * A successfully consumed task can be dequeued before it starts
+ * running while the CPU is trying to migrate other dispatched
+ * tasks. Bump nr_tasks to tell balance_scx() to retry on empty
+ * local DSQ.
+ */
+ dspc->nr_tasks++;
+ return true;
+ } else {
+ return false;
+ }
+}
+
+BTF_SET8_START(scx_kfunc_ids_dispatch)
+BTF_ID_FLAGS(func, scx_bpf_dispatch_nr_slots)
+BTF_ID_FLAGS(func, scx_bpf_consume)
+BTF_SET8_END(scx_kfunc_ids_dispatch)
+
+static const struct btf_kfunc_id_set scx_kfunc_set_dispatch = {
+ .owner = THIS_MODULE,
+ .set = &scx_kfunc_ids_dispatch,
+};
+
+/**
+ * scx_bpf_dsq_nr_queued - Return the number of queued tasks
+ * @dsq_id: id of the DSQ
+ *
+ * Return the number of tasks in the DSQ matching @dsq_id. If not found,
+ * -%ENOENT is returned. Can be called from any non-sleepable online scx_ops
+ * operations.
+ */
+s32 scx_bpf_dsq_nr_queued(u64 dsq_id)
+{
+ struct scx_dispatch_q *dsq;
+
+ lockdep_assert(rcu_read_lock_any_held());
+
+ if (dsq_id == SCX_DSQ_LOCAL) {
+ return this_rq()->scx.local_dsq.nr;
+ } else if ((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON) {
+ s32 cpu = dsq_id & SCX_DSQ_LOCAL_CPU_MASK;
+
+ if (ops_cpu_valid(cpu))
+ return cpu_rq(cpu)->scx.local_dsq.nr;
+ } else {
+ dsq = find_non_local_dsq(dsq_id);
+ if (dsq)
+ return dsq->nr;
+ }
+ return -ENOENT;
+}
+
+/**
+ * scx_bpf_test_and_clear_cpu_idle - Test and clear @cpu's idle state
+ * @cpu: cpu to test and clear idle for
+ *
+ * Returns %true if @cpu was idle and its idle state was successfully cleared.
+ * %false otherwise.
+ *
+ * Unavailable if ops.update_idle() is implemented and
+ * %SCX_OPS_KEEP_BUILTIN_IDLE is not set.
+ */
+bool scx_bpf_test_and_clear_cpu_idle(s32 cpu)
+{
+ if (!static_branch_likely(&scx_builtin_idle_enabled)) {
+ scx_ops_error("built-in idle tracking is disabled");
+ return false;
+ }
+
+ if (ops_cpu_valid(cpu))
+ return test_and_clear_cpu_idle(cpu);
+ else
+ return false;
+}
+
+/**
+ * scx_bpf_pick_idle_cpu - Pick and claim an idle cpu
+ * @cpus_allowed: Allowed cpumask
+ *
+ * Pick and claim an idle cpu which is also in @cpus_allowed. Returns the picked
+ * idle cpu number on success. -%EBUSY if no matching cpu was found.
+ *
+ * Unavailable if ops.update_idle() is implemented and
+ * %SCX_OPS_KEEP_BUILTIN_IDLE is not set.
+ */
+s32 scx_bpf_pick_idle_cpu(const struct cpumask *cpus_allowed)
+{
+ if (!static_branch_likely(&scx_builtin_idle_enabled)) {
+ scx_ops_error("built-in idle tracking is disabled");
+ return -EBUSY;
+ }
+
+ return scx_pick_idle_cpu(cpus_allowed);
+}
+
+/**
+ * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking
+ * per-CPU cpumask.
+ *
+ * Returns NULL if idle tracking is not enabled, or running on a UP kernel.
+ */
+const struct cpumask *scx_bpf_get_idle_cpumask(void)
+{
+ if (!static_branch_likely(&scx_builtin_idle_enabled)) {
+ scx_ops_error("built-in idle tracking is disabled");
+ return cpu_none_mask;
+ }
+
+#ifdef CONFIG_SMP
+ return idle_masks.cpu;
+#else
+ return cpu_none_mask;
+#endif
+}
+
+/**
+ * scx_bpf_get_idle_smtmask - Get a referenced kptr to the idle-tracking,
+ * per-physical-core cpumask. Can be used to determine if an entire physical
+ * core is free.
+ *
+ * Returns NULL if idle tracking is not enabled, or running on a UP kernel.
+ */
+const struct cpumask *scx_bpf_get_idle_smtmask(void)
+{
+ if (!static_branch_likely(&scx_builtin_idle_enabled)) {
+ scx_ops_error("built-in idle tracking is disabled");
+ return cpu_none_mask;
+ }
+
+#ifdef CONFIG_SMP
+ return idle_masks.smt;
+#else
+ return cpu_none_mask;
+#endif
+}
+
+/**
+ * scx_bpf_put_idle_cpumask - Release a previously acquired referenced kptr to
+ * either the percpu, or SMT idle-tracking cpumask.
+ */
+void scx_bpf_put_idle_cpumask(const struct cpumask *idle_mask)
+{
+ /*
+ * Empty function body because we aren't actually acquiring or
+ * releasing a reference to a global idle cpumask, which is read-only
+ * in the caller and is never released. The acquire / release semantics
+ * here are just used to make the cpumask is a trusted pointer in the
+ * caller.
+ */
+}
+
+struct scx_bpf_error_bstr_bufs {
+ u64 data[MAX_BPRINTF_VARARGS];
+ char msg[SCX_EXIT_MSG_LEN];
+};
+
+static DEFINE_PER_CPU(struct scx_bpf_error_bstr_bufs, scx_bpf_error_bstr_bufs);
+
+/**
+ * scx_bpf_error_bstr - Indicate fatal error
+ * @fmt: error message format string
+ * @data: format string parameters packaged using ___bpf_fill() macro
+ * @data__sz: @data len, must end in '__sz' for the verifier
+ *
+ * Indicate that the BPF scheduler encountered a fatal error and initiate ops
+ * disabling.
+ */
+void scx_bpf_error_bstr(char *fmt, unsigned long long *data, u32 data__sz)
+{
+ struct bpf_bprintf_data bprintf_data = { .get_bin_args = true };
+ struct scx_bpf_error_bstr_bufs *bufs;
+ unsigned long flags;
+ int ret;
+
+ local_irq_save(flags);
+ bufs = this_cpu_ptr(&scx_bpf_error_bstr_bufs);
+
+ if (data__sz % 8 || data__sz > MAX_BPRINTF_VARARGS * 8 ||
+ (data__sz && !data)) {
+ scx_ops_error("invalid data=%p and data__sz=%u",
+ (void *)data, data__sz);
+ goto out_restore;
+ }
+
+ ret = copy_from_kernel_nofault(bufs->data, data, data__sz);
+ if (ret) {
+ scx_ops_error("failed to read data fields (%d)", ret);
+ goto out_restore;
+ }
+
+ ret = bpf_bprintf_prepare(fmt, UINT_MAX, bufs->data, data__sz / 8,
+ &bprintf_data);
+ if (ret < 0) {
+ scx_ops_error("failed to format prepration (%d)", ret);
+ goto out_restore;
+ }
+
+ ret = bstr_printf(bufs->msg, sizeof(bufs->msg), fmt,
+ bprintf_data.bin_args);
+ bpf_bprintf_cleanup(&bprintf_data);
+ if (ret < 0) {
+ scx_ops_error("scx_ops_error(\"%s\", %p, %u) failed to format",
+ fmt, data, data__sz);
+ goto out_restore;
+ }
+
+ scx_ops_error_type(SCX_EXIT_ERROR_BPF, "%s", bufs->msg);
+out_restore:
+ local_irq_restore(flags);
+}
+
+/**
+ * scx_bpf_destroy_dsq - Destroy a custom DSQ
+ * @dsq_id: DSQ to destroy
+ *
+ * Destroy the custom DSQ identified by @dsq_id. Only DSQs created with
+ * scx_bpf_create_dsq() can be destroyed. The caller must ensure that the DSQ is
+ * empty and no further tasks are dispatched to it. Ignored if called on a DSQ
+ * which doesn't exist. Can be called from any online scx_ops operations.
+ */
+void scx_bpf_destroy_dsq(u64 dsq_id)
+{
+ destroy_dsq(dsq_id);
+}
+
+/**
+ * scx_bpf_task_running - Is task currently running?
+ * @p: task of interest
+ */
+bool scx_bpf_task_running(const struct task_struct *p)
+{
+ return task_rq(p)->curr == p;
+}
+
+/**
+ * scx_bpf_task_cpu - CPU a task is currently associated with
+ * @p: task of interest
+ */
+s32 scx_bpf_task_cpu(const struct task_struct *p)
+{
+ return task_cpu(p);
+}
+
+BTF_SET8_START(scx_kfunc_ids_any)
+BTF_ID_FLAGS(func, scx_bpf_dsq_nr_queued)
+BTF_ID_FLAGS(func, scx_bpf_test_and_clear_cpu_idle)
+BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask, KF_ACQUIRE)
+BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask, KF_ACQUIRE)
+BTF_ID_FLAGS(func, scx_bpf_put_idle_cpumask, KF_RELEASE)
+BTF_ID_FLAGS(func, scx_bpf_error_bstr, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, scx_bpf_destroy_dsq)
+BTF_ID_FLAGS(func, scx_bpf_task_running, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, scx_bpf_task_cpu, KF_TRUSTED_ARGS)
+BTF_SET8_END(scx_kfunc_ids_any)
+
+static const struct btf_kfunc_id_set scx_kfunc_set_any = {
+ .owner = THIS_MODULE,
+ .set = &scx_kfunc_ids_any,
+};
+
+__diag_pop();
+
+/*
+ * This can't be done from init_sched_ext_class() as register_btf_kfunc_id_set()
+ * needs most of the system to be up.
+ */
+static int __init register_ext_kfuncs(void)
+{
+ int ret;
+
+ /*
+ * Some kfuncs are context-sensitive and can only be called from
+ * specific SCX ops. They are grouped into BTF sets accordingly.
+ * Unfortunately, BPF currently doesn't have a way of enforcing such
+ * restrictions. Eventually, the verifier should be able to enforce
+ * them. For now, register them the same and make each kfunc explicitly
+ * check using scx_kf_allowed().
+ */
+ if ((ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
+ &scx_kfunc_set_sleepable)) ||
+ (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
+ &scx_kfunc_set_enqueue_dispatch)) ||
+ (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
+ &scx_kfunc_set_dispatch)) ||
+ (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
+ &scx_kfunc_set_any))) {
+ pr_err("sched_ext: failed to register kfunc sets (%d)\n", ret);
+ return ret;
+ }
+
+ return 0;
+}
+__initcall(register_ext_kfuncs);
diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h
index 6a93c4825339..f8d5682deacf 100644
--- a/kernel/sched/ext.h
+++ b/kernel/sched/ext.h
@@ -1,7 +1,94 @@
/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2022 Tejun Heo <[email protected]>
+ * Copyright (c) 2022 David Vernet <[email protected]>
+ */
+enum scx_wake_flags {
+ /* expose select WF_* flags as enums */
+ SCX_WAKE_EXEC = WF_EXEC,
+ SCX_WAKE_FORK = WF_FORK,
+ SCX_WAKE_TTWU = WF_TTWU,
+ SCX_WAKE_SYNC = WF_SYNC,
+};
+
+enum scx_enq_flags {
+ /* expose select ENQUEUE_* flags as enums */
+ SCX_ENQ_WAKEUP = ENQUEUE_WAKEUP,
+ SCX_ENQ_HEAD = ENQUEUE_HEAD,
+
+ /* high 32bits are SCX specific */
+
+ /*
+ * The task being enqueued is the only task available for the cpu. By
+ * default, ext core keeps executing such tasks but when
+ * %SCX_OPS_ENQ_LAST is specified, they're ops.enqueue()'d with
+ * %SCX_ENQ_LAST and %SCX_ENQ_LOCAL flags set.
+ *
+ * If the BPF scheduler wants to continue executing the task,
+ * ops.enqueue() should dispatch the task to %SCX_DSQ_LOCAL immediately.
+ * If the task gets queued on a different dsq or the BPF side, the BPF
+ * scheduler is responsible for triggering a follow-up scheduling event.
+ * Otherwise, Execution may stall.
+ */
+ SCX_ENQ_LAST = 1LLU << 41,
+
+ /*
+ * A hint indicating that it's advisable to enqueue the task on the
+ * local dsq of the currently selected CPU. Currently used by
+ * select_cpu_dfl() and together with %SCX_ENQ_LAST.
+ */
+ SCX_ENQ_LOCAL = 1LLU << 42,
+
+ /* high 8 bits are internal */
+ __SCX_ENQ_INTERNAL_MASK = 0xffLLU << 56,
+
+ SCX_ENQ_CLEAR_OPSS = 1LLU << 56,
+};
+
+enum scx_deq_flags {
+ /* expose select DEQUEUE_* flags as enums */
+ SCX_DEQ_SLEEP = DEQUEUE_SLEEP,
+};

#ifdef CONFIG_SCHED_CLASS_EXT
-#error "NOT IMPLEMENTED YET"
+
+extern const struct sched_class ext_sched_class;
+extern const struct bpf_verifier_ops bpf_sched_ext_verifier_ops;
+extern const struct file_operations sched_ext_fops;
+
+DECLARE_STATIC_KEY_FALSE(__scx_ops_enabled);
+#define scx_enabled() static_branch_unlikely(&__scx_ops_enabled)
+
+bool task_on_scx(struct task_struct *p);
+void scx_pre_fork(struct task_struct *p);
+int scx_fork(struct task_struct *p);
+void scx_post_fork(struct task_struct *p);
+void scx_cancel_fork(struct task_struct *p);
+void init_sched_ext_class(void);
+
+static inline const struct sched_class *next_active_class(const struct sched_class *class)
+{
+ class++;
+ if (!scx_enabled() && class == &ext_sched_class)
+ class++;
+ return class;
+}
+
+#define for_active_class_range(class, _from, _to) \
+ for (class = (_from); class != (_to); class = next_active_class(class))
+
+#define for_each_active_class(class) \
+ for_active_class_range(class, __sched_class_highest, __sched_class_lowest)
+
+/*
+ * SCX requires a balance() call before every pick_next_task() call including
+ * when waking up from idle.
+ */
+#define for_balance_class_range(class, prev_class, end_class) \
+ for_active_class_range(class, (prev_class) > &ext_sched_class ? \
+ &ext_sched_class : (prev_class), (end_class))
+
#else /* CONFIG_SCHED_CLASS_EXT */

#define scx_enabled() false
@@ -18,7 +105,13 @@ static inline void init_sched_ext_class(void) {}
#endif /* CONFIG_SCHED_CLASS_EXT */

#if defined(CONFIG_SCHED_CLASS_EXT) && defined(CONFIG_SMP)
-#error "NOT IMPLEMENTED YET"
+void __scx_update_idle(struct rq *rq, bool idle);
+
+static inline void scx_update_idle(struct rq *rq, bool idle)
+{
+ if (scx_enabled())
+ __scx_update_idle(rq, idle);
+}
#else
static inline void scx_update_idle(struct rq *rq, bool idle) {}
#endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d57d17d8ea6b..7778f4ce6b5b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -185,6 +185,10 @@ static inline int idle_policy(int policy)

static inline int normal_policy(int policy)
{
+#ifdef CONFIG_SCHED_CLASS_EXT
+ if (policy == SCHED_EXT)
+ return true;
+#endif
return policy == SCHED_NORMAL;
}

@@ -680,6 +684,14 @@ struct cfs_rq {
#endif /* CONFIG_FAIR_GROUP_SCHED */
};

+#ifdef CONFIG_SCHED_CLASS_EXT
+struct scx_rq {
+ struct scx_dispatch_q local_dsq;
+ u64 ops_qseq;
+ u32 nr_running;
+};
+#endif /* CONFIG_SCHED_CLASS_EXT */
+
static inline int rt_bandwidth_enabled(void)
{
return sysctl_sched_rt_runtime >= 0;
@@ -1021,6 +1033,9 @@ struct rq {
struct cfs_rq cfs;
struct rt_rq rt;
struct dl_rq dl;
+#ifdef CONFIG_SCHED_CLASS_EXT
+ struct scx_rq scx;
+#endif

#ifdef CONFIG_FAIR_GROUP_SCHED
/* list of leaf cfs_rq on this CPU: */
--
2.39.1


2023-01-28 00:18:51

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 14/30] sched_ext: Add scx_example_dummy and scx_example_qmap example schedulers

Add two simple example BPF schedulers - dummy and qmap.

* dummy: In terms of scheduling, it behaves identical to not having any
operation implemented at all. The two operations it implements are only to
improve visibility and exit handling. On certain homogeneous
configurations, this actually can perform pretty well.

* qmap: A fixed five level priority scheduler to demonstrate queueing PIDs
on BPF maps for scheduling. While not very practical, this is useful as a
simple example and will be used to demonstrate different features.

v2: Updated with the generic BPF cpumask helpers.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
tools/sched_ext/.gitignore | 5 +
tools/sched_ext/Makefile | 188 +++++++++++++++++++
tools/sched_ext/gnu/stubs.h | 1 +
tools/sched_ext/scx_common.bpf.h | 131 +++++++++++++
tools/sched_ext/scx_example_dummy.bpf.c | 56 ++++++
tools/sched_ext/scx_example_dummy.c | 93 +++++++++
tools/sched_ext/scx_example_qmap.bpf.c | 238 ++++++++++++++++++++++++
tools/sched_ext/scx_example_qmap.c | 84 +++++++++
tools/sched_ext/user_exit_info.h | 50 +++++
9 files changed, 846 insertions(+)
create mode 100644 tools/sched_ext/.gitignore
create mode 100644 tools/sched_ext/Makefile
create mode 100644 tools/sched_ext/gnu/stubs.h
create mode 100644 tools/sched_ext/scx_common.bpf.h
create mode 100644 tools/sched_ext/scx_example_dummy.bpf.c
create mode 100644 tools/sched_ext/scx_example_dummy.c
create mode 100644 tools/sched_ext/scx_example_qmap.bpf.c
create mode 100644 tools/sched_ext/scx_example_qmap.c
create mode 100644 tools/sched_ext/user_exit_info.h

diff --git a/tools/sched_ext/.gitignore b/tools/sched_ext/.gitignore
new file mode 100644
index 000000000000..6734f7fd9324
--- /dev/null
+++ b/tools/sched_ext/.gitignore
@@ -0,0 +1,5 @@
+scx_example_dummy
+scx_example_qmap
+*.skel.h
+*.subskel.h
+/tools/
diff --git a/tools/sched_ext/Makefile b/tools/sched_ext/Makefile
new file mode 100644
index 000000000000..926b0a36c221
--- /dev/null
+++ b/tools/sched_ext/Makefile
@@ -0,0 +1,188 @@
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+include ../build/Build.include
+include ../scripts/Makefile.arch
+include ../scripts/Makefile.include
+
+ifneq ($(LLVM),)
+ifneq ($(filter %/,$(LLVM)),)
+LLVM_PREFIX := $(LLVM)
+else ifneq ($(filter -%,$(LLVM)),)
+LLVM_SUFFIX := $(LLVM)
+endif
+
+CLANG_TARGET_FLAGS_arm := arm-linux-gnueabi
+CLANG_TARGET_FLAGS_arm64 := aarch64-linux-gnu
+CLANG_TARGET_FLAGS_hexagon := hexagon-linux-musl
+CLANG_TARGET_FLAGS_m68k := m68k-linux-gnu
+CLANG_TARGET_FLAGS_mips := mipsel-linux-gnu
+CLANG_TARGET_FLAGS_powerpc := powerpc64le-linux-gnu
+CLANG_TARGET_FLAGS_riscv := riscv64-linux-gnu
+CLANG_TARGET_FLAGS_s390 := s390x-linux-gnu
+CLANG_TARGET_FLAGS_x86 := x86_64-linux-gnu
+CLANG_TARGET_FLAGS := $(CLANG_TARGET_FLAGS_$(ARCH))
+
+ifeq ($(CROSS_COMPILE),)
+ifeq ($(CLANG_TARGET_FLAGS),)
+$(error Specify CROSS_COMPILE or add '--target=' option to lib.mk
+else
+CLANG_FLAGS += --target=$(CLANG_TARGET_FLAGS)
+endif # CLANG_TARGET_FLAGS
+else
+CLANG_FLAGS += --target=$(notdir $(CROSS_COMPILE:%-=%))
+endif # CROSS_COMPILE
+
+CC := $(LLVM_PREFIX)clang$(LLVM_SUFFIX) $(CLANG_FLAGS) -fintegrated-as
+else
+CC := $(CROSS_COMPILE)gcc
+endif # LLVM
+
+CURDIR := $(abspath .)
+TOOLSDIR := $(abspath ..)
+LIBDIR := $(TOOLSDIR)/lib
+BPFDIR := $(LIBDIR)/bpf
+TOOLSINCDIR := $(TOOLSDIR)/include
+BPFTOOLDIR := $(TOOLSDIR)/bpf/bpftool
+APIDIR := $(TOOLSINCDIR)/uapi
+GENDIR := $(abspath ../../include/generated)
+GENHDR := $(GENDIR)/autoconf.h
+
+SCRATCH_DIR := $(CURDIR)/tools
+BUILD_DIR := $(SCRATCH_DIR)/build
+INCLUDE_DIR := $(SCRATCH_DIR)/include
+BPFOBJ_DIR := $(BUILD_DIR)/libbpf
+BPFOBJ := $(BPFOBJ_DIR)/libbpf.a
+ifneq ($(CROSS_COMPILE),)
+HOST_BUILD_DIR := $(BUILD_DIR)/host
+HOST_SCRATCH_DIR := host-tools
+HOST_INCLUDE_DIR := $(HOST_SCRATCH_DIR)/include
+else
+HOST_BUILD_DIR := $(BUILD_DIR)
+HOST_SCRATCH_DIR := $(SCRATCH_DIR)
+HOST_INCLUDE_DIR := $(INCLUDE_DIR)
+endif
+HOST_BPFOBJ := $(HOST_BUILD_DIR)/libbpf/libbpf.a
+RESOLVE_BTFIDS := $(HOST_BUILD_DIR)/resolve_btfids/resolve_btfids
+DEFAULT_BPFTOOL := $(HOST_SCRATCH_DIR)/sbin/bpftool
+
+VMLINUX_BTF_PATHS ?= $(if $(O),$(O)/vmlinux) \
+ $(if $(KBUILD_OUTPUT),$(KBUILD_OUTPUT)/vmlinux) \
+ ../../vmlinux \
+ /sys/kernel/btf/vmlinux \
+ /boot/vmlinux-$(shell uname -r)
+VMLINUX_BTF ?= $(abspath $(firstword $(wildcard $(VMLINUX_BTF_PATHS))))
+ifeq ($(VMLINUX_BTF),)
+$(error Cannot find a vmlinux for VMLINUX_BTF at any of "$(VMLINUX_BTF_PATHS)")
+endif
+
+BPFTOOL ?= $(DEFAULT_BPFTOOL)
+
+ifneq ($(wildcard $(GENHDR)),)
+ GENFLAGS := -DHAVE_GENHDR
+endif
+
+CFLAGS += -g -O2 -rdynamic -pthread -Wall -Werror $(GENFLAGS) \
+ -I$(INCLUDE_DIR) -I$(GENDIR) -I$(LIBDIR) \
+ -I$(TOOLSINCDIR) -I$(APIDIR)
+
+# Silence some warnings when compiled with clang
+ifneq ($(LLVM),)
+CFLAGS += -Wno-unused-command-line-argument
+endif
+
+LDFLAGS = -lelf -lz -lpthread
+
+IS_LITTLE_ENDIAN = $(shell $(CC) -dM -E - </dev/null | \
+ grep 'define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__')
+
+# Get Clang's default includes on this system, as opposed to those seen by
+# '-target bpf'. This fixes "missing" files on some architectures/distros,
+# such as asm/byteorder.h, asm/socket.h, asm/sockios.h, sys/cdefs.h etc.
+#
+# Use '-idirafter': Don't interfere with include mechanics except where the
+# build would have failed anyways.
+define get_sys_includes
+$(shell $(1) -v -E - </dev/null 2>&1 \
+ | sed -n '/<...> search starts here:/,/End of search list./{ s| \(/.*\)|-idirafter \1|p }') \
+$(shell $(1) -dM -E - </dev/null | grep '__riscv_xlen ' | awk '{printf("-D__riscv_xlen=%d -D__BITS_PER_LONG=%d", $$3, $$3)}')
+endef
+
+BPF_CFLAGS = -g -D__TARGET_ARCH_$(SRCARCH) \
+ $(if $(IS_LITTLE_ENDIAN),-mlittle-endian,-mbig-endian) \
+ -I$(INCLUDE_DIR) -I$(CURDIR) -I$(APIDIR) \
+ -I../../include \
+ $(call get_sys_includes,$(CLANG)) \
+ -Wno-compare-distinct-pointer-types \
+ -O2 -mcpu=v3
+
+all: scx_example_dummy scx_example_qmap
+
+# sort removes libbpf duplicates when not cross-building
+MAKE_DIRS := $(sort $(BUILD_DIR)/libbpf $(HOST_BUILD_DIR)/libbpf \
+ $(HOST_BUILD_DIR)/bpftool $(HOST_BUILD_DIR)/resolve_btfids \
+ $(INCLUDE_DIR))
+
+$(MAKE_DIRS):
+ $(call msg,MKDIR,,$@)
+ $(Q)mkdir -p $@
+
+$(BPFOBJ): $(wildcard $(BPFDIR)/*.[ch] $(BPFDIR)/Makefile) \
+ $(APIDIR)/linux/bpf.h \
+ | $(BUILD_DIR)/libbpf
+ $(Q)$(MAKE) $(submake_extras) -C $(BPFDIR) OUTPUT=$(BUILD_DIR)/libbpf/ \
+ EXTRA_CFLAGS='-g -O0 -fPIC' \
+ DESTDIR=$(SCRATCH_DIR) prefix= all install_headers
+
+$(DEFAULT_BPFTOOL): $(wildcard $(BPFTOOLDIR)/*.[ch] $(BPFTOOLDIR)/Makefile) \
+ $(HOST_BPFOBJ) | $(HOST_BUILD_DIR)/bpftool
+ $(Q)$(MAKE) $(submake_extras) -C $(BPFTOOLDIR) \
+ ARCH= CROSS_COMPILE= CC=$(HOSTCC) LD=$(HOSTLD) \
+ EXTRA_CFLAGS='-g -O0' \
+ OUTPUT=$(HOST_BUILD_DIR)/bpftool/ \
+ LIBBPF_OUTPUT=$(HOST_BUILD_DIR)/libbpf/ \
+ LIBBPF_DESTDIR=$(HOST_SCRATCH_DIR)/ \
+ prefix= DESTDIR=$(HOST_SCRATCH_DIR)/ install-bin
+
+$(INCLUDE_DIR)/vmlinux.h: $(VMLINUX_BTF) $(BPFTOOL) | $(INCLUDE_DIR)
+ifeq ($(VMLINUX_H),)
+ $(call msg,GEN,,$@)
+ $(Q)$(BPFTOOL) btf dump file $(VMLINUX_BTF) format c > $@
+else
+ $(call msg,CP,,$@)
+ $(Q)cp "$(VMLINUX_H)" $@
+endif
+
+%.bpf.o: %.bpf.c $(INCLUDE_DIR)/vmlinux.h scx_common.bpf.h user_exit_info.h \
+ | $(BPFOBJ)
+ $(call msg,CLNG-BPF,,$@)
+ $(Q)$(CLANG) $(BPF_CFLAGS) -target bpf -c $< -o $@
+
+%.skel.h: %.bpf.o $(BPFTOOL)
+ $(call msg,GEN-SKEL,,$@)
+ $(Q)$(BPFTOOL) gen object $(<:.o=.linked1.o) $<
+ $(Q)$(BPFTOOL) gen object $(<:.o=.linked2.o) $(<:.o=.linked1.o)
+ $(Q)$(BPFTOOL) gen object $(<:.o=.linked3.o) $(<:.o=.linked2.o)
+ $(Q)diff $(<:.o=.linked2.o) $(<:.o=.linked3.o)
+ $(Q)$(BPFTOOL) gen skeleton $(<:.o=.linked3.o) name $(<:.bpf.o=) > $@
+ $(Q)$(BPFTOOL) gen subskeleton $(<:.o=.linked3.o) name $(<:.bpf.o=) > $(@:.skel.h=.subskel.h)
+
+scx_example_dummy: scx_example_dummy.c scx_example_dummy.skel.h user_exit_info.h
+ $(CC) $(CFLAGS) -c $< -o [email protected]
+ $(CC) -o $@ [email protected] $(HOST_BPFOBJ) $(LDFLAGS)
+
+scx_example_qmap: scx_example_qmap.c scx_example_qmap.skel.h user_exit_info.h
+ $(CC) $(CFLAGS) -c $< -o [email protected]
+ $(CC) -o $@ [email protected] $(HOST_BPFOBJ) $(LDFLAGS)
+
+clean:
+ rm -rf $(SCRATCH_DIR) $(HOST_SCRATCH_DIR)
+ rm -f *.o *.bpf.o *.skel.h *.subskel.h
+ rm -f scx_example_dummy scx_example_qmap
+
+.PHONY: all clean
+
+# delete failed targets
+.DELETE_ON_ERROR:
+
+# keep intermediate (.skel.h, .bpf.o, etc) targets
+.SECONDARY:
diff --git a/tools/sched_ext/gnu/stubs.h b/tools/sched_ext/gnu/stubs.h
new file mode 100644
index 000000000000..719225b16626
--- /dev/null
+++ b/tools/sched_ext/gnu/stubs.h
@@ -0,0 +1 @@
+/* dummy .h to trick /usr/include/features.h to work with 'clang -target bpf' */
diff --git a/tools/sched_ext/scx_common.bpf.h b/tools/sched_ext/scx_common.bpf.h
new file mode 100644
index 000000000000..b40a4fc6a159
--- /dev/null
+++ b/tools/sched_ext/scx_common.bpf.h
@@ -0,0 +1,131 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2022 Tejun Heo <[email protected]>
+ * Copyright (c) 2022 David Vernet <[email protected]>
+ */
+#ifndef __SCHED_EXT_COMMON_BPF_H
+#define __SCHED_EXT_COMMON_BPF_H
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <linux/errno.h>
+#include "user_exit_info.h"
+
+/*
+ * Earlier versions of clang/pahole lost upper 32bits in 64bit enums which can
+ * lead to really confusing misbehaviors. Let's trigger a build failure.
+ */
+static inline void ___vmlinux_h_sanity_check___(void)
+{
+ _Static_assert(SCX_DSQ_FLAG_BUILTIN,
+ "bpftool generated vmlinux.h is missing high bits for 64bit enums, upgrade clang and pahole");
+}
+
+void scx_bpf_error_bstr(char *fmt, unsigned long long *data, u32 data_len) __ksym;
+
+static inline __attribute__((format(printf, 1, 2)))
+void ___scx_bpf_error_format_checker(const char *fmt, ...) {}
+
+/*
+ * scx_bpf_error() wraps the scx_bpf_error_bstr() kfunc with variadic arguments
+ * instead of an array of u64. Note that __param[] must have at least one
+ * element to keep the verifier happy.
+ */
+#define scx_bpf_error(fmt, args...) \
+({ \
+ static char ___fmt[] = fmt; \
+ unsigned long long ___param[___bpf_narg(args) ?: 1] = {}; \
+ \
+ _Pragma("GCC diagnostic push") \
+ _Pragma("GCC diagnostic ignored \"-Wint-conversion\"") \
+ ___bpf_fill(___param, args); \
+ _Pragma("GCC diagnostic pop") \
+ \
+ scx_bpf_error_bstr(___fmt, ___param, sizeof(___param)); \
+ \
+ ___scx_bpf_error_format_checker(fmt, ##args); \
+})
+
+/* BPF core task / cgroup kfunc helpers */
+struct task_struct *bpf_task_from_pid(s32 pid) __ksym;
+struct task_struct *bpf_task_acquire(struct task_struct *p) __ksym;
+void bpf_task_release(struct task_struct *p) __ksym;
+
+/* BPF core cpumask kfuncs */
+struct bpf_cpumask *bpf_cpumask_create(void) __ksym;
+struct bpf_cpumask *bpf_cpumask_acquire(struct bpf_cpumask *cpumask) __ksym;
+struct bpf_cpumask *bpf_cpumask_kptr_get(struct bpf_cpumask **map_value) __ksym;
+void bpf_cpumask_release(struct bpf_cpumask *cpumask) __ksym;
+bool bpf_cpumask_test_cpu(u32 cpu, const struct cpumask *cpumask) __ksym;
+void bpf_cpumask_set_cpu(u32 cpu, struct bpf_cpumask *cpumask) __ksym;
+void bpf_cpumask_clear_cpu(u32 cpu, struct bpf_cpumask *cpumask) __ksym;
+void bpf_cpumask_copy(struct bpf_cpumask *dst, const struct cpumask *src) __ksym;
+bool bpf_cpumask_and(struct bpf_cpumask *dst, const struct cpumask *src1,
+ const struct cpumask *src2) __ksym;
+u32 bpf_cpumask_first(const struct cpumask *cpumask) __ksym;
+
+s32 scx_bpf_create_dsq(u64 dsq_id, s32 node) __ksym;
+bool scx_bpf_consume(u64 dsq_id) __ksym;
+u32 scx_bpf_dispatch_nr_slots(void) __ksym;
+void scx_bpf_dispatch(struct task_struct *p, u64 dsq_id, u64 slice, u64 enq_flags) __ksym;
+s32 scx_bpf_dsq_nr_queued(u64 dsq_id) __ksym;
+bool scx_bpf_test_and_clear_cpu_idle(s32 cpu) __ksym;
+s32 scx_bpf_pick_idle_cpu(const cpumask_t *cpus_allowed) __ksym;
+const struct cpumask *scx_bpf_get_idle_percpu_cpumask(void) __ksym;
+const struct cpumask *scx_bpf_get_idle_smt_cpumask(void) __ksym;
+void scx_bpf_put_idle_cpumask(const struct cpumask *cpumask) __ksym;
+void scx_bpf_destroy_dsq(u64 dsq_id) __ksym;
+bool scx_bpf_task_running(const struct task_struct *p) __ksym;
+s32 scx_bpf_task_cpu(const struct task_struct *p) __ksym;
+
+#define PF_KTHREAD 0x00200000 /* I am a kernel thread */
+#define PF_EXITING 0x00000004
+#define CLOCK_MONOTONIC 1
+
+#define BPF_STRUCT_OPS(name, args...) \
+SEC("struct_ops/"#name) \
+BPF_PROG(name, ##args)
+
+#define BPF_STRUCT_OPS_SLEEPABLE(name, args...) \
+SEC("struct_ops.s/"#name) \
+BPF_PROG(name, ##args)
+
+/**
+ * MEMBER_VPTR - Obtain the verified pointer to a struct or array member
+ * @base: struct or array to index
+ * @member: dereferenced member (e.g. ->field, [idx0][idx1], ...)
+ *
+ * The verifier often gets confused by the instruction sequence the compiler
+ * generates for indexing struct fields or arrays. This macro forces the
+ * compiler to generate a code sequence which first calculates the byte offset,
+ * checks it against the struct or array size and add that byte offset to
+ * generate the pointer to the member to help the verifier.
+ *
+ * Ideally, we want to abort if the calculated offset is out-of-bounds. However,
+ * BPF currently doesn't support abort, so evaluate to NULL instead. The caller
+ * must check for NULL and take appropriate action to appease the verifier. To
+ * avoid confusing the verifier, it's best to check for NULL and dereference
+ * immediately.
+ *
+ * vptr = MEMBER_VPTR(my_array, [i][j]);
+ * if (!vptr)
+ * return error;
+ * *vptr = new_value;
+ */
+#define MEMBER_VPTR(base, member) (typeof(base member) *)({ \
+ u64 __base = (u64)base; \
+ u64 __addr = (u64)&(base member) - __base; \
+ asm volatile ( \
+ "if %0 <= %[max] goto +2\n" \
+ "%0 = 0\n" \
+ "goto +1\n" \
+ "%0 += %1\n" \
+ : "+r"(__addr) \
+ : "r"(__base), \
+ [max]"i"(sizeof(base) - sizeof(base member))); \
+ __addr; \
+})
+
+#endif /* __SCHED_EXT_COMMON_BPF_H */
diff --git a/tools/sched_ext/scx_example_dummy.bpf.c b/tools/sched_ext/scx_example_dummy.bpf.c
new file mode 100644
index 000000000000..ac7b490b5a39
--- /dev/null
+++ b/tools/sched_ext/scx_example_dummy.bpf.c
@@ -0,0 +1,56 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * A minimal dummy scheduler.
+ *
+ * In terms of scheduling, this behaves the same as not specifying any ops at
+ * all - a global FIFO. The only things it adds are the following niceties:
+ *
+ * - Statistics tracking how many are queued to local and global dsq's.
+ * - Termination notification for userspace.
+ *
+ * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2022 Tejun Heo <[email protected]>
+ * Copyright (c) 2022 David Vernet <[email protected]>
+ */
+#include "scx_common.bpf.h"
+
+char _license[] SEC("license") = "GPL";
+
+struct user_exit_info uei;
+
+struct {
+ __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+ __uint(key_size, sizeof(u32));
+ __uint(value_size, sizeof(u64));
+ __uint(max_entries, 2); /* [local, global] */
+} stats SEC(".maps");
+
+static void stat_inc(u32 idx)
+{
+ u64 *cnt_p = bpf_map_lookup_elem(&stats, &idx);
+ if (cnt_p)
+ (*cnt_p)++;
+}
+
+void BPF_STRUCT_OPS(dummy_enqueue, struct task_struct *p, u64 enq_flags)
+{
+ if (enq_flags & SCX_ENQ_LOCAL) {
+ stat_inc(0);
+ scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, enq_flags);
+ } else {
+ stat_inc(1);
+ scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
+ }
+}
+
+void BPF_STRUCT_OPS(dummy_exit, struct scx_exit_info *ei)
+{
+ uei_record(&uei, ei);
+}
+
+SEC(".struct_ops")
+struct sched_ext_ops dummy_ops = {
+ .enqueue = (void *)dummy_enqueue,
+ .exit = (void *)dummy_exit,
+ .name = "dummy",
+};
diff --git a/tools/sched_ext/scx_example_dummy.c b/tools/sched_ext/scx_example_dummy.c
new file mode 100644
index 000000000000..72881c881830
--- /dev/null
+++ b/tools/sched_ext/scx_example_dummy.c
@@ -0,0 +1,93 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2022 Tejun Heo <[email protected]>
+ * Copyright (c) 2022 David Vernet <[email protected]>
+ */
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <unistd.h>
+#include <signal.h>
+#include <assert.h>
+#include <libgen.h>
+#include <bpf/bpf.h>
+#include "user_exit_info.h"
+#include "scx_example_dummy.skel.h"
+
+const char help_fmt[] =
+"A minimal dummy sched_ext scheduler.\n"
+"\n"
+"See the top-level comment in .bpf.c for more details.\n"
+"\n"
+"Usage: %s\n"
+"\n"
+" -h Display this help and exit\n";
+
+static volatile int exit_req;
+
+static void sigint_handler(int dummy)
+{
+ exit_req = 1;
+}
+
+static void read_stats(struct scx_example_dummy *skel, u64 *stats)
+{
+ int nr_cpus = libbpf_num_possible_cpus();
+ u64 cnts[2][nr_cpus];
+ u32 idx;
+
+ memset(stats, 0, sizeof(stats[0]) * 2);
+
+ for (idx = 0; idx < 2; idx++) {
+ int ret, cpu;
+
+ ret = bpf_map_lookup_elem(bpf_map__fd(skel->maps.stats),
+ &idx, cnts[idx]);
+ if (ret < 0)
+ continue;
+ for (cpu = 0; cpu < nr_cpus; cpu++)
+ stats[idx] += cnts[idx][cpu];
+ }
+}
+
+int main(int argc, char **argv)
+{
+ struct scx_example_dummy *skel;
+ struct bpf_link *link;
+ u32 opt;
+
+ signal(SIGINT, sigint_handler);
+ signal(SIGTERM, sigint_handler);
+
+ libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
+
+ skel = scx_example_dummy__open();
+ assert(skel);
+
+ while ((opt = getopt(argc, argv, "h")) != -1) {
+ switch (opt) {
+ default:
+ fprintf(stderr, help_fmt, basename(argv[0]));
+ return opt != 'h';
+ }
+ }
+
+ assert(!scx_example_dummy__load(skel));
+
+ link = bpf_map__attach_struct_ops(skel->maps.dummy_ops);
+ assert(link);
+
+ while (!exit_req && !uei_exited(&skel->bss->uei)) {
+ u64 stats[2];
+
+ read_stats(skel, stats);
+ printf("local=%lu global=%lu\n", stats[0], stats[1]);
+ fflush(stdout);
+ sleep(1);
+ }
+
+ bpf_link__destroy(link);
+ uei_print(&skel->bss->uei);
+ scx_example_dummy__destroy(skel);
+ return 0;
+}
diff --git a/tools/sched_ext/scx_example_qmap.bpf.c b/tools/sched_ext/scx_example_qmap.bpf.c
new file mode 100644
index 000000000000..06a07c834b42
--- /dev/null
+++ b/tools/sched_ext/scx_example_qmap.bpf.c
@@ -0,0 +1,238 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * A simple five-level FIFO queue scheduler.
+ *
+ * There are five FIFOs implemented using BPF_MAP_TYPE_QUEUE. A task gets
+ * assigned to one depending on its compound weight. Each CPU round robins
+ * through the FIFOs and dispatches more from FIFOs with higher indices - 1 from
+ * queue0, 2 from queue1, 4 from queue2 and so on.
+ *
+ * This scheduler demonstrates:
+ *
+ * - BPF-side queueing using PIDs.
+ * - Sleepable per-task storage allocation using ops.prep_enable().
+ *
+ * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2022 Tejun Heo <[email protected]>
+ * Copyright (c) 2022 David Vernet <[email protected]>
+ */
+#include "scx_common.bpf.h"
+#include <linux/sched/prio.h>
+
+char _license[] SEC("license") = "GPL";
+
+const volatile u64 slice_ns = SCX_SLICE_DFL;
+
+u32 test_error_cnt;
+
+struct user_exit_info uei;
+
+struct qmap {
+ __uint(type, BPF_MAP_TYPE_QUEUE);
+ __uint(max_entries, 4096);
+ __type(value, u32);
+} queue0 SEC(".maps"),
+ queue1 SEC(".maps"),
+ queue2 SEC(".maps"),
+ queue3 SEC(".maps"),
+ queue4 SEC(".maps");
+
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
+ __uint(max_entries, 5);
+ __type(key, int);
+ __array(values, struct qmap);
+} queue_arr SEC(".maps") = {
+ .values = {
+ [0] = &queue0,
+ [1] = &queue1,
+ [2] = &queue2,
+ [3] = &queue3,
+ [4] = &queue4,
+ },
+};
+
+/* Per-task scheduling context */
+struct task_ctx {
+ bool force_local; /* Dispatch directly to local_dsq */
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
+ __uint(map_flags, BPF_F_NO_PREALLOC);
+ __type(key, int);
+ __type(value, struct task_ctx);
+} task_ctx_stor SEC(".maps");
+
+/* Per-cpu dispatch index and remaining count */
+struct {
+ __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+ __uint(max_entries, 2);
+ __type(key, u32);
+ __type(value, u64);
+} dispatch_idx_cnt SEC(".maps");
+
+/* Statistics */
+unsigned long nr_enqueued, nr_dispatched, nr_dequeued;
+
+s32 BPF_STRUCT_OPS(qmap_select_cpu, struct task_struct *p,
+ s32 prev_cpu, u64 wake_flags)
+{
+ struct task_ctx *tctx;
+ s32 cpu;
+
+ tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
+ if (!tctx) {
+ scx_bpf_error("task_ctx lookup failed");
+ return -ESRCH;
+ }
+
+ if (p->nr_cpus_allowed == 1 ||
+ scx_bpf_test_and_clear_cpu_idle(prev_cpu)) {
+ tctx->force_local = true;
+ return prev_cpu;
+ }
+
+ cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr);
+ if (cpu >= 0)
+ return cpu;
+
+ return prev_cpu;
+}
+
+static int weight_to_idx(u32 weight)
+{
+ /* Coarsely map the compound weight to a FIFO. */
+ if (weight <= 25)
+ return 0;
+ else if (weight <= 50)
+ return 1;
+ else if (weight < 200)
+ return 2;
+ else if (weight < 400)
+ return 3;
+ else
+ return 4;
+}
+
+void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
+{
+ struct task_ctx *tctx;
+ u32 pid = p->pid;
+ int idx = weight_to_idx(p->scx.weight);
+ void *ring;
+
+ if (test_error_cnt && !--test_error_cnt)
+ scx_bpf_error("test triggering error");
+
+ tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
+ if (!tctx) {
+ scx_bpf_error("task_ctx lookup failed");
+ return;
+ }
+
+ /* Is select_cpu() is telling us to enqueue locally? */
+ if (tctx->force_local) {
+ tctx->force_local = false;
+ scx_bpf_dispatch(p, SCX_DSQ_LOCAL, slice_ns, enq_flags);
+ return;
+ }
+
+ ring = bpf_map_lookup_elem(&queue_arr, &idx);
+ if (!ring) {
+ scx_bpf_error("failed to find ring %d", idx);
+ return;
+ }
+
+ /* Queue on the selected FIFO. If the FIFO overflows, punt to global. */
+ if (bpf_map_push_elem(ring, &pid, 0)) {
+ scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, slice_ns, enq_flags);
+ return;
+ }
+
+ __sync_fetch_and_add(&nr_enqueued, 1);
+}
+
+/*
+ * The BPF queue map doesn't support removal and sched_ext can handle spurious
+ * dispatches. qmap_dequeue() is only used to collect statistics.
+ */
+void BPF_STRUCT_OPS(qmap_dequeue, struct task_struct *p, u64 deq_flags)
+{
+ __sync_fetch_and_add(&nr_dequeued, 1);
+}
+
+void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
+{
+ u32 zero = 0, one = 1;
+ u64 *idx = bpf_map_lookup_elem(&dispatch_idx_cnt, &zero);
+ u64 *cnt = bpf_map_lookup_elem(&dispatch_idx_cnt, &one);
+ void *fifo;
+ s32 pid;
+ int i;
+
+ if (!idx || !cnt) {
+ scx_bpf_error("failed to lookup idx[%p], cnt[%p]", idx, cnt);
+ return;
+ }
+
+ for (i = 0; i < 5; i++) {
+ /* Advance the dispatch cursor and pick the fifo. */
+ if (!*cnt) {
+ *idx = (*idx + 1) % 5;
+ *cnt = 1 << *idx;
+ }
+ (*cnt)--;
+
+ fifo = bpf_map_lookup_elem(&queue_arr, idx);
+ if (!fifo) {
+ scx_bpf_error("failed to find ring %llu", *idx);
+ return;
+ }
+
+ /* Dispatch or advance. */
+ if (!bpf_map_pop_elem(fifo, &pid)) {
+ struct task_struct *p;
+
+ p = bpf_task_from_pid(pid);
+ if (p) {
+ __sync_fetch_and_add(&nr_dispatched, 1);
+ scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, slice_ns, 0);
+ bpf_task_release(p);
+ return;
+ }
+ }
+
+ *cnt = 0;
+ }
+}
+
+s32 BPF_STRUCT_OPS(qmap_prep_enable, struct task_struct *p,
+ struct scx_enable_args *args)
+{
+ /*
+ * @p is new. Let's ensure that its task_ctx is available. We can sleep
+ * in this function and the following will automatically use GFP_KERNEL.
+ */
+ if (bpf_task_storage_get(&task_ctx_stor, p, 0,
+ BPF_LOCAL_STORAGE_GET_F_CREATE))
+ return 0;
+ else
+ return -ENOMEM;
+}
+
+void BPF_STRUCT_OPS(qmap_exit, struct scx_exit_info *ei)
+{
+ uei_record(&uei, ei);
+}
+
+SEC(".struct_ops")
+struct sched_ext_ops qmap_ops = {
+ .select_cpu = (void *)qmap_select_cpu,
+ .enqueue = (void *)qmap_enqueue,
+ .dequeue = (void *)qmap_dequeue,
+ .dispatch = (void *)qmap_dispatch,
+ .prep_enable = (void *)qmap_prep_enable,
+ .exit = (void *)qmap_exit,
+ .name = "qmap",
+};
diff --git a/tools/sched_ext/scx_example_qmap.c b/tools/sched_ext/scx_example_qmap.c
new file mode 100644
index 000000000000..c6c74641a182
--- /dev/null
+++ b/tools/sched_ext/scx_example_qmap.c
@@ -0,0 +1,84 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2022 Tejun Heo <[email protected]>
+ * Copyright (c) 2022 David Vernet <[email protected]>
+ */
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <signal.h>
+#include <assert.h>
+#include <libgen.h>
+#include <bpf/bpf.h>
+#include "user_exit_info.h"
+#include "scx_example_qmap.skel.h"
+
+const char help_fmt[] =
+"A simple five-level FIFO queue sched_ext scheduler.\n"
+"\n"
+"See the top-level comment in .bpf.c for more details.\n"
+"\n"
+"Usage: %s [-s SLICE_US] [-e COUNT]\n"
+"\n"
+" -s SLICE_US Override slice duration\n"
+" -e COUNT Trigger scx_bpf_error() after COUNT enqueues\n"
+" -h Display this help and exit\n";
+
+static volatile int exit_req;
+
+static void sigint_handler(int dummy)
+{
+ exit_req = 1;
+}
+
+int main(int argc, char **argv)
+{
+ struct scx_example_qmap *skel;
+ struct bpf_link *link;
+ int opt;
+
+ signal(SIGINT, sigint_handler);
+ signal(SIGTERM, sigint_handler);
+
+ libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
+
+ skel = scx_example_qmap__open();
+ assert(skel);
+
+ while ((opt = getopt(argc, argv, "hs:e:tTd:")) != -1) {
+ switch (opt) {
+ case 's':
+ skel->rodata->slice_ns = strtoull(optarg, NULL, 0) * 1000;
+ break;
+ case 'e':
+ skel->bss->test_error_cnt = strtoull(optarg, NULL, 0);
+ break;
+ default:
+ fprintf(stderr, help_fmt, basename(argv[0]));
+ return opt != 'h';
+ }
+ }
+
+ assert(!scx_example_qmap__load(skel));
+
+ link = bpf_map__attach_struct_ops(skel->maps.qmap_ops);
+ assert(link);
+
+ while (!exit_req && !uei_exited(&skel->bss->uei)) {
+ long nr_enqueued = skel->bss->nr_enqueued;
+ long nr_dispatched = skel->bss->nr_dispatched;
+
+ printf("enq=%lu, dsp=%lu, delta=%ld, deq=%lu\n",
+ nr_enqueued, nr_dispatched, nr_enqueued - nr_dispatched,
+ skel->bss->nr_dequeued);
+ fflush(stdout);
+ sleep(1);
+ }
+
+ bpf_link__destroy(link);
+ uei_print(&skel->bss->uei);
+ scx_example_qmap__destroy(skel);
+ return 0;
+}
diff --git a/tools/sched_ext/user_exit_info.h b/tools/sched_ext/user_exit_info.h
new file mode 100644
index 000000000000..e701ef0e0b86
--- /dev/null
+++ b/tools/sched_ext/user_exit_info.h
@@ -0,0 +1,50 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Define struct user_exit_info which is shared between BPF and userspace parts
+ * to communicate exit status and other information.
+ *
+ * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2022 Tejun Heo <[email protected]>
+ * Copyright (c) 2022 David Vernet <[email protected]>
+ */
+#ifndef __USER_EXIT_INFO_H
+#define __USER_EXIT_INFO_H
+
+struct user_exit_info {
+ int type;
+ char reason[128];
+ char msg[1024];
+};
+
+#ifdef __bpf__
+
+#include "vmlinux.h"
+#include <bpf/bpf_core_read.h>
+
+static inline void uei_record(struct user_exit_info *uei,
+ const struct scx_exit_info *ei)
+{
+ bpf_probe_read_kernel_str(uei->reason, sizeof(uei->reason), ei->reason);
+ bpf_probe_read_kernel_str(uei->msg, sizeof(uei->msg), ei->msg);
+ /* use __sync to force memory barrier */
+ __sync_val_compare_and_swap(&uei->type, uei->type, ei->type);
+}
+
+#else /* !__bpf__ */
+
+static inline bool uei_exited(struct user_exit_info *uei)
+{
+ /* use __sync to force memory barrier */
+ return __sync_val_compare_and_swap(&uei->type, -1, -1);
+}
+
+static inline void uei_print(const struct user_exit_info *uei)
+{
+ fprintf(stderr, "EXIT: %s", uei->reason);
+ if (uei->msg[0] != '\0')
+ fprintf(stderr, " (%s)", uei->msg);
+ fputs("\n", stderr);
+}
+
+#endif /* __bpf__ */
+#endif /* __USER_EXIT_INFO_H */
--
2.39.1


2023-01-28 00:19:14

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 16/30] sched_ext: Implement runnable task stall watchdog

From: David Vernet <[email protected]>

The most common and critical way that a BPF scheduler can misbehave is by
failing to run runnable tasks for too long. This patch implements a
watchdog.

* All tasks record when they become runnable.

* A watchdog work periodically scans all runnable tasks. If any task has
stayed runnable for too long, the BPF scheduler is aborted.

* scheduler_tick() monitors whether the watchdog itself is stuck. If so, the
BPF scheduler is aborted.

Because the watchdog only scans the tasks which are currently runnable and
usually very infrequently, the overhead should be negligible.
scx_example_qmap is updated so that it can be told to stall user and/or
kernel tasks.

A detected task stall looks like the following:

sched_ext: BPF scheduler "qmap" errored, disabling
sched_ext: runnable task stall (dbus-daemon[953] failed to run for 6.478s)
scx_check_timeout_workfn+0x10e/0x1b0
process_one_work+0x287/0x560
worker_thread+0x234/0x420
kthread+0xe9/0x100
ret_from_fork+0x1f/0x30

A detected watchdog stall:

sched_ext: BPF scheduler "qmap" errored, disabling
sched_ext: runnable task stall (watchdog failed to check in for 5.001s)
scheduler_tick+0x2eb/0x340
update_process_times+0x7a/0x90
tick_sched_timer+0xd8/0x130
__hrtimer_run_queues+0x178/0x3b0
hrtimer_interrupt+0xfc/0x390
__sysvec_apic_timer_interrupt+0xb7/0x2b0
sysvec_apic_timer_interrupt+0x90/0xb0
asm_sysvec_apic_timer_interrupt+0x1b/0x20
default_idle+0x14/0x20
arch_cpu_idle+0xf/0x20
default_idle_call+0x50/0x90
do_idle+0xe8/0x240
cpu_startup_entry+0x1d/0x20
kernel_init+0x0/0x190
start_kernel+0x0/0x392
start_kernel+0x324/0x392
x86_64_start_reservations+0x2a/0x2c
x86_64_start_kernel+0x104/0x109
secondary_startup_64_no_verify+0xce/0xdb

Note that this patch exposes scx_ops_error[_type]() in kernel/sched/ext.h to
inline scx_notify_sched_tick().

v2: Julia Lawall noticed that the watchdog code was mixing msecs and
jiffies. Fix by using jiffies for everything.

Signed-off-by: David Vernet <[email protected]>
Reviewed-by: Tejun Heo <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
Cc: Julia Lawall <[email protected]>
---
include/linux/sched/ext.h | 13 +++
init/init_task.c | 2 +
kernel/sched/core.c | 3 +
kernel/sched/ext.c | 128 +++++++++++++++++++++++--
kernel/sched/ext.h | 25 +++++
kernel/sched/sched.h | 1 +
tools/sched_ext/scx_example_qmap.bpf.c | 12 +++
tools/sched_ext/scx_example_qmap.c | 14 ++-
8 files changed, 186 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index 988d1e30e26c..474a8c0a0b12 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -59,6 +59,7 @@ enum scx_exit_type {

SCX_EXIT_ERROR = 1024, /* runtime error, error msg contains details */
SCX_EXIT_ERROR_BPF, /* ERROR but triggered through scx_bpf_error() */
+ SCX_EXIT_ERROR_STALL, /* watchdog detected stalled runnable tasks */
};

/*
@@ -307,6 +308,15 @@ struct sched_ext_ops {
*/
u64 flags;

+ /**
+ * timeout_ms - The maximum amount of time, in milliseconds, that a
+ * runnable task should be able to wait before being scheduled. The
+ * maximum timeout may not exceed the default timeout of 30 seconds.
+ *
+ * Defaults to the maximum allowed timeout value of 30 seconds.
+ */
+ u32 timeout_ms;
+
/**
* name - BPF scheduler's name
*
@@ -340,6 +350,7 @@ enum scx_ent_flags {
SCX_TASK_OPS_PREPPED = 1 << 3, /* prepared for BPF scheduler enable */
SCX_TASK_OPS_ENABLED = 1 << 4, /* task has BPF scheduler enabled */

+ SCX_TASK_WATCHDOG_RESET = 1 << 5, /* task watchdog counter should be reset */
SCX_TASK_DEQD_FOR_SLEEP = 1 << 6, /* last dequeue was for SLEEP */

SCX_TASK_CURSOR = 1 << 7, /* iteration cursor, not a task */
@@ -369,12 +380,14 @@ enum scx_kf_mask {
struct sched_ext_entity {
struct scx_dispatch_q *dsq;
struct list_head dsq_node;
+ struct list_head watchdog_node;
u32 flags; /* protected by rq lock */
u32 weight;
s32 sticky_cpu;
s32 holding_cpu;
u32 kf_mask; /* see scx_kf_mask above */
atomic64_t ops_state;
+ unsigned long runnable_at;

/* BPF scheduler modifiable fields */

diff --git a/init/init_task.c b/init/init_task.c
index bdbc663107bf..913194aab623 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -106,9 +106,11 @@ struct task_struct init_task
#ifdef CONFIG_SCHED_CLASS_EXT
.scx = {
.dsq_node = LIST_HEAD_INIT(init_task.scx.dsq_node),
+ .watchdog_node = LIST_HEAD_INIT(init_task.scx.watchdog_node),
.sticky_cpu = -1,
.holding_cpu = -1,
.ops_state = ATOMIC_INIT(0),
+ .runnable_at = INITIAL_JIFFIES,
.slice = SCX_SLICE_DFL,
},
#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 804aa291b837..3f177c161c1e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4424,12 +4424,14 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
#ifdef CONFIG_SCHED_CLASS_EXT
p->scx.dsq = NULL;
INIT_LIST_HEAD(&p->scx.dsq_node);
+ INIT_LIST_HEAD(&p->scx.watchdog_node);
p->scx.flags = 0;
p->scx.weight = 0;
p->scx.sticky_cpu = -1;
p->scx.holding_cpu = -1;
p->scx.kf_mask = 0;
atomic64_set(&p->scx.ops_state, 0);
+ p->scx.runnable_at = INITIAL_JIFFIES;
p->scx.slice = SCX_SLICE_DFL;
#endif

@@ -5584,6 +5586,7 @@ void scheduler_tick(void)
if (sched_feat(LATENCY_WARN) && resched_latency)
resched_latency_warn(cpu, resched_latency);

+ scx_notify_sched_tick();
perf_event_task_tick();

#ifdef CONFIG_SMP
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index ebbf3f3c1cf7..1af74ea8ed42 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -9,6 +9,7 @@
enum scx_internal_consts {
SCX_NR_ONLINE_OPS = SCX_OP_IDX(init),
SCX_DSP_DFL_MAX_BATCH = 32,
+ SCX_WATCHDOG_MAX_TIMEOUT = 30 * HZ,
};

enum scx_ops_enable_state {
@@ -87,6 +88,23 @@ static struct scx_exit_info scx_exit_info;

static atomic64_t scx_nr_rejected = ATOMIC64_INIT(0);

+/*
+ * The maximum amount of time in jiffies that a task may be runnable without
+ * being scheduled on a CPU. If this timeout is exceeded, it will trigger
+ * scx_ops_error().
+ */
+unsigned long scx_watchdog_timeout;
+
+/*
+ * The last time the delayed work was run. This delayed work relies on
+ * ksoftirqd being able to run to service timer interrupts, so it's possible
+ * that this work itself could get wedged. To account for this, we check that
+ * it's not stalled in the timer tick, and trigger an error if it is.
+ */
+unsigned long scx_watchdog_timestamp = INITIAL_JIFFIES;
+
+static struct delayed_work scx_watchdog_work;
+
/* idle tracking */
#ifdef CONFIG_SMP
#ifdef CONFIG_CPUMASK_OFFSTACK
@@ -146,10 +164,6 @@ static DEFINE_PER_CPU(struct scx_dsp_ctx, scx_dsp_ctx);

void scx_bpf_dispatch(struct task_struct *p, u64 dsq_id, u64 slice,
u64 enq_flags);
-__printf(2, 3) static void scx_ops_error_type(enum scx_exit_type type,
- const char *fmt, ...);
-#define scx_ops_error(fmt, args...) \
- scx_ops_error_type(SCX_EXIT_ERROR, fmt, ##args)

struct scx_task_iter {
struct sched_ext_entity cursor;
@@ -668,6 +682,27 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
dispatch_enqueue(&scx_dsq_global, p, enq_flags);
}

+static bool watchdog_task_watched(const struct task_struct *p)
+{
+ return !list_empty(&p->scx.watchdog_node);
+}
+
+static void watchdog_watch_task(struct rq *rq, struct task_struct *p)
+{
+ lockdep_assert_rq_held(rq);
+ if (p->scx.flags & SCX_TASK_WATCHDOG_RESET)
+ p->scx.runnable_at = jiffies;
+ p->scx.flags &= ~SCX_TASK_WATCHDOG_RESET;
+ list_add_tail(&p->scx.watchdog_node, &rq->scx.watchdog_list);
+}
+
+static void watchdog_unwatch_task(struct task_struct *p, bool reset_timeout)
+{
+ list_del_init(&p->scx.watchdog_node);
+ if (reset_timeout)
+ p->scx.flags |= SCX_TASK_WATCHDOG_RESET;
+}
+
static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags)
{
int sticky_cpu = p->scx.sticky_cpu;
@@ -684,9 +719,12 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags
if (unlikely(enq_flags & ENQUEUE_RESTORE) && task_current(rq, p))
sticky_cpu = cpu_of(rq);

- if (p->scx.flags & SCX_TASK_QUEUED)
+ if (p->scx.flags & SCX_TASK_QUEUED) {
+ WARN_ON_ONCE(!watchdog_task_watched(p));
return;
+ }

+ watchdog_watch_task(rq, p);
p->scx.flags |= SCX_TASK_QUEUED;
rq->scx.nr_running++;
add_nr_running(rq, 1);
@@ -698,6 +736,8 @@ static void ops_dequeue(struct task_struct *p, u64 deq_flags)
{
u64 opss;

+ watchdog_unwatch_task(p, false);
+
/* acquire ensures that we see the preceding updates on QUEUED */
opss = atomic64_read_acquire(&p->scx.ops_state);

@@ -742,8 +782,10 @@ static void dequeue_task_scx(struct rq *rq, struct task_struct *p, int deq_flags
{
struct scx_rq *scx_rq = &rq->scx;

- if (!(p->scx.flags & SCX_TASK_QUEUED))
+ if (!(p->scx.flags & SCX_TASK_QUEUED)) {
+ WARN_ON_ONCE(watchdog_task_watched(p));
return;
+ }

ops_dequeue(p, deq_flags);

@@ -1256,6 +1298,8 @@ static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first)
}

p->se.exec_start = rq_clock_task(rq);
+
+ watchdog_unwatch_task(p, true);
}

static void put_prev_task_scx(struct rq *rq, struct task_struct *p)
@@ -1299,11 +1343,14 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p)
*/
if (p->scx.flags & SCX_TASK_BAL_KEEP) {
p->scx.flags &= ~SCX_TASK_BAL_KEEP;
+ watchdog_watch_task(rq, p);
dispatch_enqueue(&rq->scx.local_dsq, p, SCX_ENQ_HEAD);
return;
}

if (p->scx.flags & SCX_TASK_QUEUED) {
+ watchdog_watch_task(rq, p);
+
/*
* If @p has slice left and balance_scx() didn't tag it for
* keeping, @p is getting preempted by a higher priority
@@ -1530,6 +1577,49 @@ static void reset_idle_masks(void) {}

#endif /* CONFIG_SMP */

+static bool check_rq_for_timeouts(struct rq *rq)
+{
+ struct task_struct *p;
+ struct rq_flags rf;
+ bool timed_out = false;
+
+ rq_lock_irqsave(rq, &rf);
+ list_for_each_entry(p, &rq->scx.watchdog_list, scx.watchdog_node) {
+ unsigned long last_runnable = p->scx.runnable_at;
+
+ if (unlikely(time_after(jiffies,
+ last_runnable + scx_watchdog_timeout))) {
+ u32 dur_ms = jiffies_to_msecs(jiffies - last_runnable);
+
+ scx_ops_error_type(SCX_EXIT_ERROR_STALL,
+ "%s[%d] failed to run for %u.%03us",
+ p->comm, p->pid,
+ dur_ms / 1000, dur_ms % 1000);
+ timed_out = true;
+ break;
+ }
+ }
+ rq_unlock_irqrestore(rq, &rf);
+
+ return timed_out;
+}
+
+static void scx_watchdog_workfn(struct work_struct *work)
+{
+ int cpu;
+
+ scx_watchdog_timestamp = jiffies;
+
+ for_each_online_cpu(cpu) {
+ if (unlikely(check_rq_for_timeouts(cpu_rq(cpu))))
+ break;
+
+ cond_resched();
+ }
+ queue_delayed_work(system_unbound_wq, to_delayed_work(work),
+ scx_watchdog_timeout / 2);
+}
+
static void task_tick_scx(struct rq *rq, struct task_struct *curr, int queued)
{
update_curr_scx(rq);
@@ -1561,7 +1651,7 @@ static int scx_ops_prepare_task(struct task_struct *p, struct task_group *tg)
}
}

- p->scx.flags |= SCX_TASK_OPS_PREPPED;
+ p->scx.flags |= (SCX_TASK_OPS_PREPPED | SCX_TASK_WATCHDOG_RESET);
return 0;
}

@@ -1877,6 +1967,8 @@ static void scx_ops_disable_workfn(struct kthread_work *work)
break;
}

+ cancel_delayed_work_sync(&scx_watchdog_work);
+
switch (type) {
case SCX_EXIT_UNREG:
reason = "BPF scheduler unregistered";
@@ -1890,6 +1982,9 @@ static void scx_ops_disable_workfn(struct kthread_work *work)
case SCX_EXIT_ERROR_BPF:
reason = "scx_bpf_error";
break;
+ case SCX_EXIT_ERROR_STALL:
+ reason = "runnable task stall";
+ break;
default:
reason = "<UNKNOWN>";
}
@@ -2074,8 +2169,8 @@ static void scx_ops_error_irq_workfn(struct irq_work *irq_work)

static DEFINE_IRQ_WORK(scx_ops_error_irq_work, scx_ops_error_irq_workfn);

-__printf(2, 3) static void scx_ops_error_type(enum scx_exit_type type,
- const char *fmt, ...)
+__printf(2, 3) void scx_ops_error_type(enum scx_exit_type type,
+ const char *fmt, ...)
{
struct scx_exit_info *ei = &scx_exit_info;
int none = SCX_EXIT_NONE;
@@ -2174,6 +2269,14 @@ static int scx_ops_enable(struct sched_ext_ops *ops)
goto err_disable;
}

+ scx_watchdog_timeout = SCX_WATCHDOG_MAX_TIMEOUT;
+ if (ops->timeout_ms)
+ scx_watchdog_timeout = msecs_to_jiffies(ops->timeout_ms);
+
+ scx_watchdog_timestamp = jiffies;
+ queue_delayed_work(system_unbound_wq, &scx_watchdog_work,
+ scx_watchdog_timeout / 2);
+
/*
* Lock out forks before opening the floodgate so that they don't wander
* into the operations prematurely.
@@ -2433,6 +2536,11 @@ static int bpf_scx_init_member(const struct btf_type *t,
if (ret == 0)
return -EINVAL;
return 1;
+ case offsetof(struct sched_ext_ops, timeout_ms):
+ if (*(u32 *)(udata + moff) > SCX_WATCHDOG_MAX_TIMEOUT)
+ return -E2BIG;
+ ops->timeout_ms = *(u32 *)(udata + moff);
+ return 1;
}

return 0;
@@ -2530,9 +2638,11 @@ void __init init_sched_ext_class(void)
struct rq *rq = cpu_rq(cpu);

init_dsq(&rq->scx.local_dsq, SCX_DSQ_LOCAL);
+ INIT_LIST_HEAD(&rq->scx.watchdog_list);
}

register_sysrq_key('S', &sysrq_sched_ext_reset_op);
+ INIT_DELAYED_WORK(&scx_watchdog_work, scx_watchdog_workfn);
}


diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h
index f8d5682deacf..7dfa7b888487 100644
--- a/kernel/sched/ext.h
+++ b/kernel/sched/ext.h
@@ -56,6 +56,8 @@ enum scx_deq_flags {
extern const struct sched_class ext_sched_class;
extern const struct bpf_verifier_ops bpf_sched_ext_verifier_ops;
extern const struct file_operations sched_ext_fops;
+extern unsigned long scx_watchdog_timeout;
+extern unsigned long scx_watchdog_timestamp;

DECLARE_STATIC_KEY_FALSE(__scx_ops_enabled);
#define scx_enabled() static_branch_unlikely(&__scx_ops_enabled)
@@ -67,6 +69,28 @@ void scx_post_fork(struct task_struct *p);
void scx_cancel_fork(struct task_struct *p);
void init_sched_ext_class(void);

+__printf(2, 3) void scx_ops_error_type(enum scx_exit_type type,
+ const char *fmt, ...);
+#define scx_ops_error(fmt, args...) \
+ scx_ops_error_type(SCX_EXIT_ERROR, fmt, ##args)
+
+static inline void scx_notify_sched_tick(void)
+{
+ unsigned long last_check;
+
+ if (!scx_enabled())
+ return;
+
+ last_check = scx_watchdog_timestamp;
+ if (unlikely(time_after(jiffies, last_check + scx_watchdog_timeout))) {
+ u32 dur_ms = jiffies_to_msecs(jiffies - last_check);
+
+ scx_ops_error_type(SCX_EXIT_ERROR_STALL,
+ "watchdog failed to check in for %u.%03us",
+ dur_ms / 1000, dur_ms % 1000);
+ }
+}
+
static inline const struct sched_class *next_active_class(const struct sched_class *class)
{
class++;
@@ -98,6 +122,7 @@ static inline int scx_fork(struct task_struct *p) { return 0; }
static inline void scx_post_fork(struct task_struct *p) {}
static inline void scx_cancel_fork(struct task_struct *p) {}
static inline void init_sched_ext_class(void) {}
+static inline void scx_notify_sched_tick(void) {}

#define for_each_active_class for_each_class
#define for_balance_class_range for_class_range
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7778f4ce6b5b..112c2f127c95 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -687,6 +687,7 @@ struct cfs_rq {
#ifdef CONFIG_SCHED_CLASS_EXT
struct scx_rq {
struct scx_dispatch_q local_dsq;
+ struct list_head watchdog_list;
u64 ops_qseq;
u32 nr_running;
};
diff --git a/tools/sched_ext/scx_example_qmap.bpf.c b/tools/sched_ext/scx_example_qmap.bpf.c
index 06a07c834b42..b22b8d82846e 100644
--- a/tools/sched_ext/scx_example_qmap.bpf.c
+++ b/tools/sched_ext/scx_example_qmap.bpf.c
@@ -22,6 +22,8 @@
char _license[] SEC("license") = "GPL";

const volatile u64 slice_ns = SCX_SLICE_DFL;
+const volatile u32 stall_user_nth;
+const volatile u32 stall_kernel_nth;

u32 test_error_cnt;

@@ -117,11 +119,20 @@ static int weight_to_idx(u32 weight)

void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
{
+ static u32 user_cnt, kernel_cnt;
struct task_ctx *tctx;
u32 pid = p->pid;
int idx = weight_to_idx(p->scx.weight);
void *ring;

+ if (p->flags & PF_KTHREAD) {
+ if (stall_kernel_nth && !(++kernel_cnt % stall_kernel_nth))
+ return;
+ } else {
+ if (stall_user_nth && !(++user_cnt % stall_user_nth))
+ return;
+ }
+
if (test_error_cnt && !--test_error_cnt)
scx_bpf_error("test triggering error");

@@ -234,5 +245,6 @@ struct sched_ext_ops qmap_ops = {
.dispatch = (void *)qmap_dispatch,
.prep_enable = (void *)qmap_prep_enable,
.exit = (void *)qmap_exit,
+ .timeout_ms = 5000U,
.name = "qmap",
};
diff --git a/tools/sched_ext/scx_example_qmap.c b/tools/sched_ext/scx_example_qmap.c
index c6c74641a182..dd490a146b1a 100644
--- a/tools/sched_ext/scx_example_qmap.c
+++ b/tools/sched_ext/scx_example_qmap.c
@@ -20,10 +20,12 @@ const char help_fmt[] =
"\n"
"See the top-level comment in .bpf.c for more details.\n"
"\n"
-"Usage: %s [-s SLICE_US] [-e COUNT]\n"
+"Usage: %s [-s SLICE_US] [-e COUNT] [-t COUNT] [-T COUNT]\n"
"\n"
" -s SLICE_US Override slice duration\n"
" -e COUNT Trigger scx_bpf_error() after COUNT enqueues\n"
+" -t COUNT Stall every COUNT'th user thread\n"
+" -T COUNT Stall every COUNT'th kernel thread\n"
" -h Display this help and exit\n";

static volatile int exit_req;
@@ -47,13 +49,19 @@ int main(int argc, char **argv)
skel = scx_example_qmap__open();
assert(skel);

- while ((opt = getopt(argc, argv, "hs:e:tTd:")) != -1) {
+ while ((opt = getopt(argc, argv, "hs:e:t:T:d:")) != -1) {
switch (opt) {
case 's':
skel->rodata->slice_ns = strtoull(optarg, NULL, 0) * 1000;
break;
case 'e':
- skel->bss->test_error_cnt = strtoull(optarg, NULL, 0);
+ skel->bss->test_error_cnt = strtoul(optarg, NULL, 0);
+ break;
+ case 't':
+ skel->rodata->stall_user_nth = strtoul(optarg, NULL, 0);
+ break;
+ case 'T':
+ skel->rodata->stall_kernel_nth = strtoul(optarg, NULL, 0);
break;
default:
fprintf(stderr, help_fmt, basename(argv[0]));
--
2.39.1


2023-01-28 00:19:17

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 17/30] sched_ext: Allow BPF schedulers to disallow specific tasks from joining SCHED_EXT

BPF schedulers might not want to schedule certain tasks - e.g. kernel
threads. This patch adds p->scx.disallow which can be set by BPF schedulers
in such cases. The field can be changed anytime and setting it in
ops.prep_enable() guarantees that the task can never be scheduled by
sched_ext.

scx_example_qmap is updated with the -d option to disallow a specific PID:

# echo $$
1092
# egrep '(policy)|(ext\.enabled)' /proc/self/sched
policy : 0
ext.enabled : 0
# ./set-scx 1092
# egrep '(policy)|(ext\.enabled)' /proc/self/sched
policy : 7
ext.enabled : 0

Run "scx_example_qmap -d 1092" in another terminal.

# grep rejected /sys/kernel/debug/sched/ext
nr_rejected : 1
# egrep '(policy)|(ext\.enabled)' /proc/self/sched
policy : 0
ext.enabled : 0
# ./set-scx 1092
setparam failed for 1092 (Permission denied)

Signed-off-by: Tejun Heo <[email protected]>
Suggested-by: Barret Rhoden <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
include/linux/sched/ext.h | 12 ++++++++
kernel/sched/core.c | 4 +++
kernel/sched/ext.c | 38 ++++++++++++++++++++++++++
kernel/sched/ext.h | 3 ++
tools/sched_ext/scx_example_qmap.bpf.c | 4 +++
tools/sched_ext/scx_example_qmap.c | 8 +++++-
6 files changed, 68 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index 474a8c0a0b12..b4c4b83a07f6 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -399,6 +399,18 @@ struct sched_ext_entity {
*/
u64 slice;

+ /*
+ * If set, reject future sched_setscheduler(2) calls updating the policy
+ * to %SCHED_EXT with -%EACCES.
+ *
+ * If set from ops.prep_enable() and the task's policy is already
+ * %SCHED_EXT, which can happen while the BPF scheduler is being loaded
+ * or by inhering the parent's policy during fork, the task's policy is
+ * rejected and forcefully reverted to %SCHED_NORMAL. The number of such
+ * events are reported through /sys/kernel/debug/sched_ext::nr_rejected.
+ */
+ bool disallow; /* reject switching into SCX */
+
/* cold fields */
struct list_head tasks_node;
};
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3f177c161c1e..9e566e72c3f2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7598,6 +7598,10 @@ static int __sched_setscheduler(struct task_struct *p,
goto unlock;
}

+ retval = scx_check_setscheduler(p, policy);
+ if (retval)
+ goto unlock;
+
/*
* If not changing anything there's no need to proceed further,
* but store a possible modification of reset_on_fork.
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 1af74ea8ed42..b9d55c25cec9 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1641,6 +1641,8 @@ static int scx_ops_prepare_task(struct task_struct *p, struct task_group *tg)

WARN_ON_ONCE(p->scx.flags & SCX_TASK_OPS_PREPPED);

+ p->scx.disallow = false;
+
if (SCX_HAS_OP(prep_enable)) {
struct scx_enable_args args = { };

@@ -1651,6 +1653,27 @@ static int scx_ops_prepare_task(struct task_struct *p, struct task_group *tg)
}
}

+ if (p->scx.disallow) {
+ struct rq *rq;
+ struct rq_flags rf;
+
+ rq = task_rq_lock(p, &rf);
+
+ /*
+ * We're either in fork or load path and @p->policy will be
+ * applied right after. Reverting @p->policy here and rejecting
+ * %SCHED_EXT transitions from scx_check_setscheduler()
+ * guarantees that if ops.prep_enable() sets @p->disallow, @p
+ * can never be in SCX.
+ */
+ if (p->policy == SCHED_EXT) {
+ p->policy = SCHED_NORMAL;
+ atomic64_inc(&scx_nr_rejected);
+ }
+
+ task_rq_unlock(rq, p, &rf);
+ }
+
p->scx.flags |= (SCX_TASK_OPS_PREPPED | SCX_TASK_WATCHDOG_RESET);
return 0;
}
@@ -1796,6 +1819,18 @@ static void switching_to_scx(struct rq *rq, struct task_struct *p)
static void check_preempt_curr_scx(struct rq *rq, struct task_struct *p,int wake_flags) {}
static void switched_to_scx(struct rq *rq, struct task_struct *p) {}

+int scx_check_setscheduler(struct task_struct *p, int policy)
+{
+ lockdep_assert_rq_held(task_rq(p));
+
+ /* if disallow, reject transitioning into SCX */
+ if (scx_enabled() && READ_ONCE(p->scx.disallow) &&
+ p->policy != policy && policy == SCHED_EXT)
+ return -EACCES;
+
+ return 0;
+}
+
/*
* Omitted operations:
*
@@ -2479,6 +2514,9 @@ static int bpf_scx_btf_struct_access(struct bpf_verifier_log *log,
if (off >= offsetof(struct task_struct, scx.slice) &&
off + size <= offsetofend(struct task_struct, scx.slice))
return SCALAR_VALUE;
+ if (off >= offsetof(struct task_struct, scx.disallow) &&
+ off + size <= offsetofend(struct task_struct, scx.disallow))
+ return SCALAR_VALUE;
}

if (atype == BPF_READ)
diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h
index 7dfa7b888487..76c94babd19e 100644
--- a/kernel/sched/ext.h
+++ b/kernel/sched/ext.h
@@ -67,6 +67,7 @@ void scx_pre_fork(struct task_struct *p);
int scx_fork(struct task_struct *p);
void scx_post_fork(struct task_struct *p);
void scx_cancel_fork(struct task_struct *p);
+int scx_check_setscheduler(struct task_struct *p, int policy);
void init_sched_ext_class(void);

__printf(2, 3) void scx_ops_error_type(enum scx_exit_type type,
@@ -121,6 +122,8 @@ static inline void scx_pre_fork(struct task_struct *p) {}
static inline int scx_fork(struct task_struct *p) { return 0; }
static inline void scx_post_fork(struct task_struct *p) {}
static inline void scx_cancel_fork(struct task_struct *p) {}
+static inline int scx_check_setscheduler(struct task_struct *p,
+ int policy) { return 0; }
static inline void init_sched_ext_class(void) {}
static inline void scx_notify_sched_tick(void) {}

diff --git a/tools/sched_ext/scx_example_qmap.bpf.c b/tools/sched_ext/scx_example_qmap.bpf.c
index b22b8d82846e..46bc16ed301f 100644
--- a/tools/sched_ext/scx_example_qmap.bpf.c
+++ b/tools/sched_ext/scx_example_qmap.bpf.c
@@ -24,6 +24,7 @@ char _license[] SEC("license") = "GPL";
const volatile u64 slice_ns = SCX_SLICE_DFL;
const volatile u32 stall_user_nth;
const volatile u32 stall_kernel_nth;
+const volatile s32 disallow_tgid;

u32 test_error_cnt;

@@ -221,6 +222,9 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
s32 BPF_STRUCT_OPS(qmap_prep_enable, struct task_struct *p,
struct scx_enable_args *args)
{
+ if (p->tgid == disallow_tgid)
+ p->scx.disallow = true;
+
/*
* @p is new. Let's ensure that its task_ctx is available. We can sleep
* in this function and the following will automatically use GFP_KERNEL.
diff --git a/tools/sched_ext/scx_example_qmap.c b/tools/sched_ext/scx_example_qmap.c
index dd490a146b1a..dff9323dfd20 100644
--- a/tools/sched_ext/scx_example_qmap.c
+++ b/tools/sched_ext/scx_example_qmap.c
@@ -20,12 +20,13 @@ const char help_fmt[] =
"\n"
"See the top-level comment in .bpf.c for more details.\n"
"\n"
-"Usage: %s [-s SLICE_US] [-e COUNT] [-t COUNT] [-T COUNT]\n"
+"Usage: %s [-s SLICE_US] [-e COUNT] [-t COUNT] [-T COUNT] [-d PID]\n"
"\n"
" -s SLICE_US Override slice duration\n"
" -e COUNT Trigger scx_bpf_error() after COUNT enqueues\n"
" -t COUNT Stall every COUNT'th user thread\n"
" -T COUNT Stall every COUNT'th kernel thread\n"
+" -d PID Disallow a process from switching into SCHED_EXT (-1 for self)\n"
" -h Display this help and exit\n";

static volatile int exit_req;
@@ -63,6 +64,11 @@ int main(int argc, char **argv)
case 'T':
skel->rodata->stall_kernel_nth = strtoul(optarg, NULL, 0);
break;
+ case 'd':
+ skel->rodata->disallow_tgid = strtol(optarg, NULL, 0);
+ if (skel->rodata->disallow_tgid < 0)
+ skel->rodata->disallow_tgid = getpid();
+ break;
default:
fprintf(stderr, help_fmt, basename(argv[0]));
return opt != 'h';
--
2.39.1


2023-01-28 00:19:22

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 20/30] sched_ext: Make watchdog handle ops.dispatch() looping stall

The dispatch path retries if the local DSQ is still empty after
ops.dispatch() either dispatched or consumed a task. This is both out of
necessity and for convenience. It has to retry because the dispatch path
might lose the tasks to dequeue while the rq lock is released while trying
to migrate tasks across CPUs, and the retry mechanism makes ops.dispatch()
implementation easier as it only needs to make some forward progress each
iteration.

However, this makes it possible for ops.dispatch() to stall CPUs by
repeatedly dispatching ineligible tasks. If all CPUs are stalled that way,
the watchdog or sysrq handler can't run and the system can't be saved. Let's
address the issue by breaking out of the dispatch loop after 32 iterations.

It is unlikely but not impossible for ops.dispatch() to legitimately go over
the iteration limit. We want to come back to the dispatch path in such cases
as not doing so risks stalling the CPU by idling with runnable tasks
pending. As the previous task is still current in balance_scx(),
resched_curr() doesn't do anything - it will just get cleared. Let's instead
use scx_kick_bpf() which will trigger reschedule after switching to the next
task which will likely be the idle task.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
---
kernel/sched/ext.c | 17 +++++++++++++++++
tools/sched_ext/scx_example_qmap.bpf.c | 17 +++++++++++++++++
tools/sched_ext/scx_example_qmap.c | 8 ++++++--
3 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 098edfc56a9b..9bc625676bbc 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -9,6 +9,7 @@
enum scx_internal_consts {
SCX_NR_ONLINE_OPS = SCX_OP_IDX(init),
SCX_DSP_DFL_MAX_BATCH = 32,
+ SCX_DSP_MAX_LOOPS = 32,
SCX_WATCHDOG_MAX_TIMEOUT = 30 * HZ,
};

@@ -168,6 +169,7 @@ static DEFINE_PER_CPU(struct scx_dsp_ctx, scx_dsp_ctx);

void scx_bpf_dispatch(struct task_struct *p, u64 dsq_id, u64 slice,
u64 enq_flags);
+void scx_bpf_kick_cpu(s32 cpu, u64 flags);

struct scx_task_iter {
struct sched_ext_entity cursor;
@@ -1243,6 +1245,7 @@ static int balance_scx(struct rq *rq, struct task_struct *prev,
struct scx_rq *scx_rq = &rq->scx;
struct scx_dsp_ctx *dspc = this_cpu_ptr(&scx_dsp_ctx);
bool prev_on_scx = prev->sched_class == &ext_sched_class;
+ int nr_loops = SCX_DSP_MAX_LOOPS;

lockdep_assert_rq_held(rq);

@@ -1297,6 +1300,20 @@ static int balance_scx(struct rq *rq, struct task_struct *prev,
return 1;
if (consume_dispatch_q(rq, rf, &scx_dsq_global))
return 1;
+
+ /*
+ * ops.dispatch() can trap us in this loop by repeatedly
+ * dispatching ineligible tasks. Break out once in a while to
+ * allow the watchdog to run. As IRQ can't be enabled in
+ * balance(), we want to complete this scheduling cycle and then
+ * start a new one. IOW, we want to call resched_curr() on the
+ * next, most likely idle, task, not the current one. Use
+ * scx_bpf_kick_cpu() for deferred kicking.
+ */
+ if (unlikely(!--nr_loops)) {
+ scx_bpf_kick_cpu(cpu_of(rq), 0);
+ break;
+ }
} while (dspc->nr_tasks);

return 0;
diff --git a/tools/sched_ext/scx_example_qmap.bpf.c b/tools/sched_ext/scx_example_qmap.bpf.c
index ec8c4ee7ef16..e968a9b341a4 100644
--- a/tools/sched_ext/scx_example_qmap.bpf.c
+++ b/tools/sched_ext/scx_example_qmap.bpf.c
@@ -25,6 +25,7 @@ const volatile u64 slice_ns = SCX_SLICE_DFL;
const volatile bool switch_all;
const volatile u32 stall_user_nth;
const volatile u32 stall_kernel_nth;
+const volatile u32 dsp_inf_loop_after;
const volatile s32 disallow_tgid;

u32 test_error_cnt;
@@ -184,6 +185,22 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
s32 pid;
int i;

+ if (dsp_inf_loop_after && nr_dispatched > dsp_inf_loop_after) {
+ struct task_struct *p;
+
+ /*
+ * PID 2 should be kthreadd which should mostly be idle and off
+ * the scheduler. Let's keep dispatching it to force the kernel
+ * to call this function over and over again.
+ */
+ p = bpf_task_from_pid(2);
+ if (p) {
+ scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, slice_ns, 0);
+ bpf_task_release(p);
+ return;
+ }
+ }
+
if (!idx || !cnt) {
scx_bpf_error("failed to lookup idx[%p], cnt[%p]", idx, cnt);
return;
diff --git a/tools/sched_ext/scx_example_qmap.c b/tools/sched_ext/scx_example_qmap.c
index 30633122e6d5..820fe50bf43c 100644
--- a/tools/sched_ext/scx_example_qmap.c
+++ b/tools/sched_ext/scx_example_qmap.c
@@ -20,12 +20,13 @@ const char help_fmt[] =
"\n"
"See the top-level comment in .bpf.c for more details.\n"
"\n"
-"Usage: %s [-a] [-s SLICE_US] [-e COUNT] [-t COUNT] [-T COUNT] [-d PID]\n"
+"Usage: %s [-a] [-s SLICE_US] [-e COUNT] [-t COUNT] [-T COUNT] [-l COUNT] [-d PID]\n"
"\n"
" -s SLICE_US Override slice duration\n"
" -e COUNT Trigger scx_bpf_error() after COUNT enqueues\n"
" -t COUNT Stall every COUNT'th user thread\n"
" -T COUNT Stall every COUNT'th kernel thread\n"
+" -l COUNT Trigger dispatch infinite looping after COUNT dispatches\n"
" -d PID Disallow a process from switching into SCHED_EXT (-1 for self)\n"
" -h Display this help and exit\n";

@@ -50,7 +51,7 @@ int main(int argc, char **argv)
skel = scx_example_qmap__open();
assert(skel);

- while ((opt = getopt(argc, argv, "ahs:e:t:T:d:")) != -1) {
+ while ((opt = getopt(argc, argv, "ahs:e:t:T:l:d:")) != -1) {
switch (opt) {
case 'a':
skel->rodata->switch_all = true;
@@ -67,6 +68,9 @@ int main(int argc, char **argv)
case 'T':
skel->rodata->stall_kernel_nth = strtoul(optarg, NULL, 0);
break;
+ case 'l':
+ skel->rodata->dsp_inf_loop_after = strtoul(optarg, NULL, 0);
+ break;
case 'd':
skel->rodata->disallow_tgid = strtol(optarg, NULL, 0);
if (skel->rodata->disallow_tgid < 0)
--
2.39.1


2023-01-28 00:19:27

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 22/30] sched_ext: Implement tickless support

Allow BPF schedulers to indicate tickless operation by setting p->scx.slice
to SCX_SLICE_INF. A CPU whose current task has infinte slice goes into
tickless operation.

scx_example_central is updated to use tickless operations for all tasks and
instead use a BPF timer to expire slices. This also uses the SCX_ENQ_PREEMPT
and task state tracking added by the previous patches.

Currently, there is no way to pin the timer on the central CPU, so it may
end up on one of the worker CPUs; however, outside of that, the worker CPUs
can go tickless both while running sched_ext tasks and idling.

With schbench running, scx_example_central shows:

root@test ~# grep ^LOC /proc/interrupts; sleep 10; grep ^LOC /proc/interrupts
LOC: 142024 656 664 449 Local timer interrupts
LOC: 161663 663 665 449 Local timer interrupts

Without it:

root@test ~ [SIGINT]# grep ^LOC /proc/interrupts; sleep 10; grep ^LOC /proc/interrupts
LOC: 188778 3142 3793 3993 Local timer interrupts
LOC: 198993 5314 6323 6438 Local timer interrupts

While scx_example_central itself is too barebone to be useful as a
production scheduler, a more featureful central scheduler can be built using
the same approach. Google's experience shows that such an approach can have
significant benefits for certain applications such as VM hosting.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
include/linux/sched/ext.h | 1 +
kernel/sched/core.c | 9 +-
kernel/sched/ext.c | 43 +++++++-
kernel/sched/ext.h | 2 +
kernel/sched/sched.h | 6 ++
tools/sched_ext/scx_example_central.bpf.c | 118 ++++++++++++++++++++--
tools/sched_ext/scx_example_central.c | 3 +-
7 files changed, 170 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index 338b41cd79fa..8f19ee7e5433 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -19,6 +19,7 @@ enum scx_consts {
SCX_EXIT_MSG_LEN = 1024,

SCX_SLICE_DFL = 20 * NSEC_PER_MSEC,
+ SCX_SLICE_INF = U64_MAX, /* infinite, implies nohz */
};

/*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5b68b822312b..d2419bd4fb5e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1200,13 +1200,16 @@ bool sched_can_stop_tick(struct rq *rq)
return true;

/*
- * If there are no DL,RR/FIFO tasks, there must only be CFS tasks left;
- * if there's more than one we need the tick for involuntary
- * preemption.
+ * If there are no DL,RR/FIFO tasks, there must only be CFS or SCX tasks
+ * left. For CFS, if there's more than one we need the tick for
+ * involuntary preemption. For SCX, ask.
*/
if (!scx_switched_all() && rq->nr_running > 1)
return false;

+ if (scx_enabled() && !scx_can_stop_tick(rq))
+ return false;
+
return true;
}
#endif /* CONFIG_NO_HZ_FULL */
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 4acaf39ea879..5e94f9604e23 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -447,7 +447,8 @@ static void update_curr_scx(struct rq *rq)
account_group_exec_runtime(curr, delta_exec);
cgroup_account_cputime(curr, delta_exec);

- curr->scx.slice -= min(curr->scx.slice, delta_exec);
+ if (curr->scx.slice != SCX_SLICE_INF)
+ curr->scx.slice -= min(curr->scx.slice, delta_exec);
}

static void dispatch_enqueue(struct scx_dispatch_q *dsq, struct task_struct *p,
@@ -1356,6 +1357,20 @@ static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first)
scx_ops.running(p);

watchdog_unwatch_task(p, true);
+
+ /*
+ * @p is getting newly scheduled or got kicked after someone updated its
+ * slice. Refresh whether tick can be stopped. See can_stop_tick_scx().
+ */
+ if ((p->scx.slice == SCX_SLICE_INF) !=
+ (bool)(rq->scx.flags & SCX_RQ_CAN_STOP_TICK)) {
+ if (p->scx.slice == SCX_SLICE_INF)
+ rq->scx.flags |= SCX_RQ_CAN_STOP_TICK;
+ else
+ rq->scx.flags &= ~SCX_RQ_CAN_STOP_TICK;
+
+ sched_update_tick_dependency(rq);
+ }
}

static void put_prev_task_scx(struct rq *rq, struct task_struct *p)
@@ -1891,6 +1906,26 @@ int scx_check_setscheduler(struct task_struct *p, int policy)
return 0;
}

+#ifdef CONFIG_NO_HZ_FULL
+bool scx_can_stop_tick(struct rq *rq)
+{
+ struct task_struct *p = rq->curr;
+
+ if (scx_ops_disabling())
+ return false;
+
+ if (p->sched_class != &ext_sched_class)
+ return true;
+
+ /*
+ * @rq can dispatch from different DSQs, so we can't tell whether it
+ * needs the tick or not by looking at nr_running. Allow stopping ticks
+ * iff the BPF scheduler indicated so. See set_next_task_scx().
+ */
+ return rq->scx.flags & SCX_RQ_CAN_STOP_TICK;
+}
+#endif
+
/*
* Omitted operations:
*
@@ -2051,7 +2086,7 @@ static void scx_ops_disable_workfn(struct kthread_work *work)
struct rhashtable_iter rht_iter;
struct scx_dispatch_q *dsq;
const char *reason;
- int i, type;
+ int i, cpu, type;

type = atomic_read(&scx_exit_type);
while (true) {
@@ -2148,6 +2183,10 @@ static void scx_ops_disable_workfn(struct kthread_work *work)
scx_task_iter_exit(&sti);
spin_unlock_irq(&scx_tasks_lock);

+ /* kick all CPUs to restore ticks */
+ for_each_possible_cpu(cpu)
+ resched_cpu(cpu);
+
forward_progress_guaranteed:
/*
* Here, every runnable task is guaranteed to make forward progress and
diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h
index 0b04626e8ca2..9c9284f91e38 100644
--- a/kernel/sched/ext.h
+++ b/kernel/sched/ext.h
@@ -82,6 +82,7 @@ int scx_fork(struct task_struct *p);
void scx_post_fork(struct task_struct *p);
void scx_cancel_fork(struct task_struct *p);
int scx_check_setscheduler(struct task_struct *p, int policy);
+bool scx_can_stop_tick(struct rq *rq);
void init_sched_ext_class(void);

__printf(2, 3) void scx_ops_error_type(enum scx_exit_type type,
@@ -141,6 +142,7 @@ static inline void scx_post_fork(struct task_struct *p) {}
static inline void scx_cancel_fork(struct task_struct *p) {}
static inline int scx_check_setscheduler(struct task_struct *p,
int policy) { return 0; }
+static inline bool scx_can_stop_tick(struct rq *rq) { return true; }
static inline void init_sched_ext_class(void) {}
static inline void scx_notify_sched_tick(void) {}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3c16caecd3a5..2447af22783c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -685,11 +685,17 @@ struct cfs_rq {
};

#ifdef CONFIG_SCHED_CLASS_EXT
+/* scx_rq->flags, protected by the rq lock */
+enum scx_rq_flags {
+ SCX_RQ_CAN_STOP_TICK = 1 << 0,
+};
+
struct scx_rq {
struct scx_dispatch_q local_dsq;
struct list_head watchdog_list;
u64 ops_qseq;
u32 nr_running;
+ u32 flags;
cpumask_var_t cpus_to_kick;
cpumask_var_t cpus_to_preempt;
struct irq_work kick_cpus_irq_work;
diff --git a/tools/sched_ext/scx_example_central.bpf.c b/tools/sched_ext/scx_example_central.bpf.c
index 6b246bf308d0..376486674ddb 100644
--- a/tools/sched_ext/scx_example_central.bpf.c
+++ b/tools/sched_ext/scx_example_central.bpf.c
@@ -13,7 +13,26 @@
* through per-CPU BPF queues. The current design is chosen to maximally
* utilize and verify various scx mechanisms such as LOCAL_ON dispatching.
*
- * b. Preemption
+ * b. Tickless operation
+ *
+ * All tasks are dispatched with the infinite slice which allows stopping the
+ * ticks on CONFIG_NO_HZ_FULL kernels running with the proper nohz_full
+ * parameter. The tickless operation can be observed through
+ * /proc/interrupts.
+ *
+ * Periodic switching is enforced by a periodic timer checking all CPUs and
+ * preempting them as necessary. Unfortunately, BPF timer currently doesn't
+ * have a way to pin to a specific CPU, so the periodic timer isn't pinned to
+ * the central CPU.
+ *
+ * c. Preemption
+ *
+ * Kthreads are unconditionally queued to the head of a matching local dsq
+ * and dispatched with SCX_DSQ_PREEMPT. This ensures that a kthread is always
+ * prioritized over user threads, which is required for ensuring forward
+ * progress as e.g. the periodic timer may run on a ksoftirqd and if the
+ * ksoftirqd gets starved by a user thread, there may not be anything else to
+ * vacate that user thread.
*
* SCX_KICK_PREEMPT is used to trigger scheduling and CPUs to move to the
* next tasks.
@@ -38,7 +57,7 @@ const volatile s32 central_cpu;
const volatile u32 nr_cpu_ids;

u64 nr_total, nr_locals, nr_queued, nr_lost_pids;
-u64 nr_dispatches, nr_mismatches, nr_retries;
+u64 nr_timers, nr_dispatches, nr_mismatches, nr_retries;
u64 nr_overflows;

struct user_exit_info uei;
@@ -51,6 +70,7 @@ struct {

/* can't use percpu map due to bad lookups */
static bool cpu_gimme_task[MAX_CPUS];
+static u64 cpu_started_at[MAX_CPUS];

struct central_timer {
struct bpf_timer timer;
@@ -86,9 +106,22 @@ void BPF_STRUCT_OPS(central_enqueue, struct task_struct *p, u64 enq_flags)

__sync_fetch_and_add(&nr_total, 1);

+ /*
+ * Push per-cpu kthreads at the head of local dsq's and preempt the
+ * corresponding CPU. This ensures that e.g. ksoftirqd isn't blocked
+ * behind other threads which is necessary for forward progress
+ * guarantee as we depend on the BPF timer which may run from ksoftirqd.
+ */
+ if ((p->flags & PF_KTHREAD) && p->nr_cpus_allowed == 1) {
+ __sync_fetch_and_add(&nr_locals, 1);
+ scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_INF,
+ enq_flags | SCX_ENQ_PREEMPT);
+ return;
+ }
+
if (bpf_map_push_elem(&central_q, &pid, 0)) {
__sync_fetch_and_add(&nr_overflows, 1);
- scx_bpf_dispatch(p, FALLBACK_DSQ_ID, SCX_SLICE_DFL, enq_flags);
+ scx_bpf_dispatch(p, FALLBACK_DSQ_ID, SCX_SLICE_INF, enq_flags);
return;
}

@@ -122,13 +155,13 @@ static int dispatch_a_task_loopfn(u32 idx, void *data)
*/
if (!bpf_cpumask_test_cpu(cpu, p->cpus_ptr)) {
__sync_fetch_and_add(&nr_mismatches, 1);
- scx_bpf_dispatch(p, FALLBACK_DSQ_ID, SCX_SLICE_DFL, 0);
+ scx_bpf_dispatch(p, FALLBACK_DSQ_ID, SCX_SLICE_INF, 0);
bpf_task_release(p);
return 0;
}

/* dispatch to the local and mark that @cpu doesn't need more tasks */
- scx_bpf_dispatch(p, SCX_DSQ_LOCAL_ON | cpu, SCX_SLICE_DFL, 0);
+ scx_bpf_dispatch(p, SCX_DSQ_LOCAL_ON | cpu, SCX_SLICE_INF, 0);

if (cpu != central_cpu)
scx_bpf_kick_cpu(cpu, 0);
@@ -200,12 +233,83 @@ void BPF_STRUCT_OPS(central_dispatch, s32 cpu, struct task_struct *prev)
}
}

+void BPF_STRUCT_OPS(central_running, struct task_struct *p)
+{
+ s32 cpu = scx_bpf_task_cpu(p);
+ u64 *started_at = MEMBER_VPTR(cpu_started_at, [cpu]);
+ if (started_at)
+ *started_at = bpf_ktime_get_ns() ?: 1; /* 0 indicates idle */
+}
+
+void BPF_STRUCT_OPS(central_stopping, struct task_struct *p, bool runnable)
+{
+ s32 cpu = scx_bpf_task_cpu(p);
+ u64 *started_at = MEMBER_VPTR(cpu_started_at, [cpu]);
+ if (started_at)
+ *started_at = 0;
+}
+
+static int kick_cpus_loopfn(u32 idx, void *data)
+{
+ s32 cpu = (nr_timers + idx) % nr_cpu_ids;
+ u64 *nr_to_kick = data;
+ u64 now = bpf_ktime_get_ns();
+ u64 *started_at;
+ s32 pid;
+
+ if (cpu == central_cpu)
+ goto kick;
+
+ /* kick iff there's something pending */
+ if (scx_bpf_dsq_nr_queued(FALLBACK_DSQ_ID) ||
+ scx_bpf_dsq_nr_queued(SCX_DSQ_LOCAL_ON | cpu))
+ ;
+ else if (*nr_to_kick)
+ (*nr_to_kick)--;
+ else
+ return 0;
+
+ /* and the current one exhausted its slice */
+ started_at = MEMBER_VPTR(cpu_started_at, [cpu]);
+ if (started_at && *started_at &&
+ vtime_before(now, *started_at + SCX_SLICE_DFL))
+ return 0;
+kick:
+ scx_bpf_kick_cpu(cpu, SCX_KICK_PREEMPT);
+ return 0;
+}
+
+static int central_timerfn(void *map, int *key, struct bpf_timer *timer)
+{
+ u64 nr_to_kick = nr_queued;
+
+ bpf_loop(nr_cpu_ids, kick_cpus_loopfn, &nr_to_kick, 0);
+ bpf_timer_start(timer, TIMER_INTERVAL_NS, 0);
+ __sync_fetch_and_add(&nr_timers, 1);
+ return 0;
+}
+
int BPF_STRUCT_OPS_SLEEPABLE(central_init)
{
+ u32 key = 0;
+ struct bpf_timer *timer;
+ int ret;
+
if (switch_all)
scx_bpf_switch_all();

- return scx_bpf_create_dsq(FALLBACK_DSQ_ID, -1);
+ ret = scx_bpf_create_dsq(FALLBACK_DSQ_ID, -1);
+ if (ret)
+ return ret;
+
+ timer = bpf_map_lookup_elem(&central_timer, &key);
+ if (!timer)
+ return -ESRCH;
+
+ bpf_timer_init(timer, &central_timer, CLOCK_MONOTONIC);
+ bpf_timer_set_callback(timer, central_timerfn);
+ ret = bpf_timer_start(timer, TIMER_INTERVAL_NS, 0);
+ return ret;
}

void BPF_STRUCT_OPS(central_exit, struct scx_exit_info *ei)
@@ -225,6 +329,8 @@ struct sched_ext_ops central_ops = {
.select_cpu = (void *)central_select_cpu,
.enqueue = (void *)central_enqueue,
.dispatch = (void *)central_dispatch,
+ .running = (void *)central_running,
+ .stopping = (void *)central_stopping,
.init = (void *)central_init,
.exit = (void *)central_exit,
.name = "central",
diff --git a/tools/sched_ext/scx_example_central.c b/tools/sched_ext/scx_example_central.c
index 14f6598c03df..fc01e5149371 100644
--- a/tools/sched_ext/scx_example_central.c
+++ b/tools/sched_ext/scx_example_central.c
@@ -76,7 +76,8 @@ int main(int argc, char **argv)
skel->bss->nr_locals,
skel->bss->nr_queued,
skel->bss->nr_lost_pids);
- printf(" dispatch:%10lu mismatch:%10lu retry:%10lu\n",
+ printf("timer :%10lu dispatch:%10lu mismatch:%10lu retry:%10lu\n",
+ skel->bss->nr_timers,
skel->bss->nr_dispatches,
skel->bss->nr_mismatches,
skel->bss->nr_retries);
--
2.39.1


2023-01-28 00:19:31

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 19/30] sched_ext: Implement scx_bpf_kick_cpu() and task preemption support

It's often useful to wake up and/or trigger reschedule on other CPUs. This
patch adds scx_bpf_kick_cpu() kfunc helper that BPF scheduler can call to
kick the target CPU into the scheduling path.

As a sched_ext task relinquishes its CPU only after its slice is depleted,
this patch also adds SCX_KICK_PREEMPT and SCX_ENQ_PREEMPT which clears the
slice of the target CPU's current task to guarantee that sched_ext's
scheduling path runs on the CPU.

This patch also adds a new example scheduler, scx_example_central, which
demonstrates central scheduling where one CPU is responsible for making all
scheduling decisions in the system. The central CPU makes scheduling
decisions for all CPUs in the system, queues tasks on the appropriate local
dsq's and preempts the worker CPUs. The worker CPUs in turn preempt the
central CPU when it needs tasks to run.

Currently, every CPU depends on its own tick to expire the current task. A
follow-up patch implementing tickless support for sched_ext will allow the
worker CPUs to go full tickless so that they can run completely undisturbed.

v2: * Julia Lawall reported that scx_example_central can overflow the
dispatch buffer and malfunction. As scheduling for other CPUs can't be
handled by the automatic retry mechanism, fix by implementing an
explicit overflow and retry handling.

* Updated to use generic BPF cpumask helpers.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
Cc: Julia Lawall <[email protected]>
---
include/linux/sched/ext.h | 4 +
kernel/sched/ext.c | 82 +++++++-
kernel/sched/ext.h | 12 ++
kernel/sched/sched.h | 3 +
tools/sched_ext/.gitignore | 1 +
tools/sched_ext/Makefile | 8 +-
tools/sched_ext/scx_common.bpf.h | 1 +
tools/sched_ext/scx_example_central.bpf.c | 231 ++++++++++++++++++++++
tools/sched_ext/scx_example_central.c | 93 +++++++++
9 files changed, 430 insertions(+), 5 deletions(-)
create mode 100644 tools/sched_ext/scx_example_central.bpf.c
create mode 100644 tools/sched_ext/scx_example_central.c

diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index b4c4b83a07f6..10cd3ede5ae5 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -396,6 +396,10 @@ struct sched_ext_entity {
* scx_bpf_dispatch() but can also be modified directly by the BPF
* scheduler. Automatically decreased by SCX as the task executes. On
* depletion, a scheduling event is triggered.
+ *
+ * This value is cleared to zero if the task is preempted by
+ * %SCX_KICK_PREEMPT and shouldn't be used to determine how long the
+ * task ran. Use p->se.sum_exec_runtime instead.
*/
u64 slice;

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 63f0a3cf2d53..098edfc56a9b 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -466,7 +466,7 @@ static void dispatch_enqueue(struct scx_dispatch_q *dsq, struct task_struct *p,
}
}

- if (enq_flags & SCX_ENQ_HEAD)
+ if (enq_flags & (SCX_ENQ_HEAD | SCX_ENQ_PREEMPT))
list_add(&p->scx.dsq_node, &dsq->fifo);
else
list_add_tail(&p->scx.dsq_node, &dsq->fifo);
@@ -482,8 +482,16 @@ static void dispatch_enqueue(struct scx_dispatch_q *dsq, struct task_struct *p,

if (is_local) {
struct rq *rq = container_of(dsq, struct rq, scx.local_dsq);
+ bool preempt = false;

- if (sched_class_above(&ext_sched_class, rq->curr->sched_class))
+ if ((enq_flags & SCX_ENQ_PREEMPT) && p != rq->curr &&
+ rq->curr->sched_class == &ext_sched_class) {
+ rq->curr->scx.slice = 0;
+ preempt = true;
+ }
+
+ if (preempt || sched_class_above(&ext_sched_class,
+ rq->curr->sched_class))
resched_curr(rq);
} else {
raw_spin_unlock(&dsq->lock);
@@ -1839,7 +1847,9 @@ int scx_check_setscheduler(struct task_struct *p, int policy)
* Omitted operations:
*
* - check_preempt_curr: NOOP as it isn't useful in the wakeup path because the
- * task isn't tied to the CPU at that point.
+ * task isn't tied to the CPU at that point. Preemption is implemented by
+ * resetting the victim task's slice to 0 and triggering reschedule on the
+ * target CPU.
*
* - migrate_task_rq: Unncessary as task to cpu mapping is transient.
*
@@ -2672,6 +2682,32 @@ static const struct sysrq_key_op sysrq_sched_ext_reset_op = {
.enable_mask = SYSRQ_ENABLE_RTNICE,
};

+static void kick_cpus_irq_workfn(struct irq_work *irq_work)
+{
+ struct rq *this_rq = this_rq();
+ int this_cpu = cpu_of(this_rq);
+ int cpu;
+
+ for_each_cpu(cpu, this_rq->scx.cpus_to_kick) {
+ struct rq *rq = cpu_rq(cpu);
+ unsigned long flags;
+
+ raw_spin_rq_lock_irqsave(rq, flags);
+
+ if (cpu_online(cpu) || cpu == this_cpu) {
+ if (cpumask_test_cpu(cpu, this_rq->scx.cpus_to_preempt) &&
+ rq->curr->sched_class == &ext_sched_class)
+ rq->curr->scx.slice = 0;
+ resched_curr(rq);
+ }
+
+ raw_spin_rq_unlock_irqrestore(rq, flags);
+ }
+
+ cpumask_clear(this_rq->scx.cpus_to_kick);
+ cpumask_clear(this_rq->scx.cpus_to_preempt);
+}
+
void __init init_sched_ext_class(void)
{
int cpu;
@@ -2695,6 +2731,10 @@ void __init init_sched_ext_class(void)

init_dsq(&rq->scx.local_dsq, SCX_DSQ_LOCAL);
INIT_LIST_HEAD(&rq->scx.watchdog_list);
+
+ BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_kick, GFP_KERNEL));
+ BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_preempt, GFP_KERNEL));
+ init_irq_work(&rq->scx.kick_cpus_irq_work, kick_cpus_irq_workfn);
}

register_sysrq_key('S', &sysrq_sched_ext_reset_op);
@@ -2917,6 +2957,41 @@ static const struct btf_kfunc_id_set scx_kfunc_set_dispatch = {
.set = &scx_kfunc_ids_dispatch,
};

+/**
+ * scx_bpf_kick_cpu - Trigger reschedule on a CPU
+ * @cpu: cpu to kick
+ * @flags: SCX_KICK_* flags
+ *
+ * Kick @cpu into rescheduling. This can be used to wake up an idle CPU or
+ * trigger rescheduling on a busy CPU. This can be called from any online
+ * scx_ops operation and the actual kicking is performed asynchronously through
+ * an irq work.
+ */
+void scx_bpf_kick_cpu(s32 cpu, u64 flags)
+{
+ struct rq *rq;
+
+ if (!ops_cpu_valid(cpu)) {
+ scx_ops_error("invalid cpu %d", cpu);
+ return;
+ }
+
+ preempt_disable();
+ rq = this_rq();
+
+ /*
+ * Actual kicking is bounced to kick_cpus_irq_workfn() to avoid nesting
+ * rq locks. We can probably be smarter and avoid bouncing if called
+ * from ops which don't hold a rq lock.
+ */
+ cpumask_set_cpu(cpu, rq->scx.cpus_to_kick);
+ if (flags & SCX_KICK_PREEMPT)
+ cpumask_set_cpu(cpu, rq->scx.cpus_to_preempt);
+
+ irq_work_queue(&rq->scx.kick_cpus_irq_work);
+ preempt_enable();
+}
+
/**
* scx_bpf_dsq_nr_queued - Return the number of queued tasks
* @dsq_id: id of the DSQ
@@ -3138,6 +3213,7 @@ s32 scx_bpf_task_cpu(const struct task_struct *p)
}

BTF_SET8_START(scx_kfunc_ids_any)
+BTF_ID_FLAGS(func, scx_bpf_kick_cpu)
BTF_ID_FLAGS(func, scx_bpf_dsq_nr_queued)
BTF_ID_FLAGS(func, scx_bpf_test_and_clear_cpu_idle)
BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu, KF_TRUSTED_ARGS)
diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h
index a4fe649e649d..0b04626e8ca2 100644
--- a/kernel/sched/ext.h
+++ b/kernel/sched/ext.h
@@ -19,6 +19,14 @@ enum scx_enq_flags {

/* high 32bits are SCX specific */

+ /*
+ * Set the following to trigger preemption when calling
+ * scx_bpf_dispatch() with a local dsq as the target. The slice of the
+ * current task is cleared to zero and the CPU is kicked into the
+ * scheduling path. Implies %SCX_ENQ_HEAD.
+ */
+ SCX_ENQ_PREEMPT = 1LLU << 32,
+
/*
* The task being enqueued is the only task available for the cpu. By
* default, ext core keeps executing such tasks but when
@@ -51,6 +59,10 @@ enum scx_deq_flags {
SCX_DEQ_SLEEP = DEQUEUE_SLEEP,
};

+enum scx_kick_flags {
+ SCX_KICK_PREEMPT = 1LLU << 0, /* force scheduling on the CPU */
+};
+
#ifdef CONFIG_SCHED_CLASS_EXT

extern const struct sched_class ext_sched_class;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 112c2f127c95..3c16caecd3a5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -690,6 +690,9 @@ struct scx_rq {
struct list_head watchdog_list;
u64 ops_qseq;
u32 nr_running;
+ cpumask_var_t cpus_to_kick;
+ cpumask_var_t cpus_to_preempt;
+ struct irq_work kick_cpus_irq_work;
};
#endif /* CONFIG_SCHED_CLASS_EXT */

diff --git a/tools/sched_ext/.gitignore b/tools/sched_ext/.gitignore
index 6734f7fd9324..389f0e5b0970 100644
--- a/tools/sched_ext/.gitignore
+++ b/tools/sched_ext/.gitignore
@@ -1,5 +1,6 @@
scx_example_dummy
scx_example_qmap
+scx_example_central
*.skel.h
*.subskel.h
/tools/
diff --git a/tools/sched_ext/Makefile b/tools/sched_ext/Makefile
index 926b0a36c221..c6c3669c47b9 100644
--- a/tools/sched_ext/Makefile
+++ b/tools/sched_ext/Makefile
@@ -115,7 +115,7 @@ BPF_CFLAGS = -g -D__TARGET_ARCH_$(SRCARCH) \
-Wno-compare-distinct-pointer-types \
-O2 -mcpu=v3

-all: scx_example_dummy scx_example_qmap
+all: scx_example_dummy scx_example_qmap scx_example_central

# sort removes libbpf duplicates when not cross-building
MAKE_DIRS := $(sort $(BUILD_DIR)/libbpf $(HOST_BUILD_DIR)/libbpf \
@@ -174,10 +174,14 @@ scx_example_qmap: scx_example_qmap.c scx_example_qmap.skel.h user_exit_info.h
$(CC) $(CFLAGS) -c $< -o [email protected]
$(CC) -o $@ [email protected] $(HOST_BPFOBJ) $(LDFLAGS)

+scx_example_central: scx_example_central.c scx_example_central.skel.h user_exit_info.h
+ $(CC) $(CFLAGS) -c $< -o [email protected]
+ $(CC) -o $@ [email protected] $(HOST_BPFOBJ) $(LDFLAGS)
+
clean:
rm -rf $(SCRATCH_DIR) $(HOST_SCRATCH_DIR)
rm -f *.o *.bpf.o *.skel.h *.subskel.h
- rm -f scx_example_dummy scx_example_qmap
+ rm -f scx_example_dummy scx_example_qmap scx_example_central

.PHONY: all clean

diff --git a/tools/sched_ext/scx_common.bpf.h b/tools/sched_ext/scx_common.bpf.h
index fec19e8d0681..ff32e4dd30a6 100644
--- a/tools/sched_ext/scx_common.bpf.h
+++ b/tools/sched_ext/scx_common.bpf.h
@@ -71,6 +71,7 @@ s32 scx_bpf_create_dsq(u64 dsq_id, s32 node) __ksym;
bool scx_bpf_consume(u64 dsq_id) __ksym;
u32 scx_bpf_dispatch_nr_slots(void) __ksym;
void scx_bpf_dispatch(struct task_struct *p, u64 dsq_id, u64 slice, u64 enq_flags) __ksym;
+void scx_bpf_kick_cpu(s32 cpu, u64 flags) __ksym;
s32 scx_bpf_dsq_nr_queued(u64 dsq_id) __ksym;
bool scx_bpf_test_and_clear_cpu_idle(s32 cpu) __ksym;
s32 scx_bpf_pick_idle_cpu(const cpumask_t *cpus_allowed) __ksym;
diff --git a/tools/sched_ext/scx_example_central.bpf.c b/tools/sched_ext/scx_example_central.bpf.c
new file mode 100644
index 000000000000..6b246bf308d0
--- /dev/null
+++ b/tools/sched_ext/scx_example_central.bpf.c
@@ -0,0 +1,231 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * A central FIFO sched_ext scheduler which demonstrates the followings:
+ *
+ * a. Making all scheduling decisions from one CPU:
+ *
+ * The central CPU is the only one making scheduling decisions. All other
+ * CPUs kick the central CPU when they run out of tasks to run.
+ *
+ * There is one global BPF queue and the central CPU schedules all CPUs by
+ * dispatching from the global queue to each CPU's local dsq from dispatch().
+ * This isn't the most straightforward. e.g. It'd be easier to bounce
+ * through per-CPU BPF queues. The current design is chosen to maximally
+ * utilize and verify various scx mechanisms such as LOCAL_ON dispatching.
+ *
+ * b. Preemption
+ *
+ * SCX_KICK_PREEMPT is used to trigger scheduling and CPUs to move to the
+ * next tasks.
+ *
+ * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2022 Tejun Heo <[email protected]>
+ * Copyright (c) 2022 David Vernet <[email protected]>
+ */
+#include "scx_common.bpf.h"
+
+char _license[] SEC("license") = "GPL";
+
+enum {
+ FALLBACK_DSQ_ID = 0,
+ MAX_CPUS = 4096,
+ MS_TO_NS = 1000LLU * 1000,
+ TIMER_INTERVAL_NS = 1 * MS_TO_NS,
+};
+
+const volatile bool switch_all;
+const volatile s32 central_cpu;
+const volatile u32 nr_cpu_ids;
+
+u64 nr_total, nr_locals, nr_queued, nr_lost_pids;
+u64 nr_dispatches, nr_mismatches, nr_retries;
+u64 nr_overflows;
+
+struct user_exit_info uei;
+
+struct {
+ __uint(type, BPF_MAP_TYPE_QUEUE);
+ __uint(max_entries, 4096);
+ __type(value, s32);
+} central_q SEC(".maps");
+
+/* can't use percpu map due to bad lookups */
+static bool cpu_gimme_task[MAX_CPUS];
+
+struct central_timer {
+ struct bpf_timer timer;
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __uint(max_entries, 1);
+ __type(key, u32);
+ __type(value, struct central_timer);
+} central_timer SEC(".maps");
+
+static bool vtime_before(u64 a, u64 b)
+{
+ return (s64)(a - b) < 0;
+}
+
+s32 BPF_STRUCT_OPS(central_select_cpu, struct task_struct *p,
+ s32 prev_cpu, u64 wake_flags)
+{
+ /*
+ * Steer wakeups to the central CPU as much as possible to avoid
+ * disturbing other CPUs. It's safe to blindly return the central cpu as
+ * select_cpu() is a hint and if @p can't be on it, the kernel will
+ * automatically pick a fallback CPU.
+ */
+ return central_cpu;
+}
+
+void BPF_STRUCT_OPS(central_enqueue, struct task_struct *p, u64 enq_flags)
+{
+ s32 pid = p->pid;
+
+ __sync_fetch_and_add(&nr_total, 1);
+
+ if (bpf_map_push_elem(&central_q, &pid, 0)) {
+ __sync_fetch_and_add(&nr_overflows, 1);
+ scx_bpf_dispatch(p, FALLBACK_DSQ_ID, SCX_SLICE_DFL, enq_flags);
+ return;
+ }
+
+ __sync_fetch_and_add(&nr_queued, 1);
+
+ if (!scx_bpf_task_running(p))
+ scx_bpf_kick_cpu(central_cpu, SCX_KICK_PREEMPT);
+}
+
+static int dispatch_a_task_loopfn(u32 idx, void *data)
+{
+ s32 cpu = *(s32 *)data;
+ s32 pid;
+ struct task_struct *p;
+ bool *gimme;
+
+ if (bpf_map_pop_elem(&central_q, &pid))
+ return 1;
+
+ __sync_fetch_and_sub(&nr_queued, 1);
+
+ p = bpf_task_from_pid(pid);
+ if (!p) {
+ __sync_fetch_and_add(&nr_lost_pids, 1);
+ return 0;
+ }
+
+ /*
+ * If we can't run the task at the top, do the dumb thing and bounce it
+ * to the fallback dsq.
+ */
+ if (!bpf_cpumask_test_cpu(cpu, p->cpus_ptr)) {
+ __sync_fetch_and_add(&nr_mismatches, 1);
+ scx_bpf_dispatch(p, FALLBACK_DSQ_ID, SCX_SLICE_DFL, 0);
+ bpf_task_release(p);
+ return 0;
+ }
+
+ /* dispatch to the local and mark that @cpu doesn't need more tasks */
+ scx_bpf_dispatch(p, SCX_DSQ_LOCAL_ON | cpu, SCX_SLICE_DFL, 0);
+
+ if (cpu != central_cpu)
+ scx_bpf_kick_cpu(cpu, 0);
+
+ gimme = MEMBER_VPTR(cpu_gimme_task, [cpu]);
+ if (gimme)
+ *gimme = false;
+
+ bpf_task_release(p);
+ return 1;
+}
+
+static int dispatch_to_one_cpu_loopfn(u32 idx, void *data)
+{
+ s32 cpu = idx;
+
+ if (!scx_bpf_dispatch_nr_slots())
+ return 1;
+
+ if (cpu >= 0 && cpu < MAX_CPUS) {
+ bool *gimme = MEMBER_VPTR(cpu_gimme_task, [cpu]);
+ if (gimme && !*gimme)
+ return 0;
+ }
+
+ bpf_loop(1 << 23, dispatch_a_task_loopfn, &cpu, 0);
+ return 0;
+}
+
+void BPF_STRUCT_OPS(central_dispatch, s32 cpu, struct task_struct *prev)
+{
+ if (cpu == central_cpu) {
+ /* dispatch for all other CPUs first */
+ __sync_fetch_and_add(&nr_dispatches, 1);
+ bpf_loop(nr_cpu_ids, dispatch_to_one_cpu_loopfn, NULL, 0);
+
+ /*
+ * Retry if we ran out of dispatch buffer slots as we might have
+ * skipped some CPUs and also need to dispatch for self. The ext
+ * core automatically retries if the local dsq is empty but we
+ * can't rely on that as we're dispatching for other CPUs too.
+ * Kick self explicitly to retry.
+ */
+ if (!scx_bpf_dispatch_nr_slots()) {
+ __sync_fetch_and_add(&nr_retries, 1);
+ scx_bpf_kick_cpu(central_cpu, SCX_KICK_PREEMPT);
+ return;
+ }
+
+ /* look for a task to run on the central CPU */
+ if (scx_bpf_consume(FALLBACK_DSQ_ID))
+ return;
+ bpf_loop(1 << 23, dispatch_a_task_loopfn, &cpu, 0);
+ } else {
+ bool *gimme;
+
+ if (scx_bpf_consume(FALLBACK_DSQ_ID))
+ return;
+
+ gimme = MEMBER_VPTR(cpu_gimme_task, [cpu]);
+ if (gimme)
+ *gimme = true;
+
+ /*
+ * Force dispatch on the scheduling CPU so that it finds a task
+ * to run for us.
+ */
+ scx_bpf_kick_cpu(central_cpu, SCX_KICK_PREEMPT);
+ }
+}
+
+int BPF_STRUCT_OPS_SLEEPABLE(central_init)
+{
+ if (switch_all)
+ scx_bpf_switch_all();
+
+ return scx_bpf_create_dsq(FALLBACK_DSQ_ID, -1);
+}
+
+void BPF_STRUCT_OPS(central_exit, struct scx_exit_info *ei)
+{
+ uei_record(&uei, ei);
+}
+
+SEC(".struct_ops")
+struct sched_ext_ops central_ops = {
+ /*
+ * We are offloading all scheduling decisions to the central CPU and
+ * thus being the last task on a given CPU doesn't mean anything
+ * special. Enqueue the last tasks like any other tasks.
+ */
+ .flags = SCX_OPS_ENQ_LAST,
+
+ .select_cpu = (void *)central_select_cpu,
+ .enqueue = (void *)central_enqueue,
+ .dispatch = (void *)central_dispatch,
+ .init = (void *)central_init,
+ .exit = (void *)central_exit,
+ .name = "central",
+};
diff --git a/tools/sched_ext/scx_example_central.c b/tools/sched_ext/scx_example_central.c
new file mode 100644
index 000000000000..14f6598c03df
--- /dev/null
+++ b/tools/sched_ext/scx_example_central.c
@@ -0,0 +1,93 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2022 Tejun Heo <[email protected]>
+ * Copyright (c) 2022 David Vernet <[email protected]>
+ */
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <unistd.h>
+#include <signal.h>
+#include <assert.h>
+#include <libgen.h>
+#include <bpf/bpf.h>
+#include "user_exit_info.h"
+#include "scx_example_central.skel.h"
+
+const char help_fmt[] =
+"A central FIFO sched_ext scheduler.\n"
+"\n"
+"See the top-level comment in .bpf.c for more details.\n"
+"\n"
+"Usage: %s [-a] [-c CPU]\n"
+"\n"
+" -a Switch all tasks\n"
+" -c CPU Override the central CPU (default: 0)\n"
+" -h Display this help and exit\n";
+
+static volatile int exit_req;
+
+static void sigint_handler(int dummy)
+{
+ exit_req = 1;
+}
+
+int main(int argc, char **argv)
+{
+ struct scx_example_central *skel;
+ struct bpf_link *link;
+ u64 seq = 0;
+ s32 opt;
+
+ signal(SIGINT, sigint_handler);
+ signal(SIGTERM, sigint_handler);
+
+ libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
+
+ skel = scx_example_central__open();
+ assert(skel);
+
+ skel->rodata->central_cpu = 0;
+ skel->rodata->nr_cpu_ids = libbpf_num_possible_cpus();
+
+ while ((opt = getopt(argc, argv, "ahc:")) != -1) {
+ switch (opt) {
+ case 'a':
+ skel->rodata->switch_all = true;
+ break;
+ case 'c':
+ skel->rodata->central_cpu = strtoul(optarg, NULL, 0);
+ break;
+ default:
+ fprintf(stderr, help_fmt, basename(argv[0]));
+ return opt != 'h';
+ }
+ }
+
+ assert(!scx_example_central__load(skel));
+
+ link = bpf_map__attach_struct_ops(skel->maps.central_ops);
+ assert(link);
+
+ while (!exit_req && !uei_exited(&skel->bss->uei)) {
+ printf("[SEQ %lu]\n", seq++);
+ printf("total :%10lu local:%10lu queued:%10lu lost:%10lu\n",
+ skel->bss->nr_total,
+ skel->bss->nr_locals,
+ skel->bss->nr_queued,
+ skel->bss->nr_lost_pids);
+ printf(" dispatch:%10lu mismatch:%10lu retry:%10lu\n",
+ skel->bss->nr_dispatches,
+ skel->bss->nr_mismatches,
+ skel->bss->nr_retries);
+ printf("overflow:%10lu\n",
+ skel->bss->nr_overflows);
+ fflush(stdout);
+ sleep(1);
+ }
+
+ bpf_link__destroy(link);
+ uei_print(&skel->bss->uei);
+ scx_example_central__destroy(skel);
+ return 0;
+}
--
2.39.1


2023-01-28 00:19:38

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 21/30] sched_ext: Add task state tracking operations

Being able to track the task runnable and running state transitions are
useful for a variety of purposes including latency tracking and load factor
calculation.

Currently, BPF schedulers don't have a good way of tracking these
transitions. Becoming runnable can be determined from ops.enqueue() but
becoming quiescent can only be inferred from the lack of subsequent enqueue.
Also, as the local dsq can have multiple tasks and some events are handled
in the sched_ext core, it's difficult to determine when a given task starts
and stops executing.

This patch adds sched_ext_ops.runnable(), .running(), .stopping() and
.quiescent() operations to track the task runnable and running state
transitions. They're mostly self explanatory; however, we want to ensure
that running <-> stopping transitions are always contained within runnable
<-> quiescent transitions which is a bit different from how the scheduler
core behaves. This adds a bit of complication. See the comment in
dequeue_task_scx().

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
include/linux/sched/ext.h | 65 +++++++++++++++++++++++++++++++++++++++
kernel/sched/ext.c | 31 +++++++++++++++++++
2 files changed, 96 insertions(+)

diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index 10cd3ede5ae5..338b41cd79fa 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -193,6 +193,71 @@ struct sched_ext_ops {
*/
void (*dispatch)(s32 cpu, struct task_struct *prev);

+ /**
+ * runnable - A task is becoming runnable on its associated CPU
+ * @p: task becoming runnable
+ * @enq_flags: %SCX_ENQ_*
+ *
+ * This and the following three functions can be used to track a task's
+ * execution state transitions. A task becomes ->runnable() on a CPU,
+ * and then goes through one or more ->running() and ->stopping() pairs
+ * as it runs on the CPU, and eventually becomes ->quiescent() when it's
+ * done running on the CPU.
+ *
+ * @p is becoming runnable on the CPU because it's
+ *
+ * - waking up (%SCX_ENQ_WAKEUP)
+ * - being moved from another CPU
+ * - being restored after temporarily taken off the queue for an
+ * attribute change.
+ *
+ * This and ->enqueue() are related but not coupled. This operation
+ * notifies @p's state transition and may not be followed by ->enqueue()
+ * e.g. when @p is being dispatched to a remote CPU. Likewise, a task
+ * may be ->enqueue()'d without being preceded by this operation e.g.
+ * after exhausting its slice.
+ */
+ void (*runnable)(struct task_struct *p, u64 enq_flags);
+
+ /**
+ * running - A task is starting to run on its associated CPU
+ * @p: task starting to run
+ *
+ * See ->runnable() for explanation on the task state notifiers.
+ */
+ void (*running)(struct task_struct *p);
+
+ /**
+ * stopping - A task is stopping execution
+ * @p: task stopping to run
+ * @runnable: is task @p still runnable?
+ *
+ * See ->runnable() for explanation on the task state notifiers. If
+ * !@runnable, ->quiescent() will be invoked after this operation
+ * returns.
+ */
+ void (*stopping)(struct task_struct *p, bool runnable);
+
+ /**
+ * quiescent - A task is becoming not runnable on its associated CPU
+ * @p: task becoming not runnable
+ * @deq_flags: %SCX_DEQ_*
+ *
+ * See ->runnable() for explanation on the task state notifiers.
+ *
+ * @p is becoming quiescent on the CPU because it's
+ *
+ * - sleeping (%SCX_DEQ_SLEEP)
+ * - being moved to another CPU
+ * - being temporarily taken off the queue for an attribute change
+ * (%SCX_DEQ_SAVE)
+ *
+ * This and ->dequeue() are related but not coupled. This operation
+ * notifies @p's state transition and may not be preceded by ->dequeue()
+ * e.g. when @p is being dispatched to a remote CPU.
+ */
+ void (*quiescent)(struct task_struct *p, u64 deq_flags);
+
/**
* yield - Yield CPU
* @from: yielding task
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 9bc625676bbc..4acaf39ea879 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -743,6 +743,9 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags
rq->scx.nr_running++;
add_nr_running(rq, 1);

+ if (SCX_HAS_OP(runnable))
+ scx_ops.runnable(p, enq_flags);
+
do_enqueue_task(rq, p, enq_flags, sticky_cpu);
}

@@ -803,6 +806,26 @@ static void dequeue_task_scx(struct rq *rq, struct task_struct *p, int deq_flags

ops_dequeue(p, deq_flags);

+ /*
+ * A currently running task which is going off @rq first gets dequeued
+ * and then stops running. As we want running <-> stopping transitions
+ * to be contained within runnable <-> quiescent transitions, trigger
+ * ->stopping() early here instead of in put_prev_task_scx().
+ *
+ * @p may go through multiple stopping <-> running transitions between
+ * here and put_prev_task_scx() if task attribute changes occur while
+ * balance_scx() leaves @rq unlocked. However, they don't contain any
+ * information meaningful to the BPF scheduler and can be suppressed by
+ * skipping the callbacks if the task is !QUEUED.
+ */
+ if (SCX_HAS_OP(stopping) && task_current(rq, p)) {
+ update_curr_scx(rq);
+ scx_ops.stopping(p, false);
+ }
+
+ if (SCX_HAS_OP(quiescent))
+ scx_ops.quiescent(p, deq_flags);
+
if (deq_flags & SCX_DEQ_SLEEP)
p->scx.flags |= SCX_TASK_DEQD_FOR_SLEEP;
else
@@ -1328,6 +1351,10 @@ static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first)

p->se.exec_start = rq_clock_task(rq);

+ /* see dequeue_task_scx() on why we skip when !QUEUED */
+ if (SCX_HAS_OP(running) && (p->scx.flags & SCX_TASK_QUEUED))
+ scx_ops.running(p);
+
watchdog_unwatch_task(p, true);
}

@@ -1366,6 +1393,10 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p)

update_curr_scx(rq);

+ /* see dequeue_task_scx() on why we skip when !QUEUED */
+ if (SCX_HAS_OP(stopping) && (p->scx.flags & SCX_TASK_QUEUED))
+ scx_ops.stopping(p, true);
+
/*
* If we're being called from put_prev_task_balance(), balance_scx() may
* have decided that @p should keep running.
--
2.39.1


2023-01-28 00:19:35

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 18/30] sched_ext: Allow BPF schedulers to switch all eligible tasks into sched_ext

Currently, to use sched_ext, each task has to be put into sched_ext using
sched_setscheduler(2). However, some BPF schedulers and use cases might
prefer to service all eligible tasks.

This patch adds a new kfunc helper, scx_bpf_switch_all(), that BPF
schedulers can call from ops.init() to switch all SCHED_NORMAL, SCHED_BATCH
and SCHED_IDLE tasks into sched_ext. This has the benefit that the scheduler
swaps are transparent to the users and applications. As we know that CFS is
not being used when scx_bpf_switch_all() is used, we can also disable hot
path entry points with static_branches.

Both the dummy and qmap example schedulers are updated with the '-a' option
which enables the switch_all behavior.

Signed-off-by: Tejun Heo <[email protected]>
Suggested-by: Barret Rhoden <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
kernel/sched/core.c | 8 +++--
kernel/sched/ext.c | 45 +++++++++++++++++++++++++
kernel/sched/ext.h | 5 +++
tools/sched_ext/scx_common.bpf.h | 1 +
tools/sched_ext/scx_example_dummy.bpf.c | 11 ++++++
tools/sched_ext/scx_example_dummy.c | 8 +++--
tools/sched_ext/scx_example_qmap.bpf.c | 9 +++++
tools/sched_ext/scx_example_qmap.c | 7 ++--
8 files changed, 87 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9e566e72c3f2..5b68b822312b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1204,7 +1204,7 @@ bool sched_can_stop_tick(struct rq *rq)
* if there's more than one we need the tick for involuntary
* preemption.
*/
- if (rq->nr_running > 1)
+ if (!scx_switched_all() && rq->nr_running > 1)
return false;

return true;
@@ -5590,8 +5590,10 @@ void scheduler_tick(void)
perf_event_task_tick();

#ifdef CONFIG_SMP
- rq->idle_balance = idle_cpu(cpu);
- trigger_load_balance(rq);
+ if (!scx_switched_all()) {
+ rq->idle_balance = idle_cpu(cpu);
+ trigger_load_balance(rq);
+ }
#endif
}

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index b9d55c25cec9..63f0a3cf2d53 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -73,6 +73,10 @@ static DEFINE_MUTEX(scx_ops_enable_mutex);
DEFINE_STATIC_KEY_FALSE(__scx_ops_enabled);
DEFINE_STATIC_PERCPU_RWSEM(scx_fork_rwsem);
static atomic_t scx_ops_enable_state_var = ATOMIC_INIT(SCX_OPS_DISABLED);
+static bool scx_switch_all_req;
+static bool scx_switching_all;
+DEFINE_STATIC_KEY_FALSE(__scx_switched_all);
+
static struct sched_ext_ops scx_ops;
static bool scx_warned_zero_slice;

@@ -1966,6 +1970,8 @@ bool task_on_scx(struct task_struct *p)
{
if (!scx_enabled() || scx_ops_disabling())
return false;
+ if (READ_ONCE(scx_switching_all))
+ return true;
return p->policy == SCHED_EXT;
}

@@ -2092,6 +2098,9 @@ static void scx_ops_disable_workfn(struct kthread_work *work)
*/
mutex_lock(&scx_ops_enable_mutex);

+ static_branch_disable(&__scx_switched_all);
+ WRITE_ONCE(scx_switching_all, false);
+
/* avoid racing against fork */
cpus_read_lock();
percpu_down_write(&scx_fork_rwsem);
@@ -2276,6 +2285,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops)
*/
cpus_read_lock();

+ scx_switch_all_req = false;
if (scx_ops.init) {
ret = SCX_CALL_OP_RET(SCX_KF_INIT | SCX_KF_SLEEPABLE, init);
if (ret) {
@@ -2391,6 +2401,8 @@ static int scx_ops_enable(struct sched_ext_ops *ops)
* transitions here are synchronized against sched_ext_free() through
* scx_tasks_lock.
*/
+ WRITE_ONCE(scx_switching_all, scx_switch_all_req);
+
scx_task_iter_init(&sti);
while ((p = scx_task_iter_next_filtered_locked(&sti))) {
if (READ_ONCE(p->__state) != TASK_DEAD) {
@@ -2422,6 +2434,9 @@ static int scx_ops_enable(struct sched_ext_ops *ops)
goto err_disable_unlock;
}

+ if (scx_switch_all_req)
+ static_branch_enable_cpuslocked(&__scx_switched_all);
+
cpus_read_unlock();
mutex_unlock(&scx_ops_enable_mutex);

@@ -2456,6 +2471,9 @@ static int scx_debug_show(struct seq_file *m, void *v)
mutex_lock(&scx_ops_enable_mutex);
seq_printf(m, "%-30s: %s\n", "ops", scx_ops.name);
seq_printf(m, "%-30s: %ld\n", "enabled", scx_enabled());
+ seq_printf(m, "%-30s: %d\n", "switching_all",
+ READ_ONCE(scx_switching_all));
+ seq_printf(m, "%-30s: %ld\n", "switched_all", scx_switched_all());
seq_printf(m, "%-30s: %s\n", "enable_state",
scx_ops_enable_state_str[scx_ops_enable_state()]);
seq_printf(m, "%-30s: %llu\n", "nr_rejected",
@@ -2694,6 +2712,31 @@ __diag_push();
__diag_ignore_all("-Wmissing-prototypes",
"Global functions as their definitions will be in vmlinux BTF");

+/**
+ * scx_bpf_switch_all - Switch all tasks into SCX
+ * @into_scx: switch direction
+ *
+ * If @into_scx is %true, all existing and future non-dl/rt tasks are switched
+ * to SCX. If %false, only tasks which have %SCHED_EXT explicitly set are put on
+ * SCX. The actual switching is asynchronous. Can be called from ops.init().
+ */
+void scx_bpf_switch_all(void)
+{
+ if (!scx_kf_allowed(SCX_KF_INIT))
+ return;
+
+ scx_switch_all_req = true;
+}
+
+BTF_SET8_START(scx_kfunc_ids_init)
+BTF_ID_FLAGS(func, scx_bpf_switch_all)
+BTF_SET8_END(scx_kfunc_ids_init)
+
+static const struct btf_kfunc_id_set scx_kfunc_set_init = {
+ .owner = THIS_MODULE,
+ .set = &scx_kfunc_ids_init,
+};
+
/**
* scx_bpf_create_dsq - Create a custom DSQ
* @dsq_id: DSQ to create
@@ -3131,6 +3174,8 @@ static int __init register_ext_kfuncs(void)
* check using scx_kf_allowed().
*/
if ((ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
+ &scx_kfunc_set_init)) ||
+ (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
&scx_kfunc_set_sleepable)) ||
(ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
&scx_kfunc_set_enqueue_dispatch)) ||
diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h
index 76c94babd19e..a4fe649e649d 100644
--- a/kernel/sched/ext.h
+++ b/kernel/sched/ext.h
@@ -60,7 +60,9 @@ extern unsigned long scx_watchdog_timeout;
extern unsigned long scx_watchdog_timestamp;

DECLARE_STATIC_KEY_FALSE(__scx_ops_enabled);
+DECLARE_STATIC_KEY_FALSE(__scx_switched_all);
#define scx_enabled() static_branch_unlikely(&__scx_ops_enabled)
+#define scx_switched_all() static_branch_unlikely(&__scx_switched_all)

bool task_on_scx(struct task_struct *p);
void scx_pre_fork(struct task_struct *p);
@@ -95,6 +97,8 @@ static inline void scx_notify_sched_tick(void)
static inline const struct sched_class *next_active_class(const struct sched_class *class)
{
class++;
+ if (scx_switched_all() && class == &fair_sched_class)
+ class++;
if (!scx_enabled() && class == &ext_sched_class)
class++;
return class;
@@ -117,6 +121,7 @@ static inline const struct sched_class *next_active_class(const struct sched_cla
#else /* CONFIG_SCHED_CLASS_EXT */

#define scx_enabled() false
+#define scx_switched_all() false

static inline void scx_pre_fork(struct task_struct *p) {}
static inline int scx_fork(struct task_struct *p) { return 0; }
diff --git a/tools/sched_ext/scx_common.bpf.h b/tools/sched_ext/scx_common.bpf.h
index b40a4fc6a159..fec19e8d0681 100644
--- a/tools/sched_ext/scx_common.bpf.h
+++ b/tools/sched_ext/scx_common.bpf.h
@@ -66,6 +66,7 @@ bool bpf_cpumask_and(struct bpf_cpumask *dst, const struct cpumask *src1,
const struct cpumask *src2) __ksym;
u32 bpf_cpumask_first(const struct cpumask *cpumask) __ksym;

+void scx_bpf_switch_all(void) __ksym;
s32 scx_bpf_create_dsq(u64 dsq_id, s32 node) __ksym;
bool scx_bpf_consume(u64 dsq_id) __ksym;
u32 scx_bpf_dispatch_nr_slots(void) __ksym;
diff --git a/tools/sched_ext/scx_example_dummy.bpf.c b/tools/sched_ext/scx_example_dummy.bpf.c
index ac7b490b5a39..28251373d1c3 100644
--- a/tools/sched_ext/scx_example_dummy.bpf.c
+++ b/tools/sched_ext/scx_example_dummy.bpf.c
@@ -7,6 +7,7 @@
*
* - Statistics tracking how many are queued to local and global dsq's.
* - Termination notification for userspace.
+ * - Support for switch_all.
*
* Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
* Copyright (c) 2022 Tejun Heo <[email protected]>
@@ -16,6 +17,8 @@

char _license[] SEC("license") = "GPL";

+const volatile bool switch_all;
+
struct user_exit_info uei;

struct {
@@ -32,6 +35,13 @@ static void stat_inc(u32 idx)
(*cnt_p)++;
}

+s32 BPF_STRUCT_OPS(dummy_init)
+{
+ if (switch_all)
+ scx_bpf_switch_all();
+ return 0;
+}
+
void BPF_STRUCT_OPS(dummy_enqueue, struct task_struct *p, u64 enq_flags)
{
if (enq_flags & SCX_ENQ_LOCAL) {
@@ -51,6 +61,7 @@ void BPF_STRUCT_OPS(dummy_exit, struct scx_exit_info *ei)
SEC(".struct_ops")
struct sched_ext_ops dummy_ops = {
.enqueue = (void *)dummy_enqueue,
+ .init = (void *)dummy_init,
.exit = (void *)dummy_exit,
.name = "dummy",
};
diff --git a/tools/sched_ext/scx_example_dummy.c b/tools/sched_ext/scx_example_dummy.c
index 72881c881830..9229973e8698 100644
--- a/tools/sched_ext/scx_example_dummy.c
+++ b/tools/sched_ext/scx_example_dummy.c
@@ -19,8 +19,9 @@ const char help_fmt[] =
"\n"
"See the top-level comment in .bpf.c for more details.\n"
"\n"
-"Usage: %s\n"
+"Usage: %s [-a]\n"
"\n"
+" -a Switch all tasks\n"
" -h Display this help and exit\n";

static volatile int exit_req;
@@ -64,8 +65,11 @@ int main(int argc, char **argv)
skel = scx_example_dummy__open();
assert(skel);

- while ((opt = getopt(argc, argv, "h")) != -1) {
+ while ((opt = getopt(argc, argv, "ah")) != -1) {
switch (opt) {
+ case 'a':
+ skel->rodata->switch_all = true;
+ break;
default:
fprintf(stderr, help_fmt, basename(argv[0]));
return opt != 'h';
diff --git a/tools/sched_ext/scx_example_qmap.bpf.c b/tools/sched_ext/scx_example_qmap.bpf.c
index 46bc16ed301f..ec8c4ee7ef16 100644
--- a/tools/sched_ext/scx_example_qmap.bpf.c
+++ b/tools/sched_ext/scx_example_qmap.bpf.c
@@ -22,6 +22,7 @@
char _license[] SEC("license") = "GPL";

const volatile u64 slice_ns = SCX_SLICE_DFL;
+const volatile bool switch_all;
const volatile u32 stall_user_nth;
const volatile u32 stall_kernel_nth;
const volatile s32 disallow_tgid;
@@ -236,6 +237,13 @@ s32 BPF_STRUCT_OPS(qmap_prep_enable, struct task_struct *p,
return -ENOMEM;
}

+s32 BPF_STRUCT_OPS(qmap_init)
+{
+ if (switch_all)
+ scx_bpf_switch_all();
+ return 0;
+}
+
void BPF_STRUCT_OPS(qmap_exit, struct scx_exit_info *ei)
{
uei_record(&uei, ei);
@@ -248,6 +256,7 @@ struct sched_ext_ops qmap_ops = {
.dequeue = (void *)qmap_dequeue,
.dispatch = (void *)qmap_dispatch,
.prep_enable = (void *)qmap_prep_enable,
+ .init = (void *)qmap_init,
.exit = (void *)qmap_exit,
.timeout_ms = 5000U,
.name = "qmap",
diff --git a/tools/sched_ext/scx_example_qmap.c b/tools/sched_ext/scx_example_qmap.c
index dff9323dfd20..30633122e6d5 100644
--- a/tools/sched_ext/scx_example_qmap.c
+++ b/tools/sched_ext/scx_example_qmap.c
@@ -20,7 +20,7 @@ const char help_fmt[] =
"\n"
"See the top-level comment in .bpf.c for more details.\n"
"\n"
-"Usage: %s [-s SLICE_US] [-e COUNT] [-t COUNT] [-T COUNT] [-d PID]\n"
+"Usage: %s [-a] [-s SLICE_US] [-e COUNT] [-t COUNT] [-T COUNT] [-d PID]\n"
"\n"
" -s SLICE_US Override slice duration\n"
" -e COUNT Trigger scx_bpf_error() after COUNT enqueues\n"
@@ -50,8 +50,11 @@ int main(int argc, char **argv)
skel = scx_example_qmap__open();
assert(skel);

- while ((opt = getopt(argc, argv, "hs:e:t:T:d:")) != -1) {
+ while ((opt = getopt(argc, argv, "ahs:e:t:T:d:")) != -1) {
switch (opt) {
+ case 'a':
+ skel->rodata->switch_all = true;
+ break;
case 's':
skel->rodata->slice_ns = strtoull(optarg, NULL, 0) * 1000;
break;
--
2.39.1


2023-01-28 00:19:41

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 24/30] sched_ext: Implement SCX_KICK_WAIT

From: David Vernet <[email protected]>

If set when calling scx_bpf_kick_cpu(), the invoking CPU will busy wait for
the kicked cpu to enter the scheduler. This will be used to improve the
exclusion guarantees in scx_example_pair.

Signed-off-by: David Vernet <[email protected]>
Reviewed-by: Tejun Heo <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
kernel/sched/core.c | 4 +++-
kernel/sched/ext.c | 33 ++++++++++++++++++++++++++++++++-
kernel/sched/ext.h | 20 ++++++++++++++++++++
kernel/sched/sched.h | 2 ++
4 files changed, 57 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a2a241bc24e7..47334e428031 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5952,8 +5952,10 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)

for_each_active_class(class) {
p = class->pick_next_task(rq);
- if (p)
+ if (p) {
+ scx_notify_pick_next_task(rq, p, class);
return p;
+ }
}

BUG(); /* The idle class should always have a runnable task. */
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 31e48cff9be6..d1de6a44c4f5 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -126,6 +126,9 @@ static struct {
static bool __cacheline_aligned_in_smp scx_has_idle_cpus;
#endif /* CONFIG_SMP */

+/* for %SCX_KICK_WAIT */
+static u64 __percpu *scx_kick_cpus_pnt_seqs;
+
/*
* Direct dispatch marker.
*
@@ -3093,6 +3096,7 @@ static const struct sysrq_key_op sysrq_sched_ext_reset_op = {
static void kick_cpus_irq_workfn(struct irq_work *irq_work)
{
struct rq *this_rq = this_rq();
+ u64 *pseqs = this_cpu_ptr(scx_kick_cpus_pnt_seqs);
int this_cpu = cpu_of(this_rq);
int cpu;

@@ -3106,14 +3110,32 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work)
if (cpumask_test_cpu(cpu, this_rq->scx.cpus_to_preempt) &&
rq->curr->sched_class == &ext_sched_class)
rq->curr->scx.slice = 0;
+ pseqs[cpu] = rq->scx.pnt_seq;
resched_curr(rq);
+ } else {
+ cpumask_clear_cpu(cpu, this_rq->scx.cpus_to_wait);
}

raw_spin_rq_unlock_irqrestore(rq, flags);
}

+ for_each_cpu_andnot(cpu, this_rq->scx.cpus_to_wait,
+ cpumask_of(this_cpu)) {
+ /*
+ * Pairs with smp_store_release() issued by this CPU in
+ * scx_notify_pick_next_task() on the resched path.
+ *
+ * We busy-wait here to guarantee that no other task can be
+ * scheduled on our core before the target CPU has entered the
+ * resched path.
+ */
+ while (smp_load_acquire(&cpu_rq(cpu)->scx.pnt_seq) == pseqs[cpu])
+ cpu_relax();
+ }
+
cpumask_clear(this_rq->scx.cpus_to_kick);
cpumask_clear(this_rq->scx.cpus_to_preempt);
+ cpumask_clear(this_rq->scx.cpus_to_wait);
}

void __init init_sched_ext_class(void)
@@ -3127,7 +3149,7 @@ void __init init_sched_ext_class(void)
* through the generated vmlinux.h.
*/
WRITE_ONCE(v, SCX_WAKE_EXEC | SCX_ENQ_WAKEUP | SCX_DEQ_SLEEP |
- SCX_TG_ONLINE);
+ SCX_TG_ONLINE | SCX_KICK_PREEMPT);

BUG_ON(rhashtable_init(&dsq_hash, &dsq_hash_params));
init_dsq(&scx_dsq_global, SCX_DSQ_GLOBAL);
@@ -3135,6 +3157,12 @@ void __init init_sched_ext_class(void)
BUG_ON(!alloc_cpumask_var(&idle_masks.cpu, GFP_KERNEL));
BUG_ON(!alloc_cpumask_var(&idle_masks.smt, GFP_KERNEL));
#endif
+ scx_kick_cpus_pnt_seqs =
+ __alloc_percpu(sizeof(scx_kick_cpus_pnt_seqs[0]) *
+ num_possible_cpus(),
+ __alignof__(scx_kick_cpus_pnt_seqs[0]));
+ BUG_ON(!scx_kick_cpus_pnt_seqs);
+
for_each_possible_cpu(cpu) {
struct rq *rq = cpu_rq(cpu);

@@ -3143,6 +3171,7 @@ void __init init_sched_ext_class(void)

BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_kick, GFP_KERNEL));
BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_preempt, GFP_KERNEL));
+ BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_wait, GFP_KERNEL));
init_irq_work(&rq->scx.kick_cpus_irq_work, kick_cpus_irq_workfn);
}

@@ -3397,6 +3426,8 @@ void scx_bpf_kick_cpu(s32 cpu, u64 flags)
cpumask_set_cpu(cpu, rq->scx.cpus_to_kick);
if (flags & SCX_KICK_PREEMPT)
cpumask_set_cpu(cpu, rq->scx.cpus_to_preempt);
+ if (flags & SCX_KICK_WAIT)
+ cpumask_set_cpu(cpu, rq->scx.cpus_to_wait);

irq_work_queue(&rq->scx.kick_cpus_irq_work);
preempt_enable();
diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h
index 9a60b81d787e..39eb1b25ec99 100644
--- a/kernel/sched/ext.h
+++ b/kernel/sched/ext.h
@@ -66,6 +66,7 @@ enum scx_tg_flags {

enum scx_kick_flags {
SCX_KICK_PREEMPT = 1LLU << 0, /* force scheduling on the CPU */
+ SCX_KICK_WAIT = 1LLU << 1, /* wait for the CPU to be rescheduled */
};

#ifdef CONFIG_SCHED_CLASS_EXT
@@ -95,6 +96,22 @@ __printf(2, 3) void scx_ops_error_type(enum scx_exit_type type,
#define scx_ops_error(fmt, args...) \
scx_ops_error_type(SCX_EXIT_ERROR, fmt, ##args)

+static inline void scx_notify_pick_next_task(struct rq *rq,
+ const struct task_struct *p,
+ const struct sched_class *active)
+{
+#ifdef CONFIG_SMP
+ if (!scx_enabled())
+ return;
+ /*
+ * Pairs with the smp_load_acquire() issued by a CPU in
+ * kick_cpus_irq_workfn() who is waiting for this CPU to perform a
+ * resched.
+ */
+ smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
+#endif
+}
+
static inline void scx_notify_sched_tick(void)
{
unsigned long last_check;
@@ -149,6 +166,9 @@ static inline int scx_check_setscheduler(struct task_struct *p,
int policy) { return 0; }
static inline bool scx_can_stop_tick(struct rq *rq) { return true; }
static inline void init_sched_ext_class(void) {}
+static inline void scx_notify_pick_next_task(struct rq *rq,
+ const struct task_struct *p,
+ const struct sched_class *active) {}
static inline void scx_notify_sched_tick(void) {}

#define for_each_active_class for_each_class
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index abb8ec22b6ef..d31185ecd090 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -708,6 +708,8 @@ struct scx_rq {
u32 flags;
cpumask_var_t cpus_to_kick;
cpumask_var_t cpus_to_preempt;
+ cpumask_var_t cpus_to_wait;
+ u64 pnt_seq;
struct irq_work kick_cpus_irq_work;
};
#endif /* CONFIG_SCHED_CLASS_EXT */
--
2.39.1


2023-01-28 00:19:50

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 23/30] sched_ext: Add cgroup support

Add sched_ext_ops operations to init/exit cgroups, and track task migrations
and config changes. Because different BPF schedulers may implement different
subsets of CPU control features, allow BPF schedulers to pick which cgroup
interface files to enable using SCX_OPS_CGROUP_KNOB_* flags. For now, only
the weight knobs are supported but adding more should be straightforward.

While a BPF scheduler is being enabled and disabled, relevant cgroup
operations are locked out using scx_cgroup_rwsem. This avoids situations
like task prep taking place while the task is being moved across cgroups,
making things easier for BPF schedulers.

This patch also adds scx_example_pair which implements a variant of core
scheduling where a hyperthread pair only run tasks from the same cgroup. The
BPF scheduler achieves this by putting tasks into per-cgroup queues,
time-slicing the cgroup to run for each pair first, and then scheduling
within the cgroup. See the header comment in scx_example_pair.bpf.c for more
details.

Note that scx_example_pair's cgroup-boundary guarantee breaks down for tasks
running in higher priority scheduler classes. This will be addressed by a
followup patch which implements a mechanism to track CPU preemption.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
include/linux/sched/ext.h | 96 ++++-
init/Kconfig | 5 +
kernel/sched/core.c | 68 ++-
kernel/sched/ext.c | 341 ++++++++++++++-
kernel/sched/ext.h | 23 +
kernel/sched/sched.h | 12 +-
tools/sched_ext/.gitignore | 1 +
tools/sched_ext/Makefile | 9 +-
tools/sched_ext/scx_example_pair.bpf.c | 555 +++++++++++++++++++++++++
tools/sched_ext/scx_example_pair.c | 143 +++++++
tools/sched_ext/scx_example_pair.h | 10 +
11 files changed, 1239 insertions(+), 24 deletions(-)
create mode 100644 tools/sched_ext/scx_example_pair.bpf.c
create mode 100644 tools/sched_ext/scx_example_pair.c
create mode 100644 tools/sched_ext/scx_example_pair.h

diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index 8f19ee7e5433..11d6902e717d 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -12,6 +12,8 @@
#include <linux/rhashtable.h>
#include <linux/llist.h>

+struct cgroup;
+
enum scx_consts {
SCX_OPS_NAME_LEN = 128,
SCX_EXIT_REASON_LEN = 128,
@@ -109,14 +111,27 @@ enum scx_ops_flags {
*/
SCX_OPS_ENQ_EXITING = 1LLU << 2,

+ /*
+ * CPU cgroup knob enable flags
+ */
+ SCX_OPS_CGROUP_KNOB_WEIGHT = 1LLU << 16, /* cpu.weight */
+
SCX_OPS_ALL_FLAGS = SCX_OPS_KEEP_BUILTIN_IDLE |
SCX_OPS_ENQ_LAST |
- SCX_OPS_ENQ_EXITING,
+ SCX_OPS_ENQ_EXITING |
+ SCX_OPS_CGROUP_KNOB_WEIGHT,
};

/* argument container for ops.enable() and friends */
struct scx_enable_args {
- /* empty for now */
+ /* the cgroup the task is joining */
+ struct cgroup *cgroup;
+};
+
+/* argument container for ops->cgroup_init() */
+struct scx_cgroup_init_args {
+ /* the weight of the cgroup [1..10000] */
+ u32 weight;
};

/**
@@ -325,7 +340,8 @@ struct sched_ext_ops {
* @p: task to enable BPF scheduling for
* @args: enable arguments, see the struct definition
*
- * Enable @p for BPF scheduling. @p will start running soon.
+ * Enable @p for BPF scheduling. @p is now in the cgroup specified for
+ * the preceding prep_enable() and will start running soon.
*/
void (*enable)(struct task_struct *p, struct scx_enable_args *args);

@@ -349,6 +365,77 @@ struct sched_ext_ops {
*/
void (*disable)(struct task_struct *p);

+ /**
+ * cgroup_init - Initialize a cgroup
+ * @cgrp: cgroup being initialized
+ * @args: init arguments, see the struct definition
+ *
+ * Either the BPF scheduler is being loaded or @cgrp created, initialize
+ * @cgrp for sched_ext. This operation may block.
+ *
+ * Return 0 for success, -errno for failure. An error return while
+ * loading will abort loading of the BPF scheduler. During cgroup
+ * creation, it will abort the specific cgroup creation.
+ */
+ s32 (*cgroup_init)(struct cgroup *cgrp,
+ struct scx_cgroup_init_args *args);
+
+ /**
+ * cgroup_exit - Exit a cgroup
+ * @cgrp: cgroup being exited
+ *
+ * Either the BPF scheduler is being unloaded or @cgrp destroyed, exit
+ * @cgrp for sched_ext. This operation my block.
+ */
+ void (*cgroup_exit)(struct cgroup *cgrp);
+
+ /**
+ * cgroup_prep_move - Prepare a task to be moved to a different cgroup
+ * @p: task being moved
+ * @from: cgroup @p is being moved from
+ * @to: cgroup @p is being moved to
+ *
+ * Prepare @p for move from cgroup @from to @to. This operation may
+ * block and can be used for allocations.
+ *
+ * Return 0 for success, -errno for failure. An error return aborts the
+ * migration.
+ */
+ s32 (*cgroup_prep_move)(struct task_struct *p,
+ struct cgroup *from, struct cgroup *to);
+
+ /**
+ * cgroup_move - Commit cgroup move
+ * @p: task being moved
+ * @from: cgroup @p is being moved from
+ * @to: cgroup @p is being moved to
+ *
+ * Commit the move. @p is dequeued during this operation.
+ */
+ void (*cgroup_move)(struct task_struct *p,
+ struct cgroup *from, struct cgroup *to);
+
+ /**
+ * cgroup_cancel_move - Cancel cgroup move
+ * @p: task whose cgroup move is being canceled
+ * @from: cgroup @p was being moved from
+ * @to: cgroup @p was being moved to
+ *
+ * @p was cgroup_prep_move()'d but failed before reaching cgroup_move().
+ * Undo the preparation.
+ */
+ void (*cgroup_cancel_move)(struct task_struct *p,
+ struct cgroup *from, struct cgroup *to);
+
+ /**
+ * cgroup_set_weight - A cgroup's weight is being changed
+ * @cgrp: cgroup whose weight is being updated
+ * @weight: new weight [1..10000]
+ *
+ * Update @tg's weight to @weight.
+ */
+ void (*cgroup_set_weight)(struct cgroup *cgrp, u32 weight);
+
/*
* All online ops must come before ops.init().
*/
@@ -483,6 +570,9 @@ struct sched_ext_entity {

/* cold fields */
struct list_head tasks_node;
+#ifdef CONFIG_EXT_GROUP_SCHED
+ struct cgroup *cgrp_moving_from;
+#endif
};

void sched_ext_free(struct task_struct *p);
diff --git a/init/Kconfig b/init/Kconfig
index c93f19dd872e..13faaed9dc97 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1039,6 +1039,11 @@ config RT_GROUP_SCHED
realtime bandwidth for them.
See Documentation/scheduler/sched-rt-group.rst for more information.

+config EXT_GROUP_SCHED
+ bool
+ depends on SCHED_CLASS_EXT && CGROUP_SCHED
+ default y
+
endif #CGROUP_SCHED

config UCLAMP_TASK_GROUP
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d2419bd4fb5e..a2a241bc24e7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9786,6 +9786,9 @@ void __init sched_init(void)
root_task_group.shares = ROOT_TASK_GROUP_LOAD;
init_cfs_bandwidth(&root_task_group.cfs_bandwidth);
#endif /* CONFIG_FAIR_GROUP_SCHED */
+#ifdef CONFIG_EXT_GROUP_SCHED
+ root_task_group.scx_weight = CGROUP_WEIGHT_DFL;
+#endif /* CONFIG_EXT_GROUP_SCHED */
#ifdef CONFIG_RT_GROUP_SCHED
root_task_group.rt_se = (struct sched_rt_entity **)ptr;
ptr += nr_cpu_ids * sizeof(void **);
@@ -10242,6 +10245,7 @@ struct task_group *sched_create_group(struct task_group *parent)
if (!alloc_rt_sched_group(tg, parent))
goto err;

+ scx_group_set_weight(tg, CGROUP_WEIGHT_DFL);
alloc_uclamp_sched_group(tg, parent);

return tg;
@@ -10345,6 +10349,7 @@ void sched_move_task(struct task_struct *tsk)
SCHED_CHANGE_BLOCK(rq, tsk,
DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK) {
sched_change_group(tsk);
+ scx_move_task(tsk);
}

/*
@@ -10381,6 +10386,11 @@ static int cpu_cgroup_css_online(struct cgroup_subsys_state *css)
{
struct task_group *tg = css_tg(css);
struct task_group *parent = css_tg(css->parent);
+ int ret;
+
+ ret = scx_tg_online(tg);
+ if (ret)
+ return ret;

if (parent)
sched_online_group(tg, parent);
@@ -10397,6 +10407,13 @@ static int cpu_cgroup_css_online(struct cgroup_subsys_state *css)
return 0;
}

+static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
+{
+ struct task_group *tg = css_tg(css);
+
+ scx_tg_offline(tg);
+}
+
static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
{
struct task_group *tg = css_tg(css);
@@ -10414,9 +10431,10 @@ static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
sched_unregister_group(tg);
}

-#ifdef CONFIG_RT_GROUP_SCHED
+#if defined(CONFIG_RT_GROUP_SCHED) || defined(CONFIG_EXT_GROUP_SCHED)
static int cpu_cgroup_can_attach(struct cgroup_taskset *tset)
{
+#ifdef CONFIG_RT_GROUP_SCHED
struct task_struct *task;
struct cgroup_subsys_state *css;

@@ -10424,7 +10442,8 @@ static int cpu_cgroup_can_attach(struct cgroup_taskset *tset)
if (!sched_rt_can_attach(css_tg(css), task))
return -EINVAL;
}
- return 0;
+#endif
+ return scx_cgroup_can_attach(tset);
}
#endif

@@ -10435,7 +10454,16 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)

cgroup_taskset_for_each(task, css, tset)
sched_move_task(task);
+
+ scx_cgroup_finish_attach();
+}
+
+#ifdef CONFIG_EXT_GROUP_SCHED
+static void cpu_cgroup_cancel_attach(struct cgroup_taskset *tset)
+{
+ scx_cgroup_cancel_attach(tset);
}
+#endif

#ifdef CONFIG_UCLAMP_TASK_GROUP
static void cpu_util_update_eff(struct cgroup_subsys_state *css)
@@ -10618,9 +10646,15 @@ static int cpu_uclamp_max_show(struct seq_file *sf, void *v)
static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
struct cftype *cftype, u64 shareval)
{
+ int ret;
+
if (shareval > scale_load_down(ULONG_MAX))
shareval = MAX_SHARES;
- return sched_group_set_shares(css_tg(css), scale_load(shareval));
+ ret = sched_group_set_shares(css_tg(css), scale_load(shareval));
+ if (!ret)
+ scx_group_set_weight(css_tg(css),
+ sched_weight_to_cgroup(shareval));
+ return ret;
}

static u64 cpu_shares_read_u64(struct cgroup_subsys_state *css,
@@ -11084,11 +11118,15 @@ static int cpu_extra_stat_show(struct seq_file *sf,
return 0;
}

-#ifdef CONFIG_FAIR_GROUP_SCHED
+#if defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_EXT_GROUP_SCHED)

static unsigned long tg_weight(struct task_group *tg)
{
+#ifdef CONFIG_FAIR_GROUP_SCHED
return scale_load_down(tg->shares);
+#else
+ return sched_weight_from_cgroup(tg->cgrp_weight);
+#endif
}

static u64 cpu_weight_read_u64(struct cgroup_subsys_state *css,
@@ -11101,13 +11139,17 @@ static int cpu_weight_write_u64(struct cgroup_subsys_state *css,
struct cftype *cft, u64 cgrp_weight)
{
unsigned long weight;
+ int ret;

if (cgrp_weight < CGROUP_WEIGHT_MIN || cgrp_weight > CGROUP_WEIGHT_MAX)
return -ERANGE;

weight = sched_weight_from_cgroup(cgrp_weight);

- return sched_group_set_shares(css_tg(css), scale_load(weight));
+ ret = sched_group_set_shares(css_tg(css), scale_load(weight));
+ if (!ret)
+ scx_group_set_weight(css_tg(css), cgrp_weight);
+ return ret;
}

static s64 cpu_weight_nice_read_s64(struct cgroup_subsys_state *css,
@@ -11132,7 +11174,7 @@ static int cpu_weight_nice_write_s64(struct cgroup_subsys_state *css,
struct cftype *cft, s64 nice)
{
unsigned long weight;
- int idx;
+ int idx, ret;

if (nice < MIN_NICE || nice > MAX_NICE)
return -ERANGE;
@@ -11141,7 +11183,11 @@ static int cpu_weight_nice_write_s64(struct cgroup_subsys_state *css,
idx = array_index_nospec(idx, 40);
weight = sched_prio_to_weight[idx];

- return sched_group_set_shares(css_tg(css), scale_load(weight));
+ ret = sched_group_set_shares(css_tg(css), scale_load(weight));
+ if (!ret)
+ scx_group_set_weight(css_tg(css),
+ sched_weight_to_cgroup(weight));
+ return ret;
}
#endif

@@ -11203,7 +11249,7 @@ static ssize_t cpu_max_write(struct kernfs_open_file *of,
#endif

struct cftype cpu_cftypes[CPU_CFTYPE_CNT + 1] = {
-#ifdef CONFIG_FAIR_GROUP_SCHED
+#if defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_EXT_GROUP_SCHED)
[CPU_CFTYPE_WEIGHT] = {
.name = "weight",
.flags = CFTYPE_NOT_ON_ROOT,
@@ -11257,13 +11303,17 @@ struct cftype cpu_cftypes[CPU_CFTYPE_CNT + 1] = {
struct cgroup_subsys cpu_cgrp_subsys = {
.css_alloc = cpu_cgroup_css_alloc,
.css_online = cpu_cgroup_css_online,
+ .css_offline = cpu_cgroup_css_offline,
.css_released = cpu_cgroup_css_released,
.css_free = cpu_cgroup_css_free,
.css_extra_stat_show = cpu_extra_stat_show,
-#ifdef CONFIG_RT_GROUP_SCHED
+#if defined(CONFIG_RT_GROUP_SCHED) || defined(CONFIG_EXT_GROUP_SCHED)
.can_attach = cpu_cgroup_can_attach,
#endif
.attach = cpu_cgroup_attach,
+#ifdef CONFIG_EXT_GROUP_SCHED
+ .cancel_attach = cpu_cgroup_cancel_attach,
+#endif
.legacy_cftypes = cpu_legacy_cftypes,
.dfl_cftypes = cpu_cftypes,
.early_init = true,
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 5e94f9604e23..31e48cff9be6 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1710,6 +1710,19 @@ static void task_tick_scx(struct rq *rq, struct task_struct *curr, int queued)
resched_curr(rq);
}

+static struct cgroup *tg_cgrp(struct task_group *tg)
+{
+ /*
+ * If CGROUP_SCHED is disabled, @tg is NULL. If @tg is an autogroup,
+ * @tg->css.cgroup is NULL. In both cases, @tg can be treated as the
+ * root cgroup.
+ */
+ if (tg && tg->css.cgroup)
+ return tg->css.cgroup;
+ else
+ return &cgrp_dfl_root.cgrp;
+}
+
static int scx_ops_prepare_task(struct task_struct *p, struct task_group *tg)
{
int ret;
@@ -1719,7 +1732,7 @@ static int scx_ops_prepare_task(struct task_struct *p, struct task_group *tg)
p->scx.disallow = false;

if (SCX_HAS_OP(prep_enable)) {
- struct scx_enable_args args = { };
+ struct scx_enable_args args = { .cgroup = tg_cgrp(tg) };

ret = SCX_CALL_OP_RET(SCX_KF_SLEEPABLE, prep_enable, p, &args);
if (unlikely(ret)) {
@@ -1759,7 +1772,8 @@ static void scx_ops_enable_task(struct task_struct *p)
WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_OPS_PREPPED));

if (SCX_HAS_OP(enable)) {
- struct scx_enable_args args = { };
+ struct scx_enable_args args =
+ { .cgroup = tg_cgrp(p->sched_task_group) };
scx_ops.enable(p, &args);
}
p->scx.flags &= ~SCX_TASK_OPS_PREPPED;
@@ -1772,7 +1786,8 @@ static void scx_ops_disable_task(struct task_struct *p)

if (p->scx.flags & SCX_TASK_OPS_PREPPED) {
if (SCX_HAS_OP(cancel_enable)) {
- struct scx_enable_args args = { };
+ struct scx_enable_args args =
+ { .cgroup = tg_cgrp(task_group(p)) };
scx_ops.cancel_enable(p, &args);
}
p->scx.flags &= ~SCX_TASK_OPS_PREPPED;
@@ -1926,6 +1941,165 @@ bool scx_can_stop_tick(struct rq *rq)
}
#endif

+#ifdef CONFIG_EXT_GROUP_SCHED
+
+DEFINE_STATIC_PERCPU_RWSEM(scx_cgroup_rwsem);
+
+int scx_tg_online(struct task_group *tg)
+{
+ int ret = 0;
+
+ WARN_ON_ONCE(tg->scx_flags & (SCX_TG_ONLINE | SCX_TG_INITED));
+
+ percpu_down_read(&scx_cgroup_rwsem);
+
+ if (SCX_HAS_OP(cgroup_init)) {
+ struct scx_cgroup_init_args args = { .weight = tg->scx_weight };
+
+ ret = SCX_CALL_OP_RET(SCX_KF_SLEEPABLE, cgroup_init,
+ tg->css.cgroup, &args);
+ if (!ret)
+ tg->scx_flags |= SCX_TG_ONLINE | SCX_TG_INITED;
+ else
+ ret = ops_sanitize_err("cgroup_init", ret);
+ } else {
+ tg->scx_flags |= SCX_TG_ONLINE;
+ }
+
+ percpu_up_read(&scx_cgroup_rwsem);
+ return ret;
+}
+
+void scx_tg_offline(struct task_group *tg)
+{
+ WARN_ON_ONCE(!(tg->scx_flags & SCX_TG_ONLINE));
+
+ percpu_down_read(&scx_cgroup_rwsem);
+
+ if (SCX_HAS_OP(cgroup_exit) && (tg->scx_flags & SCX_TG_INITED))
+ scx_ops.cgroup_exit(tg->css.cgroup);
+ tg->scx_flags &= ~(SCX_TG_ONLINE | SCX_TG_INITED);
+
+ percpu_up_read(&scx_cgroup_rwsem);
+}
+
+int scx_cgroup_can_attach(struct cgroup_taskset *tset)
+{
+ struct cgroup_subsys_state *css;
+ struct task_struct *p;
+ int ret;
+
+ /* released in scx_finish/cancel_attach() */
+ percpu_down_read(&scx_cgroup_rwsem);
+
+ if (!scx_enabled())
+ return 0;
+
+ cgroup_taskset_for_each(p, css, tset) {
+ struct cgroup *from = tg_cgrp(p->sched_task_group);
+
+ if (SCX_HAS_OP(cgroup_prep_move)) {
+ ret = SCX_CALL_OP_RET(SCX_KF_SLEEPABLE, cgroup_prep_move,
+ p, from, css->cgroup);
+ if (ret)
+ goto err;
+ }
+
+ WARN_ON_ONCE(p->scx.cgrp_moving_from);
+ p->scx.cgrp_moving_from = from;
+ }
+
+ return 0;
+
+err:
+ cgroup_taskset_for_each(p, css, tset) {
+ if (!p->scx.cgrp_moving_from)
+ break;
+ if (SCX_HAS_OP(cgroup_cancel_move))
+ scx_ops.cgroup_cancel_move(p, p->scx.cgrp_moving_from,
+ css->cgroup);
+ p->scx.cgrp_moving_from = NULL;
+ }
+
+ percpu_up_read(&scx_cgroup_rwsem);
+ return ops_sanitize_err("cgroup_prep_move", ret);
+}
+
+void scx_move_task(struct task_struct *p)
+{
+ /*
+ * We're called from sched_move_task() which handles both cgroup and
+ * autogroup moves. Ignore the latter.
+ */
+ if (task_group_is_autogroup(p->sched_task_group))
+ return;
+
+ if (!scx_enabled())
+ return;
+
+ if (SCX_HAS_OP(cgroup_move)) {
+ WARN_ON_ONCE(!p->scx.cgrp_moving_from);
+ scx_ops.cgroup_move(p, p->scx.cgrp_moving_from,
+ tg_cgrp(p->sched_task_group));
+ }
+ p->scx.cgrp_moving_from = NULL;
+}
+
+void scx_cgroup_finish_attach(void)
+{
+ percpu_up_read(&scx_cgroup_rwsem);
+}
+
+void scx_cgroup_cancel_attach(struct cgroup_taskset *tset)
+{
+ struct cgroup_subsys_state *css;
+ struct task_struct *p;
+
+ if (!scx_enabled())
+ goto out_unlock;
+
+ cgroup_taskset_for_each(p, css, tset) {
+ if (SCX_HAS_OP(cgroup_cancel_move)) {
+ WARN_ON_ONCE(!p->scx.cgrp_moving_from);
+ scx_ops.cgroup_cancel_move(p, p->scx.cgrp_moving_from,
+ css->cgroup);
+ }
+ p->scx.cgrp_moving_from = NULL;
+ }
+out_unlock:
+ percpu_up_read(&scx_cgroup_rwsem);
+}
+
+void scx_group_set_weight(struct task_group *tg, unsigned long weight)
+{
+ percpu_down_read(&scx_cgroup_rwsem);
+
+ if (tg->scx_weight != weight) {
+ if (SCX_HAS_OP(cgroup_set_weight))
+ scx_ops.cgroup_set_weight(tg_cgrp(tg), weight);
+ tg->scx_weight = weight;
+ }
+
+ percpu_up_read(&scx_cgroup_rwsem);
+}
+
+static void scx_cgroup_lock(void)
+{
+ percpu_down_write(&scx_cgroup_rwsem);
+}
+
+static void scx_cgroup_unlock(void)
+{
+ percpu_up_write(&scx_cgroup_rwsem);
+}
+
+#else /* CONFIG_EXT_GROUP_SCHED */
+
+static inline void scx_cgroup_lock(void) {}
+static inline void scx_cgroup_unlock(void) {}
+
+#endif /* CONFIG_EXT_GROUP_SCHED */
+
/*
* Omitted operations:
*
@@ -2055,6 +2229,131 @@ static void destroy_dsq(u64 dsq_id)
rcu_read_unlock();
}

+#ifdef CONFIG_EXT_GROUP_SCHED
+static void scx_cgroup_exit(void)
+{
+ struct cgroup_subsys_state *css;
+
+ percpu_rwsem_assert_held(&scx_cgroup_rwsem);
+
+ /*
+ * scx_tg_on/offline() are excluded through scx_cgroup_rwsem. If we walk
+ * cgroups and exit all the inited ones, all online cgroups are exited.
+ */
+ rcu_read_lock();
+ css_for_each_descendant_post(css, &root_task_group.css) {
+ struct task_group *tg = css_tg(css);
+
+ if (!(tg->scx_flags & SCX_TG_INITED))
+ continue;
+ tg->scx_flags &= ~SCX_TG_INITED;
+
+ if (!scx_ops.cgroup_exit)
+ continue;
+
+ if (WARN_ON_ONCE(!css_tryget(css)))
+ continue;
+ rcu_read_unlock();
+
+ scx_ops.cgroup_exit(css->cgroup);
+
+ rcu_read_lock();
+ css_put(css);
+ }
+ rcu_read_unlock();
+}
+
+static int scx_cgroup_init(void)
+{
+ struct cgroup_subsys_state *css;
+ int ret;
+
+ percpu_rwsem_assert_held(&scx_cgroup_rwsem);
+
+ /*
+ * scx_tg_on/offline() are excluded thorugh scx_cgroup_rwsem. If we walk
+ * cgroups and init, all online cgroups are initialized.
+ */
+ rcu_read_lock();
+ css_for_each_descendant_pre(css, &root_task_group.css) {
+ struct task_group *tg = css_tg(css);
+ struct scx_cgroup_init_args args = { .weight = tg->scx_weight };
+
+ if ((tg->scx_flags &
+ (SCX_TG_ONLINE | SCX_TG_INITED)) != SCX_TG_ONLINE)
+ continue;
+
+ if (!scx_ops.cgroup_init) {
+ tg->scx_flags |= SCX_TG_INITED;
+ continue;
+ }
+
+ if (WARN_ON_ONCE(!css_tryget(css)))
+ continue;
+ rcu_read_unlock();
+
+ ret = SCX_CALL_OP_RET(SCX_KF_SLEEPABLE, cgroup_init,
+ css->cgroup, &args);
+ if (ret) {
+ css_put(css);
+ return ret;
+ }
+ tg->scx_flags |= SCX_TG_INITED;
+
+ rcu_read_lock();
+ css_put(css);
+ }
+ rcu_read_unlock();
+
+ return 0;
+}
+
+static void scx_cgroup_config_knobs(void)
+{
+ static DEFINE_MUTEX(cgintf_mutex);
+ DECLARE_BITMAP(mask, CPU_CFTYPE_CNT) = { };
+ u64 knob_flags;
+ int i;
+
+ /*
+ * Called from both class switch and ops enable/disable paths,
+ * synchronize internally.
+ */
+ mutex_lock(&cgintf_mutex);
+
+ /* if fair is in use, all knobs should be shown */
+ if (!scx_switched_all()) {
+ bitmap_fill(mask, CPU_CFTYPE_CNT);
+ goto apply;
+ }
+
+ /*
+ * On ext, only show the supported knobs. Otherwise, show all possible
+ * knobs so that configuration attempts succeed and the states are
+ * remembered while ops is not loaded.
+ */
+ if (scx_enabled())
+ knob_flags = scx_ops.flags;
+ else
+ knob_flags = SCX_OPS_ALL_FLAGS;
+
+ if (knob_flags & SCX_OPS_CGROUP_KNOB_WEIGHT) {
+ __set_bit(CPU_CFTYPE_WEIGHT, mask);
+ __set_bit(CPU_CFTYPE_WEIGHT_NICE, mask);
+ }
+apply:
+ for (i = 0; i < CPU_CFTYPE_CNT; i++)
+ cgroup_show_cftype(&cpu_cftypes[i], test_bit(i, mask));
+
+ mutex_unlock(&cgintf_mutex);
+}
+
+#else
+static void scx_cgroup_exit(void) {}
+static int scx_cgroup_init(void) { return 0; }
+static void scx_cgroup_config_knobs(void) {}
+#endif
+
/*
* Used by sched_fork() and __setscheduler_prio() to pick the matching
* sched_class. dl/rt are already handled.
@@ -2198,9 +2497,10 @@ static void scx_ops_disable_workfn(struct kthread_work *work)
static_branch_disable(&__scx_switched_all);
WRITE_ONCE(scx_switching_all, false);

- /* avoid racing against fork */
+ /* avoid racing against fork and cgroup changes */
cpus_read_lock();
percpu_down_write(&scx_fork_rwsem);
+ scx_cgroup_lock();

spin_lock_irq(&scx_tasks_lock);
scx_task_iter_init(&sti);
@@ -2237,6 +2537,9 @@ static void scx_ops_disable_workfn(struct kthread_work *work)
static_branch_disable_cpuslocked(&scx_builtin_idle_enabled);
synchronize_rcu();

+ scx_cgroup_exit();
+
+ scx_cgroup_unlock();
percpu_up_write(&scx_fork_rwsem);
cpus_read_unlock();

@@ -2275,6 +2578,8 @@ static void scx_ops_disable_workfn(struct kthread_work *work)

WARN_ON_ONCE(scx_ops_set_enable_state(SCX_OPS_DISABLED) !=
SCX_OPS_DISABLING);
+
+ scx_cgroup_config_knobs();
}

static DEFINE_KTHREAD_WORK(scx_ops_disable_work, scx_ops_disable_workfn);
@@ -2420,10 +2725,11 @@ static int scx_ops_enable(struct sched_ext_ops *ops)
scx_watchdog_timeout / 2);

/*
- * Lock out forks before opening the floodgate so that they don't wander
- * into the operations prematurely.
+ * Lock out forks, cgroup on/offlining and moves before opening the
+ * floodgate so that they don't wander into the operations prematurely.
*/
percpu_down_write(&scx_fork_rwsem);
+ scx_cgroup_lock();

for (i = 0; i < SCX_NR_ONLINE_OPS; i++)
if (((void (**)(void))ops)[i])
@@ -2442,6 +2748,14 @@ static int scx_ops_enable(struct sched_ext_ops *ops)
static_branch_disable_cpuslocked(&scx_builtin_idle_enabled);
}

+ /*
+ * All cgroups should be initialized before letting in tasks. cgroup
+ * on/offlining and task migrations are already locked out.
+ */
+ ret = scx_cgroup_init();
+ if (ret)
+ goto err_disable_unlock;
+
static_branch_enable_cpuslocked(&__scx_ops_enabled);

/*
@@ -2524,6 +2838,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops)

spin_unlock_irq(&scx_tasks_lock);
preempt_enable();
+ scx_cgroup_unlock();
percpu_up_write(&scx_fork_rwsem);

if (!scx_ops_tryset_enable_state(SCX_OPS_ENABLED, SCX_OPS_ENABLING)) {
@@ -2537,6 +2852,8 @@ static int scx_ops_enable(struct sched_ext_ops *ops)
cpus_read_unlock();
mutex_unlock(&scx_ops_enable_mutex);

+ scx_cgroup_config_knobs();
+
return 0;

err_unlock:
@@ -2544,6 +2861,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops)
return ret;

err_disable_unlock:
+ scx_cgroup_unlock();
percpu_up_write(&scx_fork_rwsem);
err_disable:
cpus_read_unlock();
@@ -2707,6 +3025,9 @@ static int bpf_scx_check_member(const struct btf_type *t,

switch (moff) {
case offsetof(struct sched_ext_ops, prep_enable):
+ case offsetof(struct sched_ext_ops, cgroup_init):
+ case offsetof(struct sched_ext_ops, cgroup_exit):
+ case offsetof(struct sched_ext_ops, cgroup_prep_move):
case offsetof(struct sched_ext_ops, init):
case offsetof(struct sched_ext_ops, exit):
break;
@@ -2805,7 +3126,8 @@ void __init init_sched_ext_class(void)
* definitions so that BPF scheduler implementations can use them
* through the generated vmlinux.h.
*/
- WRITE_ONCE(v, SCX_WAKE_EXEC | SCX_ENQ_WAKEUP | SCX_DEQ_SLEEP);
+ WRITE_ONCE(v, SCX_WAKE_EXEC | SCX_ENQ_WAKEUP | SCX_DEQ_SLEEP |
+ SCX_TG_ONLINE);

BUG_ON(rhashtable_init(&dsq_hash, &dsq_hash_params));
init_dsq(&scx_dsq_global, SCX_DSQ_GLOBAL);
@@ -2826,6 +3148,7 @@ void __init init_sched_ext_class(void)

register_sysrq_key('S', &sysrq_sched_ext_reset_op);
INIT_DELAYED_WORK(&scx_watchdog_work, scx_watchdog_workfn);
+ scx_cgroup_config_knobs();
}


@@ -2869,8 +3192,8 @@ static const struct btf_kfunc_id_set scx_kfunc_set_init = {
* @dsq_id: DSQ to create
* @node: NUMA node to allocate from
*
- * Create a custom DSQ identified by @dsq_id. Can be called from ops.init() and
- * ops.prep_enable().
+ * Create a custom DSQ identified by @dsq_id. Can be called from ops.init(),
+ * ops.prep_enable(), ops.cgroup_init() and ops.cgroup_prep_move().
*/
s32 scx_bpf_create_dsq(u64 dsq_id, s32 node)
{
diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h
index 9c9284f91e38..9a60b81d787e 100644
--- a/kernel/sched/ext.h
+++ b/kernel/sched/ext.h
@@ -59,6 +59,11 @@ enum scx_deq_flags {
SCX_DEQ_SLEEP = DEQUEUE_SLEEP,
};

+enum scx_tg_flags {
+ SCX_TG_ONLINE = 1U << 0,
+ SCX_TG_INITED = 1U << 1,
+};
+
enum scx_kick_flags {
SCX_KICK_PREEMPT = 1LLU << 0, /* force scheduling on the CPU */
};
@@ -162,3 +167,21 @@ static inline void scx_update_idle(struct rq *rq, bool idle)
#else
static inline void scx_update_idle(struct rq *rq, bool idle) {}
#endif
+
+#ifdef CONFIG_EXT_GROUP_SCHED
+int scx_tg_online(struct task_group *tg);
+void scx_tg_offline(struct task_group *tg);
+int scx_cgroup_can_attach(struct cgroup_taskset *tset);
+void scx_move_task(struct task_struct *p);
+void scx_cgroup_finish_attach(void);
+void scx_cgroup_cancel_attach(struct cgroup_taskset *tset);
+void scx_group_set_weight(struct task_group *tg, unsigned long cgrp_weight);
+#else /* CONFIG_EXT_GROUP_SCHED */
+static inline int scx_tg_online(struct task_group *tg) { return 0; }
+static inline void scx_tg_offline(struct task_group *tg) {}
+static inline int scx_cgroup_can_attach(struct cgroup_taskset *tset) { return 0; }
+static inline void scx_move_task(struct task_struct *p) {}
+static inline void scx_cgroup_finish_attach(void) {}
+static inline void scx_cgroup_cancel_attach(struct cgroup_taskset *tset) {}
+static inline void scx_group_set_weight(struct task_group *tg, unsigned long cgrp_weight) {}
+#endif /* CONFIG_EXT_GROUP_SCHED */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2447af22783c..abb8ec22b6ef 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -406,6 +406,11 @@ struct task_group {
struct rt_bandwidth rt_bandwidth;
#endif

+#ifdef CONFIG_EXT_GROUP_SCHED
+ u32 scx_flags; /* SCX_TG_* */
+ u32 scx_weight;
+#endif
+
struct rcu_head rcu;
struct list_head list;

@@ -530,6 +535,11 @@ extern void set_task_rq_fair(struct sched_entity *se,
static inline void set_task_rq_fair(struct sched_entity *se,
struct cfs_rq *prev, struct cfs_rq *next) { }
#endif /* CONFIG_SMP */
+#else /* CONFIG_FAIR_GROUP_SCHED */
+static inline int sched_group_set_shares(struct task_group *tg, unsigned long shares)
+{
+ return 0;
+}
#endif /* CONFIG_FAIR_GROUP_SCHED */

#else /* CONFIG_CGROUP_SCHED */
@@ -3372,7 +3382,7 @@ static inline void update_current_exec_runtime(struct task_struct *curr,

#ifdef CONFIG_CGROUP_SCHED
enum cpu_cftype_id {
-#ifdef CONFIG_FAIR_GROUP_SCHED
+#if defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_EXT_GROUP_SCHED)
CPU_CFTYPE_WEIGHT,
CPU_CFTYPE_WEIGHT_NICE,
CPU_CFTYPE_IDLE,
diff --git a/tools/sched_ext/.gitignore b/tools/sched_ext/.gitignore
index 389f0e5b0970..ebc34dcf925b 100644
--- a/tools/sched_ext/.gitignore
+++ b/tools/sched_ext/.gitignore
@@ -1,6 +1,7 @@
scx_example_dummy
scx_example_qmap
scx_example_central
+scx_example_pair
*.skel.h
*.subskel.h
/tools/
diff --git a/tools/sched_ext/Makefile b/tools/sched_ext/Makefile
index c6c3669c47b9..2303736698a2 100644
--- a/tools/sched_ext/Makefile
+++ b/tools/sched_ext/Makefile
@@ -115,7 +115,7 @@ BPF_CFLAGS = -g -D__TARGET_ARCH_$(SRCARCH) \
-Wno-compare-distinct-pointer-types \
-O2 -mcpu=v3

-all: scx_example_dummy scx_example_qmap scx_example_central
+all: scx_example_dummy scx_example_qmap scx_example_central scx_example_pair

# sort removes libbpf duplicates when not cross-building
MAKE_DIRS := $(sort $(BUILD_DIR)/libbpf $(HOST_BUILD_DIR)/libbpf \
@@ -178,10 +178,15 @@ scx_example_central: scx_example_central.c scx_example_central.skel.h user_exit_
$(CC) $(CFLAGS) -c $< -o [email protected]
$(CC) -o $@ [email protected] $(HOST_BPFOBJ) $(LDFLAGS)

+scx_example_pair: scx_example_pair.c scx_example_pair.skel.h user_exit_info.h
+ $(CC) $(CFLAGS) -c $< -o [email protected]
+ $(CC) -o $@ [email protected] $(HOST_BPFOBJ) $(LDFLAGS)
+
clean:
rm -rf $(SCRATCH_DIR) $(HOST_SCRATCH_DIR)
rm -f *.o *.bpf.o *.skel.h *.subskel.h
- rm -f scx_example_dummy scx_example_qmap scx_example_central
+ rm -f scx_example_dummy scx_example_qmap scx_example_central \
+ scx_example_pair

.PHONY: all clean

diff --git a/tools/sched_ext/scx_example_pair.bpf.c b/tools/sched_ext/scx_example_pair.bpf.c
new file mode 100644
index 000000000000..8e277225b044
--- /dev/null
+++ b/tools/sched_ext/scx_example_pair.bpf.c
@@ -0,0 +1,555 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * A demo sched_ext core-scheduler which always makes every sibling CPU pair
+ * execute from the same CPU cgroup.
+ *
+ * Each CPU in the system is paired with exactly one other CPU, according to a
+ * "stride" value that can be specified when the BPF scheduler program is first
+ * loaded. Throughout the runtime of the scheduler, these CPU pairs guarantee
+ * that they will only ever schedule tasks that belong to the same CPU cgroup.
+ *
+ * Scheduler Initialization
+ * ------------------------
+ *
+ * The scheduler BPF program is first initialized from user space, before it is
+ * enabled. During this initialization process, each CPU on the system is
+ * assigned several values that are constant throughout its runtime:
+ *
+ * 1. *Pair CPU*: The CPU that it synchronizes with when making scheduling
+ * decisions. Paired CPUs always schedule tasks from the same
+ * CPU cgroup, and synchronize with each other to guarantee
+ * that this constraint is not violated.
+ * 2. *Pair ID*: Each CPU pair is assigned a Pair ID, which is used to access
+ * a struct pair_ctx object that is shared between the pair.
+ * 3. *In-pair-index*: An index, 0 or 1, that is assigned to each core in the
+ * pair. Each struct pair_ctx has an active_mask field,
+ * which is a bitmap used to indicate whether each core
+ * in the pair currently has an actively running task.
+ * This index specifies which entry in the bitmap corresponds
+ * to each CPU in the pair.
+ *
+ * During this initialization, the CPUs are paired according to a "stride" that
+ * may be specified when invoking the user space program that initializes and
+ * loads the scheduler. By default, the stride is 1/2 the total number of CPUs.
+ *
+ * Tasks and cgroups
+ * -----------------
+ *
+ * Every cgroup in the system is registered with the scheduler using the
+ * pair_cgroup_init() callback, and every task in the system is associated with
+ * exactly one cgroup. At a high level, the idea with the pair scheduler is to
+ * always schedule tasks from the same cgroup within a given CPU pair. When a
+ * task is enqueued (i.e. passed to the pair_enqueue() callback function), its
+ * cgroup ID is read from its task struct, and then a corresponding queue map
+ * is used to FIFO-enqueue the task for that cgroup.
+ *
+ * If you look through the implementation of the scheduler, you'll notice that
+ * there is quite a bit of complexity involved with looking up the per-cgroup
+ * FIFO queue that we enqueue tasks in. For example, there is a cgrp_q_idx_hash
+ * BPF hash map that is used to map a cgroup ID to a globally unique ID that's
+ * allocated in the BPF program. This is done because we use separate maps to
+ * store the FIFO queue of tasks, and the length of that map, per cgroup. This
+ * complexity is only present because of current deficiencies in BPF that will
+ * soon be addressed. The main point to keep in mind is that newly enqueued
+ * tasks are added to their cgroup's FIFO queue.
+ *
+ * Dispatching tasks
+ * -----------------
+ *
+ * This section will describe how enqueued tasks are dispatched and scheduled.
+ * Tasks are dispatched in pair_dispatch(), and at a high level the workflow is
+ * as follows:
+ *
+ * 1. Fetch the struct pair_ctx for the current CPU. As mentioned above, this is
+ * the structure that's used to synchronize amongst the two pair CPUs in their
+ * scheduling decisions. After any of the following events have occurred:
+ *
+ * - The cgroup's slice run has expired, or
+ * - The cgroup becomes empty, or
+ * - Either CPU in the pair is preempted by a higher priority scheduling class
+ *
+ * The cgroup transitions to the draining state and stops executing new tasks
+ * from the cgroup.
+ *
+ * 2. If the pair is still executing a task, mark the pair_ctx as draining, and
+ * wait for the pair CPU to be preempted.
+ *
+ * 3. Otherwise, if the pair CPU is not running a task, we can move onto
+ * scheduling new tasks. Pop the next cgroup id from the top_q queue.
+ *
+ * 4. Pop a task from that cgroup's FIFO task queue, and begin executing it.
+ *
+ * Note again that this scheduling behavior is simple, but the implementation
+ * is complex mostly because this it hits several BPF shortcomings and has to
+ * work around in often awkward ways. Most of the shortcomings are expected to
+ * be resolved in the near future which should allow greatly simplifying this
+ * scheduler.
+ *
+ * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2022 Tejun Heo <[email protected]>
+ * Copyright (c) 2022 David Vernet <[email protected]>
+ */
+#include "scx_common.bpf.h"
+#include "scx_example_pair.h"
+
+char _license[] SEC("license") = "GPL";
+
+const volatile bool switch_all;
+
+const volatile u32 nr_cpu_ids;
+
+/* a pair of CPUs stay on a cgroup for this duration */
+const volatile u32 pair_batch_dur_ns = SCX_SLICE_DFL;
+
+/* cpu ID -> pair cpu ID */
+const volatile s32 pair_cpu[MAX_CPUS] = { [0 ... MAX_CPUS - 1] = -1 };
+
+/* cpu ID -> pair_id */
+const volatile u32 pair_id[MAX_CPUS];
+
+/* CPU ID -> CPU # in the pair (0 or 1) */
+const volatile u32 in_pair_idx[MAX_CPUS];
+
+struct pair_ctx {
+ struct bpf_spin_lock lock;
+
+ /* the cgroup the pair is currently executing */
+ u64 cgid;
+
+ /* the pair started executing the current cgroup at */
+ u64 started_at;
+
+ /* whether the current cgroup is draining */
+ bool draining;
+
+ /* the CPUs that are currently active on the cgroup */
+ u32 active_mask;
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __uint(max_entries, MAX_CPUS / 2);
+ __type(key, u32);
+ __type(value, struct pair_ctx);
+} pair_ctx SEC(".maps");
+
+/* queue of cgrp_q's possibly with tasks on them */
+struct {
+ __uint(type, BPF_MAP_TYPE_QUEUE);
+ /*
+ * Because it's difficult to build strong synchronization encompassing
+ * multiple non-trivial operations in BPF, this queue is managed in an
+ * opportunistic way so that we guarantee that a cgroup w/ active tasks
+ * is always on it but possibly multiple times. Once we have more robust
+ * synchronization constructs and e.g. linked list, we should be able to
+ * do this in a prettier way but for now just size it big enough.
+ */
+ __uint(max_entries, 4 * MAX_CGRPS);
+ __type(value, u64);
+} top_q SEC(".maps");
+
+/* per-cgroup q which FIFOs the tasks from the cgroup */
+struct cgrp_q {
+ __uint(type, BPF_MAP_TYPE_QUEUE);
+ __uint(max_entries, MAX_QUEUED);
+ __type(value, u32);
+};
+
+/*
+ * Ideally, we want to allocate cgrp_q and cgrq_q_len in the cgroup local
+ * storage; however, a cgroup local storage can only be accessed from the BPF
+ * progs attached to the cgroup. For now, work around by allocating array of
+ * cgrp_q's and then allocating per-cgroup indices.
+ *
+ * Another caveat: It's difficult to populate a large array of maps statically
+ * or from BPF. Initialize it from userland.
+ */
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
+ __uint(max_entries, MAX_CGRPS);
+ __type(key, s32);
+ __array(values, struct cgrp_q);
+} cgrp_q_arr SEC(".maps");
+
+static u64 cgrp_q_len[MAX_CGRPS];
+
+/*
+ * This and cgrp_q_idx_hash combine into a poor man's IDR. This likely would be
+ * useful to have as a map type.
+ */
+static u32 cgrp_q_idx_cursor;
+static u64 cgrp_q_idx_busy[MAX_CGRPS];
+
+/*
+ * All added up, the following is what we do:
+ *
+ * 1. When a cgroup is enabled, RR cgroup_q_idx_busy array doing cmpxchg looking
+ * for a free ID. If not found, fail cgroup creation with -EBUSY.
+ *
+ * 2. Hash the cgroup ID to the allocated cgrp_q_idx in the following
+ * cgrp_q_idx_hash.
+ *
+ * 3. Whenever a cgrp_q needs to be accessed, first look up the cgrp_q_idx from
+ * cgrp_q_idx_hash and then access the corresponding entry in cgrp_q_arr.
+ *
+ * This is sadly complicated for something pretty simple. Hopefully, we should
+ * be able to simplify in the future.
+ */
+struct {
+ __uint(type, BPF_MAP_TYPE_HASH);
+ __uint(max_entries, MAX_CGRPS);
+ __uint(key_size, sizeof(u64)); /* cgrp ID */
+ __uint(value_size, sizeof(s32)); /* cgrp_q idx */
+} cgrp_q_idx_hash SEC(".maps");
+
+/* statistics */
+u64 nr_total, nr_dispatched, nr_missing, nr_kicks, nr_preemptions;
+u64 nr_exps, nr_exp_waits, nr_exp_empty;
+u64 nr_cgrp_next, nr_cgrp_coll, nr_cgrp_empty;
+
+struct user_exit_info uei;
+
+static bool time_before(u64 a, u64 b)
+{
+ return (s64)(a - b) < 0;
+}
+
+void BPF_STRUCT_OPS(pair_enqueue, struct task_struct *p, u64 enq_flags)
+{
+ s32 pid = p->pid;
+ u64 cgid = p->sched_task_group->css.cgroup->kn->id;
+ u32 *q_idx;
+ struct cgrp_q *cgq;
+ u64 *cgq_len;
+
+ __sync_fetch_and_add(&nr_total, 1);
+
+ /* find the cgroup's q and push @p into it */
+ q_idx = bpf_map_lookup_elem(&cgrp_q_idx_hash, &cgid);
+ if (!q_idx) {
+ scx_bpf_error("failed to lookup q_idx for cgroup[%llu]", cgid);
+ return;
+ }
+
+ cgq = bpf_map_lookup_elem(&cgrp_q_arr, q_idx);
+ if (!cgq) {
+ scx_bpf_error("failed to lookup q_arr for cgroup[%llu] q_idx[%u]",
+ cgid, *q_idx);
+ return;
+ }
+
+ if (bpf_map_push_elem(cgq, &pid, 0)) {
+ scx_bpf_error("cgroup[%llu] queue overflow", cgid);
+ return;
+ }
+
+ /* bump q len, if going 0 -> 1, queue cgroup into the top_q */
+ cgq_len = MEMBER_VPTR(cgrp_q_len, [*q_idx]);
+ if (!cgq_len) {
+ scx_bpf_error("MEMBER_VTPR malfunction");
+ return;
+ }
+
+ if (!__sync_fetch_and_add(cgq_len, 1) &&
+ bpf_map_push_elem(&top_q, &cgid, 0)) {
+ scx_bpf_error("top_q overflow");
+ return;
+ }
+}
+
+/* find the next cgroup to execute and return it in *data */
+static int next_cgid_loopfn(u32 idx, void *data)
+{
+ u64 cgid;
+ u32 *q_idx;
+ u64 *cgq_len;
+
+ if (bpf_map_pop_elem(&top_q, &cgid))
+ return 1;
+
+ q_idx = bpf_map_lookup_elem(&cgrp_q_idx_hash, &cgid);
+ if (!q_idx)
+ return 0;
+
+ /* this is the only place where empty cgroups are taken off the top_q */
+ cgq_len = MEMBER_VPTR(cgrp_q_len, [*q_idx]);
+ if (!cgq_len || !*cgq_len)
+ return 0;
+
+ /* if it has any tasks, requeue as we may race and not execute it */
+ bpf_map_push_elem(&top_q, &cgid, 0);
+ *(u64 *)data = cgid;
+ return 1;
+}
+
+struct claim_task_loopctx {
+ u32 q_idx;
+ bool claimed;
+};
+
+/* claim one task from the specified cgq */
+static int claim_task_loopfn(u32 idx, void *data)
+{
+ struct claim_task_loopctx *claimc = data;
+ u64 *cgq_len;
+ u64 len;
+
+ cgq_len = MEMBER_VPTR(cgrp_q_len, [claimc->q_idx]);
+ if (!cgq_len)
+ return 1;
+
+ len = *cgq_len;
+ if (!len)
+ return 1;
+
+ if (__sync_val_compare_and_swap(cgq_len, len, len - 1) != len)
+ return 0;
+
+ claimc->claimed = true;
+ return 1;
+}
+
+static int lookup_pairc_and_mask(s32 cpu, struct pair_ctx **pairc, u32 *mask)
+{
+ u32 *vptr, in_pair_mask;
+ int err;
+
+ vptr = (u32 *)MEMBER_VPTR(pair_id, [cpu]);
+ if (!vptr)
+ return -EINVAL;
+
+ *pairc = bpf_map_lookup_elem(&pair_ctx, vptr);
+ if (!(*pairc))
+ return -EINVAL;
+
+ vptr = (u32 *)MEMBER_VPTR(in_pair_idx, [cpu]);
+ if (!vptr)
+ return -EINVAL;
+
+ *mask = 1U << *vptr;
+
+ return 0;
+}
+
+static int dispatch_loopfn(u32 idx, void *data)
+{
+ s32 cpu = *(s32 *)data;
+ struct pair_ctx *pairc;
+ struct bpf_map *cgq_map;
+ struct claim_task_loopctx claimc;
+ struct task_struct *p;
+ u64 now = bpf_ktime_get_ns();
+ bool kick_pair = false;
+ bool expired;
+ u32 *vptr, in_pair_mask;
+ s32 pid;
+ u64 cgid;
+ int ret;
+
+ ret = lookup_pairc_and_mask(cpu, &pairc, &in_pair_mask);
+ if (ret) {
+ scx_bpf_error("failed to lookup pairc and in_pair_mask for cpu[%d]",
+ cpu);
+ return 1;
+ }
+
+ bpf_spin_lock(&pairc->lock);
+ pairc->active_mask &= ~in_pair_mask;
+
+ expired = time_before(pairc->started_at + pair_batch_dur_ns, now);
+ if (expired || pairc->draining) {
+ u64 new_cgid = 0;
+
+ __sync_fetch_and_add(&nr_exps, 1);
+
+ /*
+ * We're done with the current cgid. An obvious optimization
+ * would be not draining if the next cgroup is the current one.
+ * For now, be dumb and always expire.
+ */
+ pairc->draining = true;
+
+ if (pairc->active_mask) {
+ /*
+ * The other CPU is still active We want to wait until
+ * this cgroup expires.
+ *
+ * If the pair controls its CPU, and the time already
+ * expired, kick. When the other CPU arrives at
+ * dispatch and clears its active mask, it'll push the
+ * pair to the next cgroup and kick this CPU.
+ */
+ __sync_fetch_and_add(&nr_exp_waits, 1);
+ bpf_spin_unlock(&pairc->lock);
+ if (expired)
+ kick_pair = true;
+ goto out_maybe_kick;
+ }
+
+ bpf_spin_unlock(&pairc->lock);
+
+ /*
+ * Pick the next cgroup. It'd be easier / cleaner to not drop
+ * pairc->lock and use stronger synchronization here especially
+ * given that we'll be switching cgroups significantly less
+ * frequently than tasks. Unfortunately, bpf_spin_lock can't
+ * really protect anything non-trivial. Let's do opportunistic
+ * operations instead.
+ */
+ bpf_loop(1 << 23, next_cgid_loopfn, &new_cgid, 0);
+ /* no active cgroup, go idle */
+ if (!new_cgid) {
+ __sync_fetch_and_add(&nr_exp_empty, 1);
+ return 1;
+ }
+
+ bpf_spin_lock(&pairc->lock);
+
+ /*
+ * The other CPU may already have started on a new cgroup while
+ * we dropped the lock. Make sure that we're still draining and
+ * start on the new cgroup.
+ */
+ if (pairc->draining && !pairc->active_mask) {
+ __sync_fetch_and_add(&nr_cgrp_next, 1);
+ pairc->cgid = new_cgid;
+ pairc->started_at = now;
+ pairc->draining = false;
+ kick_pair = true;
+ } else {
+ __sync_fetch_and_add(&nr_cgrp_coll, 1);
+ }
+ }
+
+ cgid = pairc->cgid;
+ pairc->active_mask |= in_pair_mask;
+ bpf_spin_unlock(&pairc->lock);
+
+ /* again, it'd be better to do all these with the lock held, oh well */
+ vptr = bpf_map_lookup_elem(&cgrp_q_idx_hash, &cgid);
+ if (!vptr) {
+ scx_bpf_error("failed to lookup q_idx for cgroup[%llu]", cgid);
+ return 1;
+ }
+
+ claimc = (struct claim_task_loopctx){ .q_idx = *vptr };
+ bpf_loop(1 << 23, claim_task_loopfn, &claimc, 0);
+ if (!claimc.claimed) {
+ /* the cgroup must be empty, expire and repeat */
+ __sync_fetch_and_add(&nr_cgrp_empty, 1);
+ bpf_spin_lock(&pairc->lock);
+ pairc->draining = true;
+ pairc->active_mask &= ~in_pair_mask;
+ bpf_spin_unlock(&pairc->lock);
+ return 0;
+ }
+
+ cgq_map = bpf_map_lookup_elem(&cgrp_q_arr, &claimc.q_idx);
+ if (!cgq_map) {
+ scx_bpf_error("failed to lookup cgq_map for cgroup[%llu] q_idx[%d]",
+ cgid, claimc.q_idx);
+ return 1;
+ }
+
+ if (bpf_map_pop_elem(cgq_map, &pid)) {
+ scx_bpf_error("cgq_map is empty for cgroup[%llu] q_idx[%d]",
+ cgid, claimc.q_idx);
+ return 1;
+ }
+
+ p = bpf_task_from_pid(pid);
+ if (p) {
+ __sync_fetch_and_add(&nr_dispatched, 1);
+ scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, 0);
+ bpf_task_release(p);
+ } else {
+ /* we don't handle dequeues, retry on lost tasks */
+ __sync_fetch_and_add(&nr_missing, 1);
+ return 0;
+ }
+
+out_maybe_kick:
+ if (kick_pair) {
+ s32 *pair = (s32 *)MEMBER_VPTR(pair_cpu, [cpu]);
+ if (pair) {
+ __sync_fetch_and_add(&nr_kicks, 1);
+ scx_bpf_kick_cpu(*pair, SCX_KICK_PREEMPT);
+ }
+ }
+ return 1;
+}
+
+void BPF_STRUCT_OPS(pair_dispatch, s32 cpu, struct task_struct *prev)
+{
+ s32 cpu_on_stack = cpu;
+
+ bpf_loop(1 << 23, dispatch_loopfn, &cpu_on_stack, 0);
+}
+
+static int alloc_cgrp_q_idx_loopfn(u32 idx, void *data)
+{
+ u32 q_idx;
+
+ q_idx = __sync_fetch_and_add(&cgrp_q_idx_cursor, 1) % MAX_CGRPS;
+ if (!__sync_val_compare_and_swap(&cgrp_q_idx_busy[q_idx], 0, 1)) {
+ *(s32 *)data = q_idx;
+ return 1;
+ }
+ return 0;
+}
+
+s32 BPF_STRUCT_OPS(pair_cgroup_init, struct cgroup *cgrp)
+{
+ u64 cgid = cgrp->kn->id;
+ s32 q_idx = -1;
+
+ bpf_loop(MAX_CGRPS, alloc_cgrp_q_idx_loopfn, &q_idx, 0);
+ if (q_idx < 0)
+ return -EBUSY;
+
+ if (bpf_map_update_elem(&cgrp_q_idx_hash, &cgid, &q_idx, BPF_ANY)) {
+ u64 *busy = MEMBER_VPTR(cgrp_q_idx_busy, [q_idx]);
+ if (busy)
+ *busy = 0;
+ return -EBUSY;
+ }
+
+ return 0;
+}
+
+void BPF_STRUCT_OPS(pair_cgroup_exit, struct cgroup *cgrp)
+{
+ u64 cgid = cgrp->kn->id;
+ s32 *q_idx;
+
+ q_idx = bpf_map_lookup_elem(&cgrp_q_idx_hash, &cgid);
+ if (q_idx) {
+ u64 *busy = MEMBER_VPTR(cgrp_q_idx_busy, [*q_idx]);
+ if (busy)
+ *busy = 0;
+ bpf_map_delete_elem(&cgrp_q_idx_hash, &cgid);
+ }
+}
+
+s32 BPF_STRUCT_OPS(pair_init)
+{
+ if (switch_all)
+ scx_bpf_switch_all();
+ return 0;
+}
+
+void BPF_STRUCT_OPS(pair_exit, struct scx_exit_info *ei)
+{
+ uei_record(&uei, ei);
+}
+
+SEC(".struct_ops")
+struct sched_ext_ops pair_ops = {
+ .enqueue = (void *)pair_enqueue,
+ .dispatch = (void *)pair_dispatch,
+ .cgroup_init = (void *)pair_cgroup_init,
+ .cgroup_exit = (void *)pair_cgroup_exit,
+ .init = (void *)pair_init,
+ .exit = (void *)pair_exit,
+ .name = "pair",
+};
diff --git a/tools/sched_ext/scx_example_pair.c b/tools/sched_ext/scx_example_pair.c
new file mode 100644
index 000000000000..255ea7b1235d
--- /dev/null
+++ b/tools/sched_ext/scx_example_pair.c
@@ -0,0 +1,143 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2022 Tejun Heo <[email protected]>
+ * Copyright (c) 2022 David Vernet <[email protected]>
+ */
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <unistd.h>
+#include <signal.h>
+#include <assert.h>
+#include <libgen.h>
+#include <bpf/bpf.h>
+#include "user_exit_info.h"
+#include "scx_example_pair.h"
+#include "scx_example_pair.skel.h"
+
+const char help_fmt[] =
+"A demo sched_ext core-scheduler which always makes every sibling CPU pair\n"
+"execute from the same CPU cgroup.\n"
+"\n"
+"See the top-level comment in .bpf.c for more details.\n"
+"\n"
+"Usage: %s [-a] [-S STRIDE]\n"
+"\n"
+" -a Switch all tasks\n"
+" -S STRIDE Override CPU pair stride (default: nr_cpus_ids / 2)\n"
+" -h Display this help and exit\n";
+
+static volatile int exit_req;
+
+static void sigint_handler(int dummy)
+{
+ exit_req = 1;
+}
+
+int main(int argc, char **argv)
+{
+ struct scx_example_pair *skel;
+ struct bpf_link *link;
+ u64 seq = 0;
+ s32 stride, i, opt, outer_fd;
+
+ signal(SIGINT, sigint_handler);
+ signal(SIGTERM, sigint_handler);
+
+ libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
+
+ skel = scx_example_pair__open();
+ assert(skel);
+
+ skel->rodata->nr_cpu_ids = libbpf_num_possible_cpus();
+
+ /* pair up the earlier half to the latter by default, override with -s */
+ stride = skel->rodata->nr_cpu_ids / 2;
+
+ while ((opt = getopt(argc, argv, "ahS:")) != -1) {
+ switch (opt) {
+ case 'a':
+ skel->rodata->switch_all = true;
+ break;
+ case 'S':
+ stride = strtoul(optarg, NULL, 0);
+ break;
+ default:
+ fprintf(stderr, help_fmt, basename(argv[0]));
+ return opt != 'h';
+ }
+ }
+
+ for (i = 0; i < skel->rodata->nr_cpu_ids; i++) {
+ if (skel->rodata->pair_cpu[i] < 0) {
+ skel->rodata->pair_cpu[i] = i + stride;
+ skel->rodata->pair_cpu[i + stride] = i;
+ skel->rodata->pair_id[i] = i;
+ skel->rodata->pair_id[i + stride] = i;
+ skel->rodata->in_pair_idx[i] = 0;
+ skel->rodata->in_pair_idx[i + stride] = 1;
+ }
+ }
+
+ assert(!scx_example_pair__load(skel));
+
+ /*
+ * Populate the cgrp_q_arr map which is an array containing per-cgroup
+ * queues. It'd probably be better to do this from BPF but there are too
+ * many to initialize statically and there's no way to dynamically
+ * populate from BPF.
+ */
+ outer_fd = bpf_map__fd(skel->maps.cgrp_q_arr);
+ assert(outer_fd >= 0);
+
+ printf("Initializing");
+ for (i = 0; i < MAX_CGRPS; i++) {
+ s32 inner_fd;
+
+ if (exit_req)
+ break;
+
+ inner_fd = bpf_map_create(BPF_MAP_TYPE_QUEUE, NULL, 0,
+ sizeof(u32), MAX_QUEUED, NULL);
+ assert(inner_fd >= 0);
+ assert(!bpf_map_update_elem(outer_fd, &i, &inner_fd, BPF_ANY));
+ close(inner_fd);
+
+ if (!(i % 10))
+ printf(".");
+ fflush(stdout);
+ }
+ printf("\n");
+
+ /*
+ * Fully initialized, attach and run.
+ */
+ link = bpf_map__attach_struct_ops(skel->maps.pair_ops);
+ assert(link);
+
+ while (!exit_req && !uei_exited(&skel->bss->uei)) {
+ printf("[SEQ %lu]\n", seq++);
+ printf(" total:%10lu dispatch:%10lu missing:%10lu\n",
+ skel->bss->nr_total,
+ skel->bss->nr_dispatched,
+ skel->bss->nr_missing);
+ printf(" kicks:%10lu preemptions:%7lu\n",
+ skel->bss->nr_kicks,
+ skel->bss->nr_preemptions);
+ printf(" exp:%10lu exp_wait:%10lu exp_empty:%10lu\n",
+ skel->bss->nr_exps,
+ skel->bss->nr_exp_waits,
+ skel->bss->nr_exp_empty);
+ printf("cgnext:%10lu cgcoll:%10lu cgempty:%10lu\n",
+ skel->bss->nr_cgrp_next,
+ skel->bss->nr_cgrp_coll,
+ skel->bss->nr_cgrp_empty);
+ fflush(stdout);
+ sleep(1);
+ }
+
+ bpf_link__destroy(link);
+ uei_print(&skel->bss->uei);
+ scx_example_pair__destroy(skel);
+ return 0;
+}
diff --git a/tools/sched_ext/scx_example_pair.h b/tools/sched_ext/scx_example_pair.h
new file mode 100644
index 000000000000..f60b824272f7
--- /dev/null
+++ b/tools/sched_ext/scx_example_pair.h
@@ -0,0 +1,10 @@
+#ifndef __SCX_EXAMPLE_PAIR_H
+#define __SCX_EXAMPLE_PAIR_H
+
+enum {
+ MAX_CPUS = 4096,
+ MAX_QUEUED = 4096,
+ MAX_CGRPS = 4096,
+};
+
+#endif /* __SCX_EXAMPLE_PAIR_H */
--
2.39.1


2023-01-28 00:20:09

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 25/30] sched_ext: Implement sched_ext_ops.cpu_acquire/release()

From: David Vernet <[email protected]>

Scheduler classes are strictly ordered and when a higher priority class has
tasks to run, the lower priority ones lose access to the CPU. Being able to
monitor and act on these events are necessary for use cases includling
strict core-scheduling and latency management.

This patch adds two operations ops.cpu_acquire() and .cpu_release(). The
former is invoked when a CPU becomes available to the BPF scheduler and the
opposite for the latter. This patch also implements
scx_bpf_reenqueue_local() which can be called from .cpu_release() to trigger
requeueing of all tasks in the local dsq of the CPU so that the tasks can be
reassigned to other available CPUs.

scx_example_pair is updated to use .cpu_acquire/release() along with
%SCX_KICK_WAIT to make the pair scheduling guarantee strict even when a CPU
is preempted by a higher priority scheduler class.

scx_example_qmap is updated to use .cpu_acquire/release() to empty the local
dsq of a preempted CPU. A similar approach can be adopted by BPF schedulers
that want to have a tight control over latency.

v2: Add p->scx.kf_mask annotation to allow calling scx_bpf_reenqueue_local()
from ops.cpu_release() nested inside ops.init() and other sleepable
operations.

Signed-off-by: David Vernet <[email protected]>
Reviewed-by: Tejun Heo <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
include/linux/sched/ext.h | 52 +++++++++
kernel/sched/ext.c | 139 ++++++++++++++++++++++++-
kernel/sched/ext.h | 22 +++-
kernel/sched/sched.h | 1 +
tools/sched_ext/scx_common.bpf.h | 1 +
tools/sched_ext/scx_example_pair.bpf.c | 101 +++++++++++++++++-
tools/sched_ext/scx_example_qmap.bpf.c | 37 ++++++-
tools/sched_ext/scx_example_qmap.c | 4 +-
8 files changed, 346 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index 11d6902e717d..82ead36d1136 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -134,6 +134,32 @@ struct scx_cgroup_init_args {
u32 weight;
};

+enum scx_cpu_preempt_reason {
+ /* next task is being scheduled by &sched_class_rt */
+ SCX_CPU_PREEMPT_RT,
+ /* next task is being scheduled by &sched_class_dl */
+ SCX_CPU_PREEMPT_DL,
+ /* next task is being scheduled by &sched_class_stop */
+ SCX_CPU_PREEMPT_STOP,
+ /* unknown reason for SCX being preempted */
+ SCX_CPU_PREEMPT_UNKNOWN,
+};
+
+/*
+ * Argument container for ops->cpu_acquire(). Currently empty, but may be
+ * expanded in the future.
+ */
+struct scx_cpu_acquire_args {};
+
+/* argument container for ops->cpu_release() */
+struct scx_cpu_release_args {
+ /* the reason the CPU was preempted */
+ enum scx_cpu_preempt_reason reason;
+
+ /* the task that's going to be scheduled on the CPU */
+ const struct task_struct *task;
+};
+
/**
* struct sched_ext_ops - Operation table for BPF scheduler implementation
*
@@ -320,6 +346,28 @@ struct sched_ext_ops {
*/
void (*update_idle)(s32 cpu, bool idle);

+ /**
+ * cpu_acquire - A CPU is becoming available to the BPF scheduler
+ * @cpu: The CPU being acquired by the BPF scheduler.
+ * @args: Acquire arguments, see the struct definition.
+ *
+ * A CPU that was previously released from the BPF scheduler is now once
+ * again under its control.
+ */
+ void (*cpu_acquire)(s32 cpu, struct scx_cpu_acquire_args *args);
+
+ /**
+ * cpu_release - A CPU is taken away from the BPF scheduler
+ * @cpu: The CPU being released by the BPF scheduler.
+ * @args: Release arguments, see the struct definition.
+ *
+ * The specified CPU is no longer under the control of the BPF
+ * scheduler. This could be because it was preempted by a higher
+ * priority sched_class, though there may be other reasons as well. The
+ * caller should consult @args->reason to determine the cause.
+ */
+ void (*cpu_release)(s32 cpu, struct scx_cpu_release_args *args);
+
/**
* prep_enable - Prepare to enable BPF scheduling for a task
* @p: task to prepare BPF scheduling for
@@ -522,8 +570,12 @@ enum scx_kf_mask {
SCX_KF_INIT = 1 << 0, /* allowed from ops.init() */
SCX_KF_SLEEPABLE = 1 << 1, /* from sleepable init operations */

+ /* ENQUEUE_DISPATCH may be nested inside CPU_RELEASE */
+ SCX_KF_CPU_RELEASE = 1 << 2, /* from ops.cpu_release() */
+
SCX_KF_ENQUEUE_DISPATCH = 1 << 3, /* from ops.enqueue() or .dispatch() */
SCX_KF_DISPATCH = 1 << 4, /* from ops.dispatch() */
+ __SCX_KF_TERMINAL = SCX_KF_ENQUEUE_DISPATCH | SCX_KF_DISPATCH,
};

/*
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index d1de6a44c4f5..072082968f0f 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -83,6 +83,7 @@ static bool scx_warned_zero_slice;

static DEFINE_STATIC_KEY_FALSE(scx_ops_enq_last);
static DEFINE_STATIC_KEY_FALSE(scx_ops_enq_exiting);
+DEFINE_STATIC_KEY_FALSE(scx_ops_cpu_preempt);
static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_enabled);

struct static_key_false scx_has_op[SCX_NR_ONLINE_OPS] =
@@ -193,9 +194,14 @@ static void scx_kf_allow(u32 mask)
{
u32 allowed_nesters = 0;

- /* INIT|SLEEPABLE can nest others but not themselves */
+ /*
+ * INIT|SLEEPABLE can nest others but not themselves. CPU_RELEASE can
+ * additionally nest ENQUEUE_DISPATCH.
+ */
if (!(mask & (SCX_KF_INIT | SCX_KF_SLEEPABLE)))
allowed_nesters |= SCX_KF_INIT | SCX_KF_SLEEPABLE;
+ if (mask & SCX_KF_ENQUEUE_DISPATCH)
+ allowed_nesters |= SCX_KF_CPU_RELEASE;

WARN_ONCE(current->scx.kf_mask & ~allowed_nesters,
"invalid nesting current->scx.kf_mask=0x%x mask=0x%x allowed_nesters=0x%x\n",
@@ -238,6 +244,12 @@ static bool scx_kf_allowed(u32 mask)
return false;
}

+ if (unlikely((mask & SCX_KF_CPU_RELEASE) &&
+ (current->scx.kf_mask & __SCX_KF_TERMINAL))) {
+ scx_ops_error("cpu_release kfunc called from terminal operations");
+ return false;
+ }
+
return true;
}

@@ -1276,6 +1288,19 @@ static int balance_scx(struct rq *rq, struct task_struct *prev,

lockdep_assert_rq_held(rq);

+ if (static_branch_unlikely(&scx_ops_cpu_preempt) &&
+ unlikely(rq->scx.cpu_released)) {
+ /*
+ * If the previous sched_class for the current CPU was not SCX,
+ * notify the BPF scheduler that it again has control of the
+ * core. This callback complements ->cpu_release(), which is
+ * emitted in scx_notify_pick_next_task().
+ */
+ if (SCX_HAS_OP(cpu_acquire))
+ scx_ops.cpu_acquire(cpu_of(rq), NULL);
+ rq->scx.cpu_released = false;
+ }
+
if (prev_on_scx) {
WARN_ON_ONCE(prev->scx.flags & SCX_TASK_BAL_KEEP);
update_curr_scx(rq);
@@ -1283,7 +1308,9 @@ static int balance_scx(struct rq *rq, struct task_struct *prev,
/*
* If @prev is runnable & has slice left, it has priority and
* fetching more just increases latency for the fetched tasks.
- * Tell put_prev_task_scx() to put @prev on local_dsq.
+ * Tell put_prev_task_scx() to put @prev on local_dsq. If the
+ * BPF scheduler wants to handle this explicitly, it should
+ * implement ->cpu_released().
*
* See scx_ops_disable_workfn() for the explanation on the
* disabling() test.
@@ -1489,6 +1516,59 @@ static struct task_struct *pick_next_task_scx(struct rq *rq)
return p;
}

+static enum scx_cpu_preempt_reason
+preempt_reason_from_class(const struct sched_class *class)
+{
+#ifdef CONFIG_SMP
+ if (class == &stop_sched_class)
+ return SCX_CPU_PREEMPT_STOP;
+#endif
+ if (class == &dl_sched_class)
+ return SCX_CPU_PREEMPT_DL;
+ if (class == &rt_sched_class)
+ return SCX_CPU_PREEMPT_RT;
+ return SCX_CPU_PREEMPT_UNKNOWN;
+}
+
+void __scx_notify_pick_next_task(struct rq *rq,
+ const struct task_struct *task,
+ const struct sched_class *active)
+{
+ lockdep_assert_rq_held(rq);
+
+ /*
+ * The callback is conceptually meant to convey that the CPU is no
+ * longer under the control of SCX. Therefore, don't invoke the
+ * callback if the CPU is is staying on SCX, or going idle (in which
+ * case the SCX scheduler has actively decided not to schedule any
+ * tasks on the CPU).
+ */
+ if (likely(active >= &ext_sched_class))
+ return;
+
+ /*
+ * At this point we know that SCX was preempted by a higher priority
+ * sched_class, so invoke the ->cpu_release() callback if we have not
+ * done so already. We only send the callback once between SCX being
+ * preempted, and it regaining control of the CPU.
+ *
+ * ->cpu_release() complements ->cpu_acquire(), which is emitted the
+ * next time that balance_scx() is invoked.
+ */
+ if (!rq->scx.cpu_released) {
+ if (SCX_HAS_OP(cpu_release)) {
+ struct scx_cpu_release_args args = {
+ .reason = preempt_reason_from_class(active),
+ .task = task,
+ };
+
+ SCX_CALL_OP(SCX_KF_CPU_RELEASE,
+ cpu_release, cpu_of(rq), &args);
+ }
+ rq->scx.cpu_released = true;
+ }
+}
+
#ifdef CONFIG_SMP

static bool test_and_clear_cpu_idle(int cpu)
@@ -2537,6 +2617,7 @@ static void scx_ops_disable_workfn(struct kthread_work *work)
static_branch_disable_cpuslocked(&scx_has_op[i]);
static_branch_disable_cpuslocked(&scx_ops_enq_last);
static_branch_disable_cpuslocked(&scx_ops_enq_exiting);
+ static_branch_disable_cpuslocked(&scx_ops_cpu_preempt);
static_branch_disable_cpuslocked(&scx_builtin_idle_enabled);
synchronize_rcu();

@@ -2743,6 +2824,8 @@ static int scx_ops_enable(struct sched_ext_ops *ops)

if (ops->flags & SCX_OPS_ENQ_EXITING)
static_branch_enable_cpuslocked(&scx_ops_enq_exiting);
+ if (scx_ops.cpu_acquire || scx_ops.cpu_release)
+ static_branch_enable_cpuslocked(&scx_ops_cpu_preempt);

if (!ops->update_idle || (ops->flags & SCX_OPS_KEEP_BUILTIN_IDLE)) {
reset_idle_masks();
@@ -3396,6 +3479,56 @@ static const struct btf_kfunc_id_set scx_kfunc_set_dispatch = {
.set = &scx_kfunc_ids_dispatch,
};

+/**
+ * scx_bpf_reenqueue_local - Re-enqueue tasks on a local DSQ
+ *
+ * Iterate over all of the tasks currently enqueued on the local DSQ of the
+ * caller's CPU, and re-enqueue them in the BPF scheduler. Returns the number of
+ * processed tasks. Can only be called from ops.cpu_release().
+ */
+u32 scx_bpf_reenqueue_local(void)
+{
+ u32 nr_enqueued, i;
+ struct rq *rq;
+ struct scx_rq *scx_rq;
+
+ if (!scx_kf_allowed(SCX_KF_CPU_RELEASE))
+ return 0;
+
+ rq = cpu_rq(smp_processor_id());
+ lockdep_assert_rq_held(rq);
+ scx_rq = &rq->scx;
+
+ /*
+ * Get the number of tasks on the local DSQ before iterating over it to
+ * pull off tasks. The enqueue callback below can signal that it wants
+ * the task to stay on the local DSQ, and we want to prevent the BPF
+ * scheduler from causing us to loop indefinitely.
+ */
+ nr_enqueued = scx_rq->local_dsq.nr;
+ for (i = 0; i < nr_enqueued; i++) {
+ struct task_struct *p;
+
+ p = first_local_task(rq);
+ WARN_ON_ONCE(atomic64_read(&p->scx.ops_state) != SCX_OPSS_NONE);
+ WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_QUEUED));
+ WARN_ON_ONCE(p->scx.holding_cpu != -1);
+ dispatch_dequeue(scx_rq, p);
+ do_enqueue_task(rq, p, SCX_ENQ_REENQ, -1);
+ }
+
+ return nr_enqueued;
+}
+
+BTF_SET8_START(scx_kfunc_ids_cpu_release)
+BTF_ID_FLAGS(func, scx_bpf_reenqueue_local)
+BTF_SET8_END(scx_kfunc_ids_cpu_release)
+
+static const struct btf_kfunc_id_set scx_kfunc_set_cpu_release = {
+ .owner = THIS_MODULE,
+ .set = &scx_kfunc_ids_cpu_release,
+};
+
/**
* scx_bpf_kick_cpu - Trigger reschedule on a CPU
* @cpu: cpu to kick
@@ -3698,6 +3831,8 @@ static int __init register_ext_kfuncs(void)
&scx_kfunc_set_enqueue_dispatch)) ||
(ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
&scx_kfunc_set_dispatch)) ||
+ (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
+ &scx_kfunc_set_cpu_release)) ||
(ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
&scx_kfunc_set_any))) {
pr_err("sched_ext: failed to register kfunc sets (%d)\n", ret);
diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h
index 39eb1b25ec99..099e17e92228 100644
--- a/kernel/sched/ext.h
+++ b/kernel/sched/ext.h
@@ -27,6 +27,17 @@ enum scx_enq_flags {
*/
SCX_ENQ_PREEMPT = 1LLU << 32,

+ /*
+ * The task being enqueued was previously enqueued on the current CPU's
+ * %SCX_DSQ_LOCAL, but was removed from it in a call to the
+ * bpf_scx_reenqueue_local() kfunc. If bpf_scx_reenqueue_local() was
+ * invoked in a ->cpu_release() callback, and the task is again
+ * dispatched back to %SCX_LOCAL_DSQ by this current ->enqueue(), the
+ * task will not be scheduled on the CPU until at least the next invocation
+ * of the ->cpu_acquire() callback.
+ */
+ SCX_ENQ_REENQ = 1LLU << 40,
+
/*
* The task being enqueued is the only task available for the cpu. By
* default, ext core keeps executing such tasks but when
@@ -82,6 +93,8 @@ DECLARE_STATIC_KEY_FALSE(__scx_switched_all);
#define scx_enabled() static_branch_unlikely(&__scx_ops_enabled)
#define scx_switched_all() static_branch_unlikely(&__scx_switched_all)

+DECLARE_STATIC_KEY_FALSE(scx_ops_cpu_preempt);
+
bool task_on_scx(struct task_struct *p);
void scx_pre_fork(struct task_struct *p);
int scx_fork(struct task_struct *p);
@@ -96,13 +109,17 @@ __printf(2, 3) void scx_ops_error_type(enum scx_exit_type type,
#define scx_ops_error(fmt, args...) \
scx_ops_error_type(SCX_EXIT_ERROR, fmt, ##args)

+void __scx_notify_pick_next_task(struct rq *rq,
+ const struct task_struct *p,
+ const struct sched_class *active);
+
static inline void scx_notify_pick_next_task(struct rq *rq,
const struct task_struct *p,
const struct sched_class *active)
{
-#ifdef CONFIG_SMP
if (!scx_enabled())
return;
+#ifdef CONFIG_SMP
/*
* Pairs with the smp_load_acquire() issued by a CPU in
* kick_cpus_irq_workfn() who is waiting for this CPU to perform a
@@ -110,6 +127,9 @@ static inline void scx_notify_pick_next_task(struct rq *rq,
*/
smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
#endif
+ if (!static_branch_unlikely(&scx_ops_cpu_preempt))
+ return;
+ __scx_notify_pick_next_task(rq, p, active);
}

static inline void scx_notify_sched_tick(void)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d31185ecd090..578b88f1dfac 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -706,6 +706,7 @@ struct scx_rq {
u64 ops_qseq;
u32 nr_running;
u32 flags;
+ bool cpu_released;
cpumask_var_t cpus_to_kick;
cpumask_var_t cpus_to_preempt;
cpumask_var_t cpus_to_wait;
diff --git a/tools/sched_ext/scx_common.bpf.h b/tools/sched_ext/scx_common.bpf.h
index ff32e4dd30a6..7d01db7d5e9f 100644
--- a/tools/sched_ext/scx_common.bpf.h
+++ b/tools/sched_ext/scx_common.bpf.h
@@ -81,6 +81,7 @@ void scx_bpf_put_idle_cpumask(const struct cpumask *cpumask) __ksym;
void scx_bpf_destroy_dsq(u64 dsq_id) __ksym;
bool scx_bpf_task_running(const struct task_struct *p) __ksym;
s32 scx_bpf_task_cpu(const struct task_struct *p) __ksym;
+u32 scx_bpf_reenqueue_local(void) __ksym;

#define PF_KTHREAD 0x00200000 /* I am a kernel thread */
#define PF_EXITING 0x00000004
diff --git a/tools/sched_ext/scx_example_pair.bpf.c b/tools/sched_ext/scx_example_pair.bpf.c
index 8e277225b044..6d9f5b97cbeb 100644
--- a/tools/sched_ext/scx_example_pair.bpf.c
+++ b/tools/sched_ext/scx_example_pair.bpf.c
@@ -85,6 +85,28 @@
* be resolved in the near future which should allow greatly simplifying this
* scheduler.
*
+ * Dealing with preemption
+ * -----------------------
+ *
+ * SCX is the lowest priority sched_class, and could be preempted by them at
+ * any time. To address this, the scheduler implements pair_cpu_release() and
+ * pair_cpu_acquire() callbacks which are invoked by the core scheduler when
+ * the scheduler loses and gains control of the CPU respectively.
+ *
+ * In pair_cpu_release(), we mark the pair_ctx as having been preempted, and
+ * then invoke:
+ *
+ * scx_bpf_kick_cpu(pair_cpu, SCX_KICK_PREEMPT | SCX_KICK_WAIT);
+ *
+ * This preempts the pair CPU, and waits until it has re-entered the scheduler
+ * before returning. This is necessary to ensure that the higher priority
+ * sched_class that preempted our scheduler does not schedule a task
+ * concurrently with our pair CPU.
+ *
+ * When the CPU is re-acquired in pair_cpu_acquire(), we unmark the preemption
+ * in the pair_ctx, and send another resched IPI to the pair CPU to re-enable
+ * pair scheduling.
+ *
* Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
* Copyright (c) 2022 Tejun Heo <[email protected]>
* Copyright (c) 2022 David Vernet <[email protected]>
@@ -124,6 +146,12 @@ struct pair_ctx {

/* the CPUs that are currently active on the cgroup */
u32 active_mask;
+
+ /*
+ * the CPUs that are currently preempted and running tasks in a
+ * different scheduler.
+ */
+ u32 preempted_mask;
};

struct {
@@ -340,7 +368,7 @@ static int dispatch_loopfn(u32 idx, void *data)
struct task_struct *p;
u64 now = bpf_ktime_get_ns();
bool kick_pair = false;
- bool expired;
+ bool expired, pair_preempted;
u32 *vptr, in_pair_mask;
s32 pid;
u64 cgid;
@@ -369,10 +397,14 @@ static int dispatch_loopfn(u32 idx, void *data)
*/
pairc->draining = true;

- if (pairc->active_mask) {
+ pair_preempted = pairc->preempted_mask;
+ if (pairc->active_mask || pair_preempted) {
/*
- * The other CPU is still active We want to wait until
- * this cgroup expires.
+ * The other CPU is still active, or is no longer under
+ * our control due to e.g. being preempted by a higher
+ * priority sched_class. We want to wait until this
+ * cgroup expires, or until control of our pair CPU has
+ * been returned to us.
*
* If the pair controls its CPU, and the time already
* expired, kick. When the other CPU arrives at
@@ -381,7 +413,7 @@ static int dispatch_loopfn(u32 idx, void *data)
*/
__sync_fetch_and_add(&nr_exp_waits, 1);
bpf_spin_unlock(&pairc->lock);
- if (expired)
+ if (expired && !pair_preempted)
kick_pair = true;
goto out_maybe_kick;
}
@@ -486,6 +518,63 @@ void BPF_STRUCT_OPS(pair_dispatch, s32 cpu, struct task_struct *prev)
bpf_loop(1 << 23, dispatch_loopfn, &cpu_on_stack, 0);
}

+void BPF_STRUCT_OPS(pair_cpu_acquire, s32 cpu, struct scx_cpu_acquire_args *args)
+{
+ int ret;
+ u32 in_pair_mask;
+ struct pair_ctx *pairc;
+ bool kick_pair;
+
+ ret = lookup_pairc_and_mask(cpu, &pairc, &in_pair_mask);
+ if (ret)
+ return;
+
+ bpf_spin_lock(&pairc->lock);
+ pairc->preempted_mask &= ~in_pair_mask;
+ /* Kick the pair CPU, unless it was also preempted. */
+ kick_pair = !pairc->preempted_mask;
+ bpf_spin_unlock(&pairc->lock);
+
+ if (kick_pair) {
+ s32 *pair = (s32 *)MEMBER_VPTR(pair_cpu, [cpu]);
+
+ if (pair) {
+ __sync_fetch_and_add(&nr_kicks, 1);
+ scx_bpf_kick_cpu(*pair, SCX_KICK_PREEMPT);
+ }
+ }
+}
+
+void BPF_STRUCT_OPS(pair_cpu_release, s32 cpu, struct scx_cpu_release_args *args)
+{
+ int ret;
+ u32 in_pair_mask;
+ struct pair_ctx *pairc;
+ bool kick_pair;
+
+ ret = lookup_pairc_and_mask(cpu, &pairc, &in_pair_mask);
+ if (ret)
+ return;
+
+ bpf_spin_lock(&pairc->lock);
+ pairc->preempted_mask |= in_pair_mask;
+ pairc->active_mask &= ~in_pair_mask;
+ /* Kick the pair CPU if it's still running. */
+ kick_pair = pairc->active_mask;
+ pairc->draining = true;
+ bpf_spin_unlock(&pairc->lock);
+
+ if (kick_pair) {
+ s32 *pair = (s32 *)MEMBER_VPTR(pair_cpu, [cpu]);
+
+ if (pair) {
+ __sync_fetch_and_add(&nr_kicks, 1);
+ scx_bpf_kick_cpu(*pair, SCX_KICK_PREEMPT | SCX_KICK_WAIT);
+ }
+ }
+ __sync_fetch_and_add(&nr_preemptions, 1);
+}
+
static int alloc_cgrp_q_idx_loopfn(u32 idx, void *data)
{
u32 q_idx;
@@ -547,6 +636,8 @@ SEC(".struct_ops")
struct sched_ext_ops pair_ops = {
.enqueue = (void *)pair_enqueue,
.dispatch = (void *)pair_dispatch,
+ .cpu_acquire = (void *)pair_cpu_acquire,
+ .cpu_release = (void *)pair_cpu_release,
.cgroup_init = (void *)pair_cgroup_init,
.cgroup_exit = (void *)pair_cgroup_exit,
.init = (void *)pair_init,
diff --git a/tools/sched_ext/scx_example_qmap.bpf.c b/tools/sched_ext/scx_example_qmap.bpf.c
index e968a9b341a4..7e670986542b 100644
--- a/tools/sched_ext/scx_example_qmap.bpf.c
+++ b/tools/sched_ext/scx_example_qmap.bpf.c
@@ -11,6 +11,8 @@
*
* - BPF-side queueing using PIDs.
* - Sleepable per-task storage allocation using ops.prep_enable().
+ * - Using ops.cpu_release() to handle a higher priority scheduling class taking
+ * the CPU away.
*
* Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
* Copyright (c) 2022 Tejun Heo <[email protected]>
@@ -78,7 +80,7 @@ struct {
} dispatch_idx_cnt SEC(".maps");

/* Statistics */
-unsigned long nr_enqueued, nr_dispatched, nr_dequeued;
+unsigned long nr_enqueued, nr_dispatched, nr_reenqueued, nr_dequeued;

s32 BPF_STRUCT_OPS(qmap_select_cpu, struct task_struct *p,
s32 prev_cpu, u64 wake_flags)
@@ -152,6 +154,22 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
return;
}

+ /*
+ * If the task was re-enqueued due to the CPU being preempted by a
+ * higher priority scheduling class, just re-enqueue the task directly
+ * on the global DSQ. As we want another CPU to pick it up, find and
+ * kick an idle CPU.
+ */
+ if (enq_flags & SCX_ENQ_REENQ) {
+ s32 cpu;
+
+ scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, 0, enq_flags);
+ cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr);
+ if (cpu >= 0)
+ scx_bpf_kick_cpu(cpu, 0);
+ return;
+ }
+
ring = bpf_map_lookup_elem(&queue_arr, &idx);
if (!ring) {
scx_bpf_error("failed to find ring %d", idx);
@@ -237,6 +255,22 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
}
}

+void BPF_STRUCT_OPS(qmap_cpu_release, s32 cpu, struct scx_cpu_release_args *args)
+{
+ u32 cnt;
+
+ /*
+ * Called when @cpu is taken by a higher priority scheduling class. This
+ * makes @cpu no longer available for executing sched_ext tasks. As we
+ * don't want the tasks in @cpu's local dsq to sit there until @cpu
+ * becomes available again, re-enqueue them into the global dsq. See
+ * %SCX_ENQ_REENQ handling in qmap_enqueue().
+ */
+ cnt = scx_bpf_reenqueue_local();
+ if (cnt)
+ __sync_fetch_and_add(&nr_reenqueued, cnt);
+}
+
s32 BPF_STRUCT_OPS(qmap_prep_enable, struct task_struct *p,
struct scx_enable_args *args)
{
@@ -272,6 +306,7 @@ struct sched_ext_ops qmap_ops = {
.enqueue = (void *)qmap_enqueue,
.dequeue = (void *)qmap_dequeue,
.dispatch = (void *)qmap_dispatch,
+ .cpu_release = (void *)qmap_cpu_release,
.prep_enable = (void *)qmap_prep_enable,
.init = (void *)qmap_init,
.exit = (void *)qmap_exit,
diff --git a/tools/sched_ext/scx_example_qmap.c b/tools/sched_ext/scx_example_qmap.c
index 820fe50bf43c..de6f03ccb233 100644
--- a/tools/sched_ext/scx_example_qmap.c
+++ b/tools/sched_ext/scx_example_qmap.c
@@ -91,9 +91,9 @@ int main(int argc, char **argv)
long nr_enqueued = skel->bss->nr_enqueued;
long nr_dispatched = skel->bss->nr_dispatched;

- printf("enq=%lu, dsp=%lu, delta=%ld, deq=%lu\n",
+ printf("enq=%lu, dsp=%lu, delta=%ld, reenq=%lu, deq=%lu\n",
nr_enqueued, nr_dispatched, nr_enqueued - nr_dispatched,
- skel->bss->nr_dequeued);
+ skel->bss->nr_reenqueued, skel->bss->nr_dequeued);
fflush(stdout);
sleep(1);
}
--
2.39.1


2023-01-28 00:20:10

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 26/30] sched_ext: Implement sched_ext_ops.cpu_online/offline()

Add ops.cpu_online/offline() which are invoked when CPUs come online and
offline respectively. As the enqueue path already automatically bypasses
tasks to the local dsq on a deactivated CPU, BPF schedulers are guaranteed
to see tasks only on CPUs which are between online() and offline().

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
---
include/linux/sched/ext.h | 18 ++++++++++++++++++
kernel/sched/ext.c | 15 +++++++++++++++
2 files changed, 33 insertions(+)

diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index 82ead36d1136..01c846445243 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -368,6 +368,24 @@ struct sched_ext_ops {
*/
void (*cpu_release)(s32 cpu, struct scx_cpu_release_args *args);

+ /**
+ * cpu_online - A CPU became online
+ * @cpu: CPU which just came up
+ *
+ * @cpu just came online. @cpu doesn't call ops.enqueue() or run tasks
+ * associated with other CPUs beforehand.
+ */
+ void (*cpu_online)(s32 cpu);
+
+ /**
+ * cpu_offline - A CPU is going offline
+ * @cpu: CPU which is going offline
+ *
+ * @cpu is going offline. @cpu doesn't call ops.enqueue() or run tasks
+ * associated with other CPUs afterwards.
+ */
+ void (*cpu_offline)(s32 cpu);
+
/**
* prep_enable - Prepare to enable BPF scheduling for a task
* @p: task to prepare BPF scheduling for
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 072082968f0f..e981b7111e0a 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1727,6 +1727,18 @@ void __scx_update_idle(struct rq *rq, bool idle)
}
}

+static void rq_online_scx(struct rq *rq, enum rq_onoff_reason reason)
+{
+ if (SCX_HAS_OP(cpu_online) && reason == RQ_ONOFF_HOTPLUG)
+ scx_ops.cpu_online(cpu_of(rq));
+}
+
+static void rq_offline_scx(struct rq *rq, enum rq_onoff_reason reason)
+{
+ if (SCX_HAS_OP(cpu_offline) && reason == RQ_ONOFF_HOTPLUG)
+ scx_ops.cpu_offline(cpu_of(rq));
+}
+
#else /* !CONFIG_SMP */

static bool test_and_clear_cpu_idle(int cpu) { return false; }
@@ -2215,6 +2227,9 @@ DEFINE_SCHED_CLASS(ext) = {
.balance = balance_scx,
.select_task_rq = select_task_rq_scx,
.set_cpus_allowed = set_cpus_allowed_scx,
+
+ .rq_online = rq_online_scx,
+ .rq_offline = rq_offline_scx,
#endif

.task_tick = task_tick_scx,
--
2.39.1


2023-01-28 00:20:15

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 27/30] sched_ext: Implement core-sched support

The core-sched support is composed of the following parts:

* task_struct->scx.core_sched_at is added. This is a timestamp which can be
used to order tasks. Depending on whether the BPF scheduler implements
custom ordering, it tracks either global FIFO ordering of all tasks or
local-DSQ ordering within the dispatched tasks on a CPU.

* prio_less() is updated to call scx_prio_less() when comparing SCX tasks.
scx_prio_less() calls ops.core_sched_before() if available or uses the
core_sched_at timestamp. For global FIFO ordering, the BPF scheduler
doesn't need to do anything. Otherwise, it should implement
ops.core_sched_before() which reflects the ordering.

* When core-sched is enabled, balance_scx() balances all SMT siblings so
that they all have tasks dispatched if necessary before pick_task_scx() is
called. pick_task_scx() picks between the current task and the first
dispatched task on the local DSQ based on availability and the
core_sched_at timestamps. Note that FIFO ordering is expected among the
already dispatched tasks whether running or on the local DSQ, so this path
always compares core_sched_at instead of calling into
ops.core_sched_before().

qmap_core_sched_before() is added to scx_example_qmap. It scales the
distances from the heads of the queues to compare the tasks across different
priority queues and seems to behave as expected.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
---
include/linux/sched/ext.h | 21 +++
kernel/Kconfig.preempt | 2 +-
kernel/sched/core.c | 12 +-
kernel/sched/ext.c | 196 +++++++++++++++++++++++--
kernel/sched/ext.h | 12 ++
tools/sched_ext/scx_example_qmap.bpf.c | 87 ++++++++++-
tools/sched_ext/scx_example_qmap.c | 5 +-
7 files changed, 319 insertions(+), 16 deletions(-)

diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index 01c846445243..d3c2701bb4b4 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -315,6 +315,24 @@ struct sched_ext_ops {
*/
bool (*yield)(struct task_struct *from, struct task_struct *to);

+ /**
+ * core_sched_before - Task ordering for core-sched
+ * @a: task A
+ * @b: task B
+ *
+ * Used by core-sched to determine the ordering between two tasks. See
+ * Documentation/admin-guide/hw-vuln/core-scheduling.rst for details on
+ * core-sched.
+ *
+ * Both @a and @b are runnable and may or may not currently be queued on
+ * the BPF scheduler. Should return %true if @a should run before @b.
+ * %false if there's no required ordering or @b should run before @a.
+ *
+ * If not specified, the default is ordering them according to when they
+ * became runnable.
+ */
+ bool (*core_sched_before)(struct task_struct *a, struct task_struct *b);
+
/**
* set_cpumask - Set CPU affinity
* @p: task to set CPU affinity for
@@ -611,6 +629,9 @@ struct sched_ext_entity {
u32 kf_mask; /* see scx_kf_mask above */
atomic64_t ops_state;
unsigned long runnable_at;
+#ifdef CONFIG_SCHED_CORE
+ u64 core_sched_at; /* see scx_prio_less() */
+#endif

/* BPF scheduler modifiable fields */

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 0afcda19bc50..e12a057ead7b 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -135,7 +135,7 @@ config SCHED_CORE

config SCHED_CLASS_EXT
bool "Extensible Scheduling Class"
- depends on BPF_SYSCALL && BPF_JIT && !SCHED_CORE
+ depends on BPF_SYSCALL && BPF_JIT
help
This option enables a new scheduler class sched_ext (SCX), which
allows scheduling policies to be implemented as BPF programs to
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 47334e428031..a40b74a2fdbd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -163,7 +163,12 @@ static inline int __task_prio(struct task_struct *p)
if (p->sched_class == &idle_sched_class)
return MAX_RT_PRIO + NICE_WIDTH; /* 140 */

- return MAX_RT_PRIO + MAX_NICE; /* 120, squash fair */
+#ifdef CONFIG_SCHED_CLASS_EXT
+ if (p->sched_class == &ext_sched_class)
+ return MAX_RT_PRIO + MAX_NICE + 1; /* 120, squash ext */
+#endif
+
+ return MAX_RT_PRIO + MAX_NICE; /* 119, squash fair */
}

/*
@@ -191,6 +196,11 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b, bool
if (pa == MAX_RT_PRIO + MAX_NICE) /* fair */
return cfs_prio_less(a, b, in_fi);

+#ifdef CONFIG_SCHED_CLASS_EXT
+ if (pa == MAX_RT_PRIO + MAX_NICE + 1) /* ext */
+ return scx_prio_less(a, b, in_fi);
+#endif
+
return false;
}

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index e981b7111e0a..8619eb2dcbd5 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -447,6 +447,44 @@ static int ops_sanitize_err(const char *ops_name, s32 err)
return -EPROTO;
}

+/**
+ * touch_core_sched - Update timestamp used for core-sched task ordering
+ * @rq: rq to read clock from, must be locked
+ * @p: task to update the timestamp for
+ *
+ * Update @p->scx.core_sched_at timestamp. This is used by scx_prio_less() to
+ * implement global or local-DSQ FIFO ordering for core-sched. Should be called
+ * when a task becomes runnable and its turn on the CPU ends (e.g. slice
+ * exhaustion).
+ */
+static void touch_core_sched(struct rq *rq, struct task_struct *p)
+{
+#ifdef CONFIG_SCHED_CORE
+ p->scx.core_sched_at = rq_clock_task(rq);
+#endif
+}
+
+/**
+ * touch_core_sched_dispatch - Update core-sched timestamp on dispatch
+ * @rq: rq to read clock from, must be locked
+ * @p: task being dispatched
+ *
+ * If the BPF scheduler implements custom core-sched ordering via
+ * ops.core_sched_before(), @p->scx.core_sched_at is used to implement FIFO
+ * ordering within each local DSQ. This function is called from dispatch paths
+ * and updates @p->scx.core_sched_at if custom core-sched ordering is in effect.
+ */
+static void touch_core_sched_dispatch(struct rq *rq, struct task_struct *p)
+{
+ lockdep_assert_rq_held(rq);
+ assert_clock_updated(rq);
+
+#ifdef CONFIG_SCHED_CORE
+ if (SCX_HAS_OP(core_sched_before))
+ touch_core_sched(rq, p);
+#endif
+}
+
static void update_curr_scx(struct rq *rq)
{
struct task_struct *curr = rq->curr;
@@ -462,8 +500,11 @@ static void update_curr_scx(struct rq *rq)
account_group_exec_runtime(curr, delta_exec);
cgroup_account_cputime(curr, delta_exec);

- if (curr->scx.slice != SCX_SLICE_INF)
+ if (curr->scx.slice != SCX_SLICE_INF) {
curr->scx.slice -= min(curr->scx.slice, delta_exec);
+ if (!curr->scx.slice)
+ touch_core_sched(rq, curr);
+ }
}

static void dispatch_enqueue(struct scx_dispatch_q *dsq, struct task_struct *p,
@@ -619,6 +660,8 @@ static void direct_dispatch(struct task_struct *ddsp_task, struct task_struct *p
return;
}

+ touch_core_sched_dispatch(task_rq(p), p);
+
dsq = find_dsq_for_dispatch(task_rq(p), dsq_id, p);
dispatch_enqueue(dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);

@@ -702,12 +745,19 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
return;

local:
+ /*
+ * For task-ordering, slice refill must be treated as implying the end
+ * of the current slice. Otherwise, the longer @p stays on the CPU, the
+ * higher priority it becomes from scx_prio_less()'s POV.
+ */
+ touch_core_sched(rq, p);
p->scx.slice = SCX_SLICE_DFL;
local_norefill:
dispatch_enqueue(&rq->scx.local_dsq, p, enq_flags);
return;

global:
+ touch_core_sched(rq, p); /* see the comment in local: */
p->scx.slice = SCX_SLICE_DFL;
dispatch_enqueue(&scx_dsq_global, p, enq_flags);
}
@@ -762,6 +812,9 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags
if (SCX_HAS_OP(runnable))
scx_ops.runnable(p, enq_flags);

+ if (enq_flags & SCX_ENQ_WAKEUP)
+ touch_core_sched(rq, p);
+
do_enqueue_task(rq, p, enq_flags, sticky_cpu);
}

@@ -1201,6 +1254,7 @@ static void finish_dispatch(struct rq *rq, struct rq_flags *rf,
struct scx_dispatch_q *dsq;
u64 opss;

+ touch_core_sched_dispatch(rq, p);
retry:
/*
* No need for _acquire here. @p is accessed only after a successful
@@ -1278,8 +1332,8 @@ static void flush_dispatch_buf(struct rq *rq, struct rq_flags *rf)
dspc->buf_cursor = 0;
}

-static int balance_scx(struct rq *rq, struct task_struct *prev,
- struct rq_flags *rf)
+static int balance_one(struct rq *rq, struct task_struct *prev,
+ struct rq_flags *rf, bool local)
{
struct scx_rq *scx_rq = &rq->scx;
struct scx_dsp_ctx *dspc = this_cpu_ptr(&scx_dsp_ctx);
@@ -1302,7 +1356,7 @@ static int balance_scx(struct rq *rq, struct task_struct *prev,
}

if (prev_on_scx) {
- WARN_ON_ONCE(prev->scx.flags & SCX_TASK_BAL_KEEP);
+ WARN_ON_ONCE(local && (prev->scx.flags & SCX_TASK_BAL_KEEP));
update_curr_scx(rq);

/*
@@ -1314,10 +1368,16 @@ static int balance_scx(struct rq *rq, struct task_struct *prev,
*
* See scx_ops_disable_workfn() for the explanation on the
* disabling() test.
+ *
+ * When balancing a remote CPU for core-sched, there won't be a
+ * following put_prev_task_scx() call and we don't own
+ * %SCX_TASK_BAL_KEEP. Instead, pick_task_scx() will test the
+ * same conditions later and pick @rq->curr accordingly.
*/
if ((prev->scx.flags & SCX_TASK_QUEUED) &&
prev->scx.slice && !scx_ops_disabling()) {
- prev->scx.flags |= SCX_TASK_BAL_KEEP;
+ if (local)
+ prev->scx.flags |= SCX_TASK_BAL_KEEP;
return 1;
}
}
@@ -1373,10 +1433,55 @@ static int balance_scx(struct rq *rq, struct task_struct *prev,
return 0;
}

+static int balance_scx(struct rq *rq, struct task_struct *prev,
+ struct rq_flags *rf)
+{
+ int ret;
+
+ ret = balance_one(rq, prev, rf, true);
+
+ /*
+ * When core-sched is enabled, this ops.balance() call will be followed
+ * by put_prev_scx() and pick_task_scx() on this CPU and pick_task_scx()
+ * on the SMT siblings. Balance the siblings too.
+ */
+ if (sched_core_enabled(rq)) {
+ const struct cpumask *smt_mask = cpu_smt_mask(cpu_of(rq));
+ int scpu;
+
+ for_each_cpu_andnot(scpu, smt_mask, cpumask_of(cpu_of(rq))) {
+ struct rq *srq = cpu_rq(scpu);
+ struct rq_flags srf;
+ struct task_struct *sprev = srq->curr;
+
+ /*
+ * While core-scheduling, rq lock is shared among
+ * siblings but the debug annotations and rq clock
+ * aren't. Do pinning dance to transfer the ownership.
+ */
+ WARN_ON_ONCE(__rq_lockp(rq) != __rq_lockp(srq));
+ rq_unpin_lock(rq, rf);
+ rq_pin_lock(srq, &srf);
+
+ update_rq_clock(srq);
+ balance_one(srq, sprev, &srf, false);
+
+ rq_unpin_lock(srq, &srf);
+ rq_repin_lock(rq, rf);
+ }
+ }
+
+ return ret;
+}
+
static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first)
{
if (p->scx.flags & SCX_TASK_QUEUED) {
- WARN_ON_ONCE(atomic64_read(&p->scx.ops_state) != SCX_OPSS_NONE);
+ /*
+ * Core-sched might decide to execute @p before it is
+ * dispatched. Call ops_dequeue() to notify the BPF scheduler.
+ */
+ ops_dequeue(p, SCX_DEQ_CORE_SCHED_EXEC);
dispatch_dequeue(&rq->scx, p);
}

@@ -1516,6 +1621,69 @@ static struct task_struct *pick_next_task_scx(struct rq *rq)
return p;
}

+#ifdef CONFIG_SCHED_CORE
+/**
+ * scx_prio_less - Task ordering for core-sched
+ * @a: task A
+ * @b: task B
+ *
+ * Core-sched is implemented as an additional scheduling layer on top of the
+ * usual sched_class'es and needs to find out the expected task ordering. For
+ * SCX, core-sched calls this function to interrogate the task ordering.
+ *
+ * Unless overridden by ops.core_sched_before(), @p->scx.core_sched_at is used
+ * to implement the default task ordering. The older the timestamp, the higher
+ * prority the task - the global FIFO ordering matching the default scheduling
+ * behavior.
+ *
+ * When ops.core_sched_before() is enabled, @p->scx.core_sched_at is used to
+ * implement FIFO ordering within each local DSQ. See pick_task_scx().
+ */
+bool scx_prio_less(struct task_struct *a, struct task_struct *b, bool in_fi)
+{
+ if (SCX_HAS_OP(core_sched_before) && !scx_ops_disabling())
+ return scx_ops.core_sched_before(a, b);
+ else
+ return time_after64(a->scx.core_sched_at, b->scx.core_sched_at);
+}
+
+/**
+ * pick_task_scx - Pick a candidate task for core-sched
+ * @rq: rq to pick the candidate task from
+ *
+ * Core-sched calls this function on each SMT sibling to determine the next
+ * tasks to run on the SMT siblings. balance_one() has been called on all
+ * siblings and put_prev_task_scx() has been called only for the current CPU.
+ *
+ * As put_prev_task_scx() hasn't been called on remote CPUs, we can't just look
+ * at the first task in the local dsq. @rq->curr has to be considered explicitly
+ * to mimic %SCX_TASK_BAL_KEEP.
+ */
+static struct task_struct *pick_task_scx(struct rq *rq)
+{
+ struct task_struct *curr = rq->curr;
+ struct task_struct *first = first_local_task(rq);
+
+ if (curr->scx.flags & SCX_TASK_QUEUED) {
+ /* is curr the only runnable task? */
+ if (!first)
+ return curr;
+
+ /*
+ * Does curr trump first? We can always go by core_sched_at for
+ * this comparison as it represents global FIFO ordering when
+ * the default core-sched ordering is in used and local-DSQ FIFO
+ * ordering otherwise.
+ */
+ if (curr->scx.slice && time_before64(curr->scx.core_sched_at,
+ first->scx.core_sched_at))
+ return curr;
+ }
+
+ return first; /* this may be %NULL */
+}
+#endif /* CONFIG_SCHED_CORE */
+
static enum scx_cpu_preempt_reason
preempt_reason_from_class(const struct sched_class *class)
{
@@ -1795,11 +1963,13 @@ static void task_tick_scx(struct rq *rq, struct task_struct *curr, int queued)
update_curr_scx(rq);

/*
- * While disabling, always resched as we can't trust the slice
- * management.
+ * While disabling, always resched and refresh core-sched timestamp as
+ * we can't trust the slice management or ops.core_sched_before().
*/
- if (scx_ops_disabling())
+ if (scx_ops_disabling()) {
curr->scx.slice = 0;
+ touch_core_sched(rq, curr);
+ }

if (!curr->scx.slice)
resched_curr(rq);
@@ -2232,6 +2402,10 @@ DEFINE_SCHED_CLASS(ext) = {
.rq_offline = rq_offline_scx,
#endif

+#ifdef CONFIG_SCHED_CORE
+ .pick_task = pick_task_scx,
+#endif
+
.task_tick = task_tick_scx,

.switching_to = switching_to_scx,
@@ -2560,9 +2734,11 @@ static void scx_ops_disable_workfn(struct kthread_work *work)
*
* b. balance_scx() never sets %SCX_TASK_BAL_KEEP as the slice value
* can't be trusted. Whenever a tick triggers, the running task is
- * rotated to the tail of the queue.
+ * rotated to the tail of the queue with core_sched_at touched.
*
* c. pick_next_task() suppresses zero slice warning.
+ *
+ * d. scx_prio_less() reverts to the default core_sched_at order.
*/
scx_ops.enqueue = scx_ops_fallback_enqueue;
scx_ops.dispatch = scx_ops_fallback_dispatch;
diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h
index 099e17e92228..c3df39984fc9 100644
--- a/kernel/sched/ext.h
+++ b/kernel/sched/ext.h
@@ -68,6 +68,14 @@ enum scx_enq_flags {
enum scx_deq_flags {
/* expose select DEQUEUE_* flags as enums */
SCX_DEQ_SLEEP = DEQUEUE_SLEEP,
+
+ /* high 32bits are SCX specific */
+
+ /*
+ * The generic core-sched layer decided to execute the task even though
+ * it hasn't been dispatched yet. Dequeue from the BPF side.
+ */
+ SCX_DEQ_CORE_SCHED_EXEC = 1LLU << 32,
};

enum scx_tg_flags {
@@ -173,6 +181,10 @@ static inline const struct sched_class *next_active_class(const struct sched_cla
for_active_class_range(class, (prev_class) > &ext_sched_class ? \
&ext_sched_class : (prev_class), (end_class))

+#ifdef CONFIG_SCHED_CORE
+bool scx_prio_less(struct task_struct *a, struct task_struct *b, bool in_fi);
+#endif
+
#else /* CONFIG_SCHED_CLASS_EXT */

#define scx_enabled() false
diff --git a/tools/sched_ext/scx_example_qmap.bpf.c b/tools/sched_ext/scx_example_qmap.bpf.c
index 7e670986542b..7d851fd987ac 100644
--- a/tools/sched_ext/scx_example_qmap.bpf.c
+++ b/tools/sched_ext/scx_example_qmap.bpf.c
@@ -13,6 +13,7 @@
* - Sleepable per-task storage allocation using ops.prep_enable().
* - Using ops.cpu_release() to handle a higher priority scheduling class taking
* the CPU away.
+ * - Core-sched support.
*
* Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
* Copyright (c) 2022 Tejun Heo <[email protected]>
@@ -59,9 +60,21 @@ struct {
},
};

+/*
+ * Per-queue sequence numbers to implement core-sched ordering.
+ *
+ * Tail seq is assigned to each queued task and incremented. Head seq tracks the
+ * sequence number of the latest dispatched task. The distance between the a
+ * task's seq and the associated queue's head seq is called the queue distance
+ * and used when comparing two tasks for ordering. See qmap_core_sched_before().
+ */
+static u64 core_sched_head_seqs[5];
+static u64 core_sched_tail_seqs[5];
+
/* Per-task scheduling context */
struct task_ctx {
bool force_local; /* Dispatch directly to local_dsq */
+ u64 core_sched_seq;
};

struct {
@@ -81,6 +94,7 @@ struct {

/* Statistics */
unsigned long nr_enqueued, nr_dispatched, nr_reenqueued, nr_dequeued;
+unsigned long nr_core_sched_execed;

s32 BPF_STRUCT_OPS(qmap_select_cpu, struct task_struct *p,
s32 prev_cpu, u64 wake_flags)
@@ -147,8 +161,18 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
return;
}

- /* Is select_cpu() is telling us to enqueue locally? */
- if (tctx->force_local) {
+ /*
+ * All enqueued tasks must have their core_sched_seq updated for correct
+ * core-sched ordering, which is why %SCX_OPS_ENQ_LAST is specified in
+ * qmap_ops.flags.
+ */
+ tctx->core_sched_seq = core_sched_tail_seqs[idx]++;
+
+ /*
+ * If qmap_select_cpu() is telling us to or this is the last runnable
+ * task on the CPU, enqueue locally.
+ */
+ if (tctx->force_local || (enq_flags & SCX_ENQ_LAST)) {
tctx->force_local = false;
scx_bpf_dispatch(p, SCX_DSQ_LOCAL, slice_ns, enq_flags);
return;
@@ -192,6 +216,19 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
void BPF_STRUCT_OPS(qmap_dequeue, struct task_struct *p, u64 deq_flags)
{
__sync_fetch_and_add(&nr_dequeued, 1);
+ if (deq_flags & SCX_DEQ_CORE_SCHED_EXEC)
+ __sync_fetch_and_add(&nr_core_sched_execed, 1);
+}
+
+static void update_core_sched_head_seq(struct task_struct *p)
+{
+ struct task_ctx *tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
+ int idx = weight_to_idx(p->scx.weight);
+
+ if (tctx)
+ core_sched_head_seqs[idx] = tctx->core_sched_seq;
+ else
+ scx_bpf_error("task_ctx lookup failed");
}

void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
@@ -244,6 +281,7 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)

p = bpf_task_from_pid(pid);
if (p) {
+ update_core_sched_head_seq(p);
__sync_fetch_and_add(&nr_dispatched, 1);
scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, slice_ns, 0);
bpf_task_release(p);
@@ -255,6 +293,49 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
}
}

+/*
+ * The distance from the head of the queue scaled by the weight of the queue.
+ * The lower the number, the older the task and the higher the priority.
+ */
+static s64 task_qdist(struct task_struct *p)
+{
+ int idx = weight_to_idx(p->scx.weight);
+ struct task_ctx *tctx;
+ s64 qdist;
+
+ tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
+ if (!tctx) {
+ scx_bpf_error("task_ctx lookup failed");
+ return 0;
+ }
+
+ qdist = tctx->core_sched_seq - core_sched_head_seqs[idx];
+
+ /*
+ * As queue index increments, the priority doubles. The queue w/ index 3
+ * is dispatched twice more frequently than 2. Reflect the difference by
+ * scaling qdists accordingly. Note that the shift amount needs to be
+ * flipped depending on the sign to avoid flipping priority direction.
+ */
+ if (qdist >= 0)
+ return qdist << (4 - idx);
+ else
+ return qdist << idx;
+}
+
+/*
+ * This is called to determine the task ordering when core-sched is picking
+ * tasks to execute on SMT siblings and should encode about the same ordering as
+ * the regular scheduling path. Use the priority-scaled distances from the head
+ * of the queues to compare the two tasks which should be consistent with the
+ * dispatch path behavior.
+ */
+bool BPF_STRUCT_OPS(qmap_core_sched_before,
+ struct task_struct *a, struct task_struct *b)
+{
+ return task_qdist(a) > task_qdist(b);
+}
+
void BPF_STRUCT_OPS(qmap_cpu_release, s32 cpu, struct scx_cpu_release_args *args)
{
u32 cnt;
@@ -306,10 +387,12 @@ struct sched_ext_ops qmap_ops = {
.enqueue = (void *)qmap_enqueue,
.dequeue = (void *)qmap_dequeue,
.dispatch = (void *)qmap_dispatch,
+ .core_sched_before = (void *)qmap_core_sched_before,
.cpu_release = (void *)qmap_cpu_release,
.prep_enable = (void *)qmap_prep_enable,
.init = (void *)qmap_init,
.exit = (void *)qmap_exit,
+ .flags = SCX_OPS_ENQ_LAST,
.timeout_ms = 5000U,
.name = "qmap",
};
diff --git a/tools/sched_ext/scx_example_qmap.c b/tools/sched_ext/scx_example_qmap.c
index de6f03ccb233..02fabe97ac9f 100644
--- a/tools/sched_ext/scx_example_qmap.c
+++ b/tools/sched_ext/scx_example_qmap.c
@@ -91,9 +91,10 @@ int main(int argc, char **argv)
long nr_enqueued = skel->bss->nr_enqueued;
long nr_dispatched = skel->bss->nr_dispatched;

- printf("enq=%lu, dsp=%lu, delta=%ld, reenq=%lu, deq=%lu\n",
+ printf("enq=%lu, dsp=%lu, delta=%ld, reenq=%lu, deq=%lu, core=%lu\n",
nr_enqueued, nr_dispatched, nr_enqueued - nr_dispatched,
- skel->bss->nr_reenqueued, skel->bss->nr_dequeued);
+ skel->bss->nr_reenqueued, skel->bss->nr_dequeued,
+ skel->bss->nr_core_sched_execed);
fflush(stdout);
sleep(1);
}
--
2.39.1


2023-01-28 00:20:25

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 28/30] sched_ext: Documentation: scheduler: Document extensible scheduler class

Add Documentation/scheduler/sched-ext.rst which gives a high-level overview
and pointers to the examples.

v2: Apply minor edits suggested by Bagas. Caveats section dropped as all of
them are addressed.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
Cc: Bagas Sanjaya <[email protected]>
---
Documentation/scheduler/index.rst | 1 +
Documentation/scheduler/sched-ext.rst | 224 ++++++++++++++++++++++++++
include/linux/sched/ext.h | 2 +
kernel/Kconfig.preempt | 2 +
kernel/sched/ext.c | 2 +
kernel/sched/ext.h | 2 +
6 files changed, 233 insertions(+)
create mode 100644 Documentation/scheduler/sched-ext.rst

diff --git a/Documentation/scheduler/index.rst b/Documentation/scheduler/index.rst
index b430d856056a..8a27a9967284 100644
--- a/Documentation/scheduler/index.rst
+++ b/Documentation/scheduler/index.rst
@@ -18,6 +18,7 @@ Linux Scheduler
sched-nice-design
sched-rt-group
sched-stats
+ sched-ext
sched-debug

text_files
diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
new file mode 100644
index 000000000000..8a3626c884e7
--- /dev/null
+++ b/Documentation/scheduler/sched-ext.rst
@@ -0,0 +1,224 @@
+==========================
+Extensible Scheduler Class
+==========================
+
+sched_ext is a scheduler class whose behavior can be defined by a set of BPF
+programs - the BPF scheduler.
+
+* sched_ext exports a full scheduling interface so that any scheduling
+ algorithm can be implemented on top.
+
+* The BPF scheduler can group CPUs however it sees fit and schedule them
+ together, as tasks aren't tied to specific CPUs at the time of wakeup.
+
+* The BPF scheduler can be turned on and off dynamically anytime.
+
+* The system integrity is maintained no matter what the BPF scheduler does.
+ The default scheduling behavior is restored anytime an error is detected,
+ a runnable task stalls, or on invoking the SysRq key sequence
+ :kbd:`SysRq-S`.
+
+Switching to and from sched_ext
+===============================
+
+``CONFIG_SCHED_CLASS_EXT`` is the config option to enable sched_ext and
+``tools/sched_ext`` contains the example schedulers.
+
+sched_ext is used only when the BPF scheduler is loaded and running.
+
+If a task explicitly sets its scheduling policy to ``SCHED_EXT``, it will be
+treated as ``SCHED_NORMAL`` and scheduled by CFS until the BPF scheduler is
+loaded. On load, such tasks will be switched to and scheduled by sched_ext.
+
+The BPF scheduler can choose to schedule all normal and lower class tasks by
+calling ``scx_bpf_switch_all()`` from its ``init()`` operation. In this
+case, all ``SCHED_NORMAL``, ``SCHED_BATCH``, ``SCHED_IDLE`` and
+``SCHED_EXT`` tasks are scheduled by sched_ext. In the example schedulers,
+this mode can be selected with the ``-a`` option.
+
+Terminating the sched_ext scheduler program, triggering :kbd:`SysRq-S`, or
+detection of any internal error including stalled runnable tasks aborts the
+BPF scheduler and reverts all tasks back to CFS.
+
+.. code-block:: none
+
+ # make -j16 -C tools/sched_ext
+ # tools/sched_ext/scx_example_dummy -a
+ local=0 global=3
+ local=5 global=24
+ local=9 global=44
+ local=13 global=56
+ local=17 global=72
+ ^CEXIT: BPF scheduler unregistered
+
+If ``CONFIG_SCHED_DEBUG`` is set, the current status of the BPF scheduler
+and whether a given task is on sched_ext can be determined as follows:
+
+.. code-block:: none
+
+ # cat /sys/kernel/debug/sched/ext
+ ops : dummy
+ enabled : 1
+ switching_all : 1
+ switched_all : 1
+ enable_state : enabled
+
+ # grep ext /proc/self/sched
+ ext.enabled : 1
+
+The Basics
+==========
+
+Userspace can implement an arbitrary BPF scheduler by loading a set of BPF
+programs that implement ``struct sched_ext_ops``. The only mandatory field
+is ``ops.name`` which must be a valid BPF object name. All operations are
+optional. The following modified excerpt is from
+``tools/sched/scx_example_dummy.bpf.c`` showing a minimal global FIFO
+scheduler.
+
+.. code-block:: c
+
+ s32 BPF_STRUCT_OPS(dummy_init)
+ {
+ if (switch_all)
+ scx_bpf_switch_all();
+ return 0;
+ }
+
+ void BPF_STRUCT_OPS(dummy_enqueue, struct task_struct *p, u64 enq_flags)
+ {
+ if (enq_flags & SCX_ENQ_LOCAL)
+ scx_bpf_dispatch(p, SCX_DSQ_LOCAL, enq_flags);
+ else
+ scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, enq_flags);
+ }
+
+ void BPF_STRUCT_OPS(dummy_exit, struct scx_exit_info *ei)
+ {
+ exit_type = ei->type;
+ }
+
+ SEC(".struct_ops")
+ struct sched_ext_ops dummy_ops = {
+ .enqueue = (void *)dummy_enqueue,
+ .init = (void *)dummy_init,
+ .exit = (void *)dummy_exit,
+ .name = "dummy",
+ };
+
+Dispatch Queues
+---------------
+
+To match the impedance between the scheduler core and the BPF scheduler,
+sched_ext uses simple FIFOs called DSQs (dispatch queues). By default, there
+is one global FIFO (``SCX_DSQ_GLOBAL``), and one local dsq per CPU
+(``SCX_DSQ_LOCAL``). The BPF scheduler can manage an arbitrary number of
+dsq's using ``scx_bpf_create_dsq()`` and ``scx_bpf_destroy_dsq()``.
+
+A CPU always executes a task from its local DSQ. A task is "dispatched" to a
+DSQ. A non-local DSQ is "consumed" to transfer a task to the consuming CPU's
+local DSQ.
+
+When a CPU is looking for the next task to run, if the local DSQ is not
+empty, the first task is picked. Otherwise, the CPU tries to consume the
+global DSQ. If that doesn't yield a runnable task either, ``ops.dispatch()``
+is invoked.
+
+Scheduling Cycle
+----------------
+
+The following briefly shows how a waking task is scheduled and executed.
+
+1. When a task is waking up, ``ops.select_cpu()`` is the first operation
+ invoked. This serves two purposes. First, CPU selection optimization
+ hint. Second, waking up the selected CPU if idle.
+
+ The CPU selected by ``ops.select_cpu()`` is an optimization hint and not
+ binding. The actual decision is made at the last step of scheduling.
+ However, there is a small performance gain if the CPU
+ ``ops.select_cpu()`` returns matches the CPU the task eventually runs on.
+
+ A side-effect of selecting a CPU is waking it up from idle. While a BPF
+ scheduler can wake up any cpu using the ``scx_bpf_kick_cpu()`` helper,
+ using ``ops.select_cpu()`` judiciously can be simpler and more efficient.
+
+ Note that the scheduler core will ignore an invalid CPU selection, for
+ example, if it's outside the allowed cpumask of the task.
+
+2. Once the target CPU is selected, ``ops.enqueue()`` is invoked. It can
+ make one of the following decisions:
+
+ * Immediately dispatch the task to either the global or local DSQ by
+ calling ``scx_bpf_dispatch()`` with ``SCX_DSQ_GLOBAL`` or
+ ``SCX_DSQ_LOCAL``, respectively.
+
+ * Immediately dispatch the task to a custom DSQ by calling
+ ``scx_bpf_dispatch()`` with a DSQ ID which is smaller than 2^63.
+
+ * Queue the task on the BPF side.
+
+3. When a CPU is ready to schedule, it first looks at its local DSQ. If
+ empty, it then looks at the global DSQ. If there still isn't a task to
+ run, ``ops.dispatch()`` is invoked which can use the following two
+ functions to populate the local DSQ.
+
+ * ``scx_bpf_dispatch()`` dispatches a task to a DSQ. Any target DSQ can
+ be used - ``SCX_DSQ_LOCAL``, ``SCX_DSQ_LOCAL_ON | cpu``,
+ ``SCX_DSQ_GLOBAL`` or a custom DSQ. While ``scx_bpf_dispatch()``
+ currently can't be called with BPF locks held, this is being worked on
+ and will be supported. ``scx_bpf_dispatch()`` schedules dispatching
+ rather than performing them immediately. There can be up to
+ ``ops.dispatch_max_batch`` pending tasks.
+
+ * ``scx_bpf_consume()`` tranfers a task from the specified non-local DSQ
+ to the dispatching DSQ. This function cannot be called with any BPF
+ locks held. ``scx_bpf_consume()`` flushes the pending dispatched tasks
+ before trying to consume the specified DSQ.
+
+4. After ``ops.dispatch()`` returns, if there are tasks in the local DSQ,
+ the CPU runs the first one. If empty, the following steps are taken:
+
+ * Try to consume the global DSQ. If successful, run the task.
+
+ * If ``ops.dispatch()`` has dispatched any tasks, retry #3.
+
+ * If the previous task is an SCX task and still runnable, keep executing
+ it (see ``SCX_OPS_ENQ_LAST``).
+
+ * Go idle.
+
+Note that the BPF scheduler can always choose to dispatch tasks immediately
+in ``ops.enqueue()`` as illustrated in the above dummy example. If only the
+built-in DSQs are used, there is no need to implement ``ops.dispatch()`` as
+a task is never queued on the BPF scheduler and both the local and global
+DSQs are consumed automatically.
+
+Where to Look
+=============
+
+* ``include/linux/sched/ext.h`` defines the core data structures, ops table
+ and constants.
+
+* ``kernel/sched/ext.c`` contains sched_ext core implementation and helpers.
+ The functions prefixed with ``scx_bpf_`` can be called from the BPF
+ scheduler.
+
+* ``tools/sched_ext/`` hosts example BPF scheduler implementations.
+
+ * ``scx_example_dummy[.bpf].c``: Minimal global FIFO scheduler example
+ using a custom DSQ.
+
+ * ``scx_example_qmap[.bpf].c``: A multi-level FIFO scheduler supporting
+ five levels of priority implemented with ``BPF_MAP_TYPE_QUEUE``.
+
+ABI Instability
+===============
+
+The APIs provided by sched_ext to BPF schedulers programs have no stability
+guarantees. This includes the ops table callbacks and constants defined in
+``include/linux/sched/ext.h``, as well as the ``scx_bpf_`` kfuncs defined in
+``kernel/sched/ext.c``.
+
+While we will attempt to provide a relatively stable API surface when
+possible, they are subject to change without warning between kernel
+versions.
diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index d3c2701bb4b4..6b230ecdcfa4 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -1,5 +1,7 @@
/* SPDX-License-Identifier: GPL-2.0 */
/*
+ * BPF extensible scheduler class: Documentation/scheduler/sched-ext.rst
+ *
* Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
* Copyright (c) 2022 Tejun Heo <[email protected]>
* Copyright (c) 2022 David Vernet <[email protected]>
diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index e12a057ead7b..bae49b743834 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -154,3 +154,5 @@ config SCHED_CLASS_EXT
wish to implement scheduling policies. The struct_ops structure
exported by sched_ext is struct sched_ext_ops, and is conceptually
similar to struct sched_class.
+
+ See Documentation/scheduler/sched-ext.rst for more details.
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 8619eb2dcbd5..828082e6e780 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1,5 +1,7 @@
/* SPDX-License-Identifier: GPL-2.0 */
/*
+ * BPF extensible scheduler class: Documentation/scheduler/sched-ext.rst
+ *
* Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
* Copyright (c) 2022 Tejun Heo <[email protected]>
* Copyright (c) 2022 David Vernet <[email protected]>
diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h
index c3df39984fc9..4252296ba464 100644
--- a/kernel/sched/ext.h
+++ b/kernel/sched/ext.h
@@ -1,5 +1,7 @@
/* SPDX-License-Identifier: GPL-2.0 */
/*
+ * BPF extensible scheduler class: Documentation/scheduler/sched-ext.rst
+ *
* Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
* Copyright (c) 2022 Tejun Heo <[email protected]>
* Copyright (c) 2022 David Vernet <[email protected]>
--
2.39.1


2023-01-28 00:20:36

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 29/30] sched_ext: Add a basic, userland vruntime scheduler

From: David Vernet <[email protected]>

This patch adds a new scx_example_userland BPF scheduler that implements a
fairly unsophisticated sorted-list vruntime scheduler in userland to
demonstrate how most scheduling decisions can be delegated to userland. The
scheduler doesn't implement load balancing, and treats all tasks as part of
a single domain.

Signed-off-by: David Vernet <[email protected]>
Reviewed-by: Tejun Heo <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
---
tools/sched_ext/.gitignore | 1 +
tools/sched_ext/Makefile | 10 +-
tools/sched_ext/scx_example_userland.bpf.c | 275 ++++++++++++
tools/sched_ext/scx_example_userland.c | 403 ++++++++++++++++++
tools/sched_ext/scx_example_userland_common.h | 19 +
5 files changed, 706 insertions(+), 2 deletions(-)
create mode 100644 tools/sched_ext/scx_example_userland.bpf.c
create mode 100644 tools/sched_ext/scx_example_userland.c
create mode 100644 tools/sched_ext/scx_example_userland_common.h

diff --git a/tools/sched_ext/.gitignore b/tools/sched_ext/.gitignore
index ebc34dcf925b..75a536dfebfc 100644
--- a/tools/sched_ext/.gitignore
+++ b/tools/sched_ext/.gitignore
@@ -2,6 +2,7 @@ scx_example_dummy
scx_example_qmap
scx_example_central
scx_example_pair
+scx_example_userland
*.skel.h
*.subskel.h
/tools/
diff --git a/tools/sched_ext/Makefile b/tools/sched_ext/Makefile
index 2303736698a2..fcb4faa75e37 100644
--- a/tools/sched_ext/Makefile
+++ b/tools/sched_ext/Makefile
@@ -115,7 +115,8 @@ BPF_CFLAGS = -g -D__TARGET_ARCH_$(SRCARCH) \
-Wno-compare-distinct-pointer-types \
-O2 -mcpu=v3

-all: scx_example_dummy scx_example_qmap scx_example_central scx_example_pair
+all: scx_example_dummy scx_example_qmap scx_example_central scx_example_pair \
+ scx_example_userland

# sort removes libbpf duplicates when not cross-building
MAKE_DIRS := $(sort $(BUILD_DIR)/libbpf $(HOST_BUILD_DIR)/libbpf \
@@ -182,11 +183,16 @@ scx_example_pair: scx_example_pair.c scx_example_pair.skel.h user_exit_info.h
$(CC) $(CFLAGS) -c $< -o [email protected]
$(CC) -o $@ [email protected] $(HOST_BPFOBJ) $(LDFLAGS)

+scx_example_userland: scx_example_userland.c scx_example_userland.skel.h \
+ scx_example_userland_common.h user_exit_info.h
+ $(CC) $(CFLAGS) -c $< -o [email protected]
+ $(CC) -o $@ [email protected] $(HOST_BPFOBJ) $(LDFLAGS)
+
clean:
rm -rf $(SCRATCH_DIR) $(HOST_SCRATCH_DIR)
rm -f *.o *.bpf.o *.skel.h *.subskel.h
rm -f scx_example_dummy scx_example_qmap scx_example_central \
- scx_example_pair
+ scx_example_pair scx_example_userland

.PHONY: all clean

diff --git a/tools/sched_ext/scx_example_userland.bpf.c b/tools/sched_ext/scx_example_userland.bpf.c
new file mode 100644
index 000000000000..65362d5d5a60
--- /dev/null
+++ b/tools/sched_ext/scx_example_userland.bpf.c
@@ -0,0 +1,275 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * A minimal userland scheduler.
+ *
+ * In terms of scheduling, this provides two different types of behaviors:
+ * 1. A global FIFO scheduling order for _any_ tasks that have CPU affinity.
+ * All such tasks are direct-dispatched from the kernel, and are never
+ * enqueued in user space.
+ * 2. A primitive vruntime scheduler that is implemented in user space, for all
+ * other tasks.
+ *
+ * Some parts of this example user space scheduler could be implemented more
+ * efficiently using more complex and sophisticated data structures. For
+ * example, rather than using BPF_MAP_TYPE_QUEUE's,
+ * BPF_MAP_TYPE_{USER_}RINGBUF's could be used for exchanging messages between
+ * user space and kernel space. Similarly, we use a simple vruntime-sorted list
+ * in user space, but an rbtree could be used instead.
+ *
+ * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2022 Tejun Heo <[email protected]>
+ * Copyright (c) 2022 David Vernet <[email protected]>
+ */
+#include <string.h>
+#include "scx_common.bpf.h"
+#include "scx_example_userland_common.h"
+
+char _license[] SEC("license") = "GPL";
+
+const volatile bool switch_all;
+const volatile u32 num_possible_cpus;
+const volatile s32 usersched_pid;
+
+/* Stats that are printed by user space. */
+u64 nr_failed_enqueues, nr_kernel_enqueues, nr_user_enqueues;
+
+struct user_exit_info uei;
+
+/*
+ * Whether the user space scheduler needs to be scheduled due to a task being
+ * enqueued in user space.
+ */
+static bool usersched_needed;
+
+/*
+ * The map containing tasks that are enqueued in user space from the kernel.
+ *
+ * This map is drained by the user space scheduler.
+ */
+struct {
+ __uint(type, BPF_MAP_TYPE_QUEUE);
+ __uint(max_entries, USERLAND_MAX_TASKS);
+ __type(value, struct scx_userland_enqueued_task);
+} enqueued SEC(".maps");
+
+/*
+ * The map containing tasks that are dispatched to the kernel from user space.
+ *
+ * Drained by the kernel in userland_dispatch().
+ */
+struct {
+ __uint(type, BPF_MAP_TYPE_QUEUE);
+ __uint(max_entries, USERLAND_MAX_TASKS);
+ __type(value, s32);
+} dispatched SEC(".maps");
+
+/* Per-task scheduling context */
+struct task_ctx {
+ bool force_local; /* Dispatch directly to local DSQ */
+};
+
+/* Map that contains task-local storage. */
+struct {
+ __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
+ __uint(map_flags, BPF_F_NO_PREALLOC);
+ __type(key, int);
+ __type(value, struct task_ctx);
+} task_ctx_stor SEC(".maps");
+
+static bool is_usersched_task(const struct task_struct *p)
+{
+ return p->pid == usersched_pid;
+}
+
+static bool keep_in_kernel(const struct task_struct *p)
+{
+ return p->nr_cpus_allowed < num_possible_cpus;
+}
+
+static struct task_struct *usersched_task(void)
+{
+ struct task_struct *p;
+
+ p = bpf_task_from_pid(usersched_pid);
+ /*
+ * Should never happen -- the usersched task should always be managed
+ * by sched_ext.
+ */
+ if (!p) {
+ scx_bpf_error("Failed to find usersched task %d", usersched_pid);
+ /*
+ * We should never hit this path, and we error out of the
+ * scheduler above just in case, so the scheduler will soon be
+ * be evicted regardless. So as to simplify the logic in the
+ * caller to not have to check for NULL, return an acquired
+ * reference to the current task here rather than NULL.
+ */
+ return bpf_task_acquire(bpf_get_current_task_btf());
+ }
+
+ return p;
+}
+
+s32 BPF_STRUCT_OPS(userland_select_cpu, struct task_struct *p,
+ s32 prev_cpu, u64 wake_flags)
+{
+ if (keep_in_kernel(p)) {
+ s32 cpu;
+ struct task_ctx *tctx;
+
+ tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
+ if (!tctx) {
+ scx_bpf_error("Failed to look up task-local storage for %s", p->comm);
+ return -ESRCH;
+ }
+
+ if (p->nr_cpus_allowed == 1 ||
+ scx_bpf_test_and_clear_cpu_idle(prev_cpu)) {
+ tctx->force_local = true;
+ return prev_cpu;
+ }
+
+ cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr);
+ if (cpu >= 0) {
+ tctx->force_local = true;
+ return cpu;
+ }
+ }
+
+ return prev_cpu;
+}
+
+static void dispatch_user_scheduler(void)
+{
+ struct task_struct *p;
+
+ usersched_needed = false;
+ p = usersched_task();
+ scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, 0);
+ bpf_task_release(p);
+}
+
+static void enqueue_task_in_user_space(struct task_struct *p, u64 enq_flags)
+{
+ struct scx_userland_enqueued_task task;
+
+ memset(&task, 0, sizeof(task));
+ task.pid = p->pid;
+ task.sum_exec_runtime = p->se.sum_exec_runtime;
+ task.weight = p->scx.weight;
+
+ if (bpf_map_push_elem(&enqueued, &task, 0)) {
+ /*
+ * If we fail to enqueue the task in user space, put it
+ * directly on the global DSQ.
+ */
+ __sync_fetch_and_add(&nr_failed_enqueues, 1);
+ scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
+ } else {
+ __sync_fetch_and_add(&nr_user_enqueues, 1);
+ usersched_needed = true;
+ }
+}
+
+void BPF_STRUCT_OPS(userland_enqueue, struct task_struct *p, u64 enq_flags)
+{
+ if (keep_in_kernel(p)) {
+ u64 dsq_id = SCX_DSQ_GLOBAL;
+ struct task_ctx *tctx;
+
+ tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
+ if (!tctx) {
+ scx_bpf_error("Failed to lookup task ctx for %s", p->comm);
+ return;
+ }
+
+ if (tctx->force_local)
+ dsq_id = SCX_DSQ_LOCAL;
+ tctx->force_local = false;
+ scx_bpf_dispatch(p, dsq_id, SCX_SLICE_DFL, enq_flags);
+ __sync_fetch_and_add(&nr_kernel_enqueues, 1);
+ return;
+ } else if (!is_usersched_task(p)) {
+ enqueue_task_in_user_space(p, enq_flags);
+ }
+}
+
+static int drain_dispatch_q_loopfn(u32 idx, void *data)
+{
+ s32 cpu = *(s32 *)data;
+ s32 pid;
+ struct task_struct *p;
+
+ if (bpf_map_pop_elem(&dispatched, &pid))
+ return 1;
+
+ /*
+ * The task could have exited by the time we get around to dispatching
+ * it. Treat this as a normal occurrence, and simply move onto the next
+ * iteration.
+ */
+ p = bpf_task_from_pid(pid);
+ if (!p)
+ return 0;
+
+ scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, 0);
+ bpf_task_release(p);
+ return 0;
+}
+
+void BPF_STRUCT_OPS(userland_dispatch, s32 cpu, struct task_struct *prev)
+{
+ if (usersched_needed)
+ dispatch_user_scheduler();
+
+ /* XXX: Use an iterator when it's available. */
+ bpf_loop(4096, drain_dispatch_q_loopfn, &cpu, 0);
+}
+
+s32 BPF_STRUCT_OPS(userland_prep_enable, struct task_struct *p,
+ struct scx_enable_args *args)
+{
+ if (bpf_task_storage_get(&task_ctx_stor, p, 0,
+ BPF_LOCAL_STORAGE_GET_F_CREATE))
+ return 0;
+ else
+ return -ENOMEM;
+}
+
+s32 BPF_STRUCT_OPS(userland_init)
+{
+ int ret;
+
+ if (num_possible_cpus == 0) {
+ scx_bpf_error("User scheduler # CPUs uninitialized (%d)",
+ num_possible_cpus);
+ return -EINVAL;
+ }
+
+ if (usersched_pid <= 0) {
+ scx_bpf_error("User scheduler pid uninitialized (%d)",
+ usersched_pid);
+ return -EINVAL;
+ }
+
+ if (switch_all)
+ scx_bpf_switch_all();
+ return 0;
+}
+
+void BPF_STRUCT_OPS(userland_exit, struct scx_exit_info *ei)
+{
+ uei_record(&uei, ei);
+}
+
+SEC(".struct_ops")
+struct sched_ext_ops userland_ops = {
+ .select_cpu = (void *)userland_select_cpu,
+ .enqueue = (void *)userland_enqueue,
+ .dispatch = (void *)userland_dispatch,
+ .prep_enable = (void *)userland_prep_enable,
+ .init = (void *)userland_init,
+ .exit = (void *)userland_exit,
+ .timeout_ms = 3000,
+ .name = "userland",
+};
diff --git a/tools/sched_ext/scx_example_userland.c b/tools/sched_ext/scx_example_userland.c
new file mode 100644
index 000000000000..4ddd257b9e42
--- /dev/null
+++ b/tools/sched_ext/scx_example_userland.c
@@ -0,0 +1,403 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * A demo sched_ext user space scheduler which provides vruntime semantics
+ * using a simple ordered-list implementation.
+ *
+ * Each CPU in the system resides in a single, global domain. This precludes
+ * the need to do any load balancing between domains. The scheduler could
+ * easily be extended to support multiple domains, with load balancing
+ * happening in user space.
+ *
+ * Any task which has any CPU affinity is scheduled entirely in BPF. This
+ * program only schedules tasks which may run on any CPU.
+ *
+ * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2022 Tejun Heo <[email protected]>
+ * Copyright (c) 2022 David Vernet <[email protected]>
+ */
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <unistd.h>
+#include <sched.h>
+#include <signal.h>
+#include <assert.h>
+#include <libgen.h>
+#include <pthread.h>
+#include <bpf/bpf.h>
+#include <sys/mman.h>
+#include <sys/queue.h>
+#include <sys/syscall.h>
+
+#include "user_exit_info.h"
+#include "scx_example_userland_common.h"
+#include "scx_example_userland.skel.h"
+
+const char help_fmt[] =
+"A minimal userland sched_ext scheduler.\n"
+"\n"
+"See the top-level comment in .bpf.c for more details.\n"
+"\n"
+"Usage: %s [-a]\n"
+"\n"
+" -a Switch all tasks\n"
+" -b The number of tasks to batch when dispatching.\n"
+" Defaults to 8\n"
+" -h Display this help and exit\n";
+
+/* Defined in UAPI */
+#define SCHED_EXT 7
+
+/* Number of tasks to batch when dispatching to user space. */
+static __u32 batch_size = 8;
+
+static volatile int exit_req;
+static int enqueued_fd, dispatched_fd;
+
+static struct scx_example_userland *skel;
+static struct bpf_link *ops_link;
+
+/* Stats collected in user space. */
+static __u64 nr_vruntime_enqueues, nr_vruntime_dispatches;
+
+/* The data structure containing tasks that are enqueued in user space. */
+struct enqueued_task {
+ LIST_ENTRY(enqueued_task) entries;
+ __u64 sum_exec_runtime;
+ double vruntime;
+};
+
+/*
+ * Use a vruntime-sorted list to store tasks. This could easily be extended to
+ * a more optimal data structure, such as an rbtree as is done in CFS. We
+ * currently elect to use a sorted list to simplify the example for
+ * illustrative purposes.
+ */
+LIST_HEAD(listhead, enqueued_task);
+
+/*
+ * A vruntime-sorted list of tasks. The head of the list contains the task with
+ * the lowest vruntime. That is, the task that has the "highest" claim to be
+ * scheduled.
+ */
+static struct listhead vruntime_head = LIST_HEAD_INITIALIZER(vruntime_head);
+
+/*
+ * The statically allocated array of tasks. We use a statically allocated list
+ * here to avoid having to allocate on the enqueue path, which could cause a
+ * deadlock. A more substantive user space scheduler could e.g. provide a hook
+ * for newly enabled tasks that are passed to the scheduler from the
+ * .prep_enable() callback to allows the scheduler to allocate on safe paths.
+ */
+struct enqueued_task tasks[USERLAND_MAX_TASKS];
+
+static double min_vruntime;
+
+static void sigint_handler(int userland)
+{
+ exit_req = 1;
+}
+
+static __u32 task_pid(const struct enqueued_task *task)
+{
+ return ((uintptr_t)task - (uintptr_t)tasks) / sizeof(*task);
+}
+
+static int dispatch_task(s32 pid)
+{
+ int err;
+
+ err = bpf_map_update_elem(dispatched_fd, NULL, &pid, 0);
+ if (err) {
+ fprintf(stderr, "Failed to dispatch task %d\n", pid);
+ exit_req = 1;
+ } else {
+ nr_vruntime_dispatches++;
+ }
+
+ return err;
+}
+
+static struct enqueued_task *get_enqueued_task(__s32 pid)
+{
+ if (pid >= USERLAND_MAX_TASKS)
+ return NULL;
+
+ return &tasks[pid];
+}
+
+static double calc_vruntime_delta(__u64 weight, __u64 delta)
+{
+ double weight_f = (double)weight / 100.0;
+ double delta_f = (double)delta;
+
+ return delta_f / weight_f;
+}
+
+static void update_enqueued(struct enqueued_task *enqueued, const struct scx_userland_enqueued_task *bpf_task)
+{
+ __u64 delta;
+
+ delta = bpf_task->sum_exec_runtime - enqueued->sum_exec_runtime;
+
+ enqueued->vruntime += calc_vruntime_delta(bpf_task->weight, delta);
+ if (min_vruntime > enqueued->vruntime)
+ enqueued->vruntime = min_vruntime;
+ enqueued->sum_exec_runtime = bpf_task->sum_exec_runtime;
+}
+
+static int vruntime_enqueue(const struct scx_userland_enqueued_task *bpf_task)
+{
+ struct enqueued_task *curr, *enqueued, *prev;
+
+ curr = get_enqueued_task(bpf_task->pid);
+ if (!curr)
+ return ENOENT;
+
+ update_enqueued(curr, bpf_task);
+ nr_vruntime_enqueues++;
+
+ /*
+ * Enqueue the task in a vruntime-sorted list. A more optimal data
+ * structure such as an rbtree could easily be used as well. We elect
+ * to use a list here simply because it's less code, and thus the
+ * example is less convoluted and better serves to illustrate what a
+ * user space scheduler could look like.
+ */
+
+ if (LIST_EMPTY(&vruntime_head)) {
+ LIST_INSERT_HEAD(&vruntime_head, curr, entries);
+ return 0;
+ }
+
+ LIST_FOREACH(enqueued, &vruntime_head, entries) {
+ if (curr->vruntime <= enqueued->vruntime) {
+ LIST_INSERT_BEFORE(enqueued, curr, entries);
+ return 0;
+ }
+ prev = enqueued;
+ }
+
+ LIST_INSERT_AFTER(prev, curr, entries);
+
+ return 0;
+}
+
+static void drain_enqueued_map(void)
+{
+ while (1) {
+ struct scx_userland_enqueued_task task;
+ int err;
+
+ if (bpf_map_lookup_and_delete_elem(enqueued_fd, NULL, &task))
+ return;
+
+ err = vruntime_enqueue(&task);
+ if (err) {
+ fprintf(stderr, "Failed to enqueue task %d: %s\n",
+ task.pid, strerror(err));
+ exit_req = 1;
+ return;
+ }
+ }
+}
+
+static void dispatch_batch(void)
+{
+ __u32 i;
+
+ for (i = 0; i < batch_size; i++) {
+ struct enqueued_task *task;
+ int err;
+ __s32 pid;
+
+ task = LIST_FIRST(&vruntime_head);
+ if (!task)
+ return;
+
+ min_vruntime = task->vruntime;
+ pid = task_pid(task);
+ LIST_REMOVE(task, entries);
+ err = dispatch_task(pid);
+ if (err) {
+ fprintf(stderr, "Failed to dispatch task %d in %u\n",
+ pid, i);
+ return;
+ }
+ }
+}
+
+static void *run_stats_printer(void *arg)
+{
+ while (!exit_req) {
+ __u64 nr_failed_enqueues, nr_kernel_enqueues, nr_user_enqueues, total;
+
+ nr_failed_enqueues = skel->bss->nr_failed_enqueues;
+ nr_kernel_enqueues = skel->bss->nr_kernel_enqueues;
+ nr_user_enqueues = skel->bss->nr_user_enqueues;
+ total = nr_failed_enqueues + nr_kernel_enqueues + nr_user_enqueues;
+
+ printf("o-----------------------o\n");
+ printf("| BPF ENQUEUES |\n");
+ printf("|-----------------------|\n");
+ printf("| kern: %10llu |\n", nr_kernel_enqueues);
+ printf("| user: %10llu |\n", nr_user_enqueues);
+ printf("| failed: %10llu |\n", nr_failed_enqueues);
+ printf("| -------------------- |\n");
+ printf("| total: %10llu |\n", total);
+ printf("| |\n");
+ printf("|-----------------------|\n");
+ printf("| VRUNTIME / USER |\n");
+ printf("|-----------------------|\n");
+ printf("| enq: %10llu |\n", nr_vruntime_enqueues);
+ printf("| disp: %10llu |\n", nr_vruntime_dispatches);
+ printf("o-----------------------o\n");
+ printf("\n\n");
+ sleep(1);
+ }
+
+ return NULL;
+}
+
+static int spawn_stats_thread(void)
+{
+ pthread_t stats_printer;
+
+ return pthread_create(&stats_printer, NULL, run_stats_printer, NULL);
+}
+
+static int bootstrap(int argc, char **argv)
+{
+ int err;
+ __u32 opt;
+ struct sched_param sched_param = {
+ .sched_priority = sched_get_priority_max(SCHED_EXT),
+ };
+ bool switch_all = false;
+
+ signal(SIGINT, sigint_handler);
+ signal(SIGTERM, sigint_handler);
+ libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
+
+ /*
+ * Enforce that the user scheduler task is managed by sched_ext. The
+ * task eagerly drains the list of enqueued tasks in its main work
+ * loop, and then yields the CPU. The BPF scheduler only schedules the
+ * user space scheduler task when at least one other task in the system
+ * needs to be scheduled.
+ */
+ err = syscall(__NR_sched_setscheduler, getpid(), SCHED_EXT, &sched_param);
+ if (err) {
+ fprintf(stderr, "Failed to set scheduler to SCHED_EXT: %s\n", strerror(err));
+ return err;
+ }
+
+ while ((opt = getopt(argc, argv, "ahb:")) != -1) {
+ switch (opt) {
+ case 'a':
+ switch_all = true;
+ break;
+ case 'b':
+ batch_size = strtoul(optarg, NULL, 0);
+ break;
+ default:
+ fprintf(stderr, help_fmt, basename(argv[0]));
+ exit(opt != 'h');
+ }
+ }
+
+ /*
+ * It's not always safe to allocate in a user space scheduler, as an
+ * enqueued task could hold a lock that we require in order to be able
+ * to allocate.
+ */
+ err = mlockall(MCL_CURRENT | MCL_FUTURE);
+ if (err) {
+ fprintf(stderr, "Failed to prefault and lock address space: %s\n",
+ strerror(err));
+ return err;
+ }
+
+ skel = scx_example_userland__open();
+ if (!skel) {
+ fprintf(stderr, "Failed to open scheduler: %s\n", strerror(errno));
+ return errno;
+ }
+ skel->rodata->num_possible_cpus = libbpf_num_possible_cpus();
+ assert(skel->rodata->num_possible_cpus > 0);
+ skel->rodata->usersched_pid = getpid();
+ assert(skel->rodata->usersched_pid > 0);
+ skel->rodata->switch_all = switch_all;
+
+ err = scx_example_userland__load(skel);
+ if (err) {
+ fprintf(stderr, "Failed to load scheduler: %s\n", strerror(err));
+ goto destroy_skel;
+ }
+
+ enqueued_fd = bpf_map__fd(skel->maps.enqueued);
+ dispatched_fd = bpf_map__fd(skel->maps.dispatched);
+ assert(enqueued_fd > 0);
+ assert(dispatched_fd > 0);
+
+ err = spawn_stats_thread();
+ if (err) {
+ fprintf(stderr, "Failed to spawn stats thread: %s\n", strerror(err));
+ goto destroy_skel;
+ }
+
+ ops_link = bpf_map__attach_struct_ops(skel->maps.userland_ops);
+ if (!ops_link) {
+ fprintf(stderr, "Failed to attach struct ops: %s\n", strerror(errno));
+ err = errno;
+ goto destroy_skel;
+ }
+
+ return 0;
+
+destroy_skel:
+ scx_example_userland__destroy(skel);
+ exit_req = 1;
+ return err;
+}
+
+static void sched_main_loop(void)
+{
+ while (!exit_req) {
+ /*
+ * Perform the following work in the main user space scheduler
+ * loop:
+ *
+ * 1. Drain all tasks from the enqueued map, and enqueue them
+ * to the vruntime sorted list.
+ *
+ * 2. Dispatch a batch of tasks from the vruntime sorted list
+ * down to the kernel.
+ *
+ * 3. Yield the CPU back to the system. The BPF scheduler will
+ * reschedule the user space scheduler once another task has
+ * been enqueued to user space.
+ */
+ drain_enqueued_map();
+ dispatch_batch();
+ sched_yield();
+ }
+}
+
+int main(int argc, char **argv)
+{
+ int err;
+
+ err = bootstrap(argc, argv);
+ if (err) {
+ fprintf(stderr, "Failed to bootstrap scheduler: %s\n", strerror(err));
+ return err;
+ }
+
+ sched_main_loop();
+
+ exit_req = 1;
+ bpf_link__destroy(ops_link);
+ uei_print(&skel->bss->uei);
+ scx_example_userland__destroy(skel);
+ return 0;
+}
diff --git a/tools/sched_ext/scx_example_userland_common.h b/tools/sched_ext/scx_example_userland_common.h
new file mode 100644
index 000000000000..639c6809c5ff
--- /dev/null
+++ b/tools/sched_ext/scx_example_userland_common.h
@@ -0,0 +1,19 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Meta, Inc */
+
+#ifndef __SCX_USERLAND_COMMON_H
+#define __SCX_USERLAND_COMMON_H
+
+#define USERLAND_MAX_TASKS 8192
+
+/*
+ * An instance of a task that has been enqueued by the kernel for consumption
+ * by a user space global scheduler thread.
+ */
+struct scx_userland_enqueued_task {
+ __s32 pid;
+ u64 sum_exec_runtime;
+ u64 weight;
+};
+
+#endif // __SCX_USERLAND_COMMON_H
--
2.39.1


2023-01-28 00:20:42

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 30/30] sched_ext: Add a rust userspace hybrid example scheduler

From: Dan Schatzberg <[email protected]>

Atropos is a multi-domain BPF / userspace hybrid scheduler where the BPF
part does simple round robin in each domain and the userspace part
calculates the load factor of each domain and tells the BPF part how to load
balance the domains.

This scheduler demonstrates dividing scheduling logic between BPF and
userspace and using rust to build the userspace part. An earlier variant of
this scheduler was used to balance across six domains, each representing a
chiplet in a six-chiplet AMD processor, and could match the performance of
production setup using CFS.

v2: Updated to use generic BPF cpumask helpers.

Signed-off-by: Dan Schatzberg <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
---
tools/sched_ext/Makefile | 13 +-
tools/sched_ext/atropos/.gitignore | 3 +
tools/sched_ext/atropos/Cargo.toml | 28 +
tools/sched_ext/atropos/build.rs | 70 ++
tools/sched_ext/atropos/rustfmt.toml | 8 +
tools/sched_ext/atropos/src/bpf/atropos.bpf.c | 632 +++++++++++++++++
tools/sched_ext/atropos/src/bpf/atropos.h | 44 ++
tools/sched_ext/atropos/src/main.rs | 648 ++++++++++++++++++
.../sched_ext/atropos/src/oss/atropos_sys.rs | 10 +
tools/sched_ext/atropos/src/oss/mod.rs | 29 +
tools/sched_ext/atropos/src/util.rs | 24 +
11 files changed, 1507 insertions(+), 2 deletions(-)
create mode 100644 tools/sched_ext/atropos/.gitignore
create mode 100644 tools/sched_ext/atropos/Cargo.toml
create mode 100644 tools/sched_ext/atropos/build.rs
create mode 100644 tools/sched_ext/atropos/rustfmt.toml
create mode 100644 tools/sched_ext/atropos/src/bpf/atropos.bpf.c
create mode 100644 tools/sched_ext/atropos/src/bpf/atropos.h
create mode 100644 tools/sched_ext/atropos/src/main.rs
create mode 100644 tools/sched_ext/atropos/src/oss/atropos_sys.rs
create mode 100644 tools/sched_ext/atropos/src/oss/mod.rs
create mode 100644 tools/sched_ext/atropos/src/util.rs

diff --git a/tools/sched_ext/Makefile b/tools/sched_ext/Makefile
index fcb4faa75e37..0ae20dc0f10d 100644
--- a/tools/sched_ext/Makefile
+++ b/tools/sched_ext/Makefile
@@ -85,6 +85,8 @@ CFLAGS += -g -O2 -rdynamic -pthread -Wall -Werror $(GENFLAGS) \
-I$(INCLUDE_DIR) -I$(GENDIR) -I$(LIBDIR) \
-I$(TOOLSINCDIR) -I$(APIDIR)

+CARGOFLAGS := --release
+
# Silence some warnings when compiled with clang
ifneq ($(LLVM),)
CFLAGS += -Wno-unused-command-line-argument
@@ -116,7 +118,7 @@ BPF_CFLAGS = -g -D__TARGET_ARCH_$(SRCARCH) \
-O2 -mcpu=v3

all: scx_example_dummy scx_example_qmap scx_example_central scx_example_pair \
- scx_example_userland
+ scx_example_userland atropos

# sort removes libbpf duplicates when not cross-building
MAKE_DIRS := $(sort $(BUILD_DIR)/libbpf $(HOST_BUILD_DIR)/libbpf \
@@ -188,13 +190,20 @@ scx_example_userland: scx_example_userland.c scx_example_userland.skel.h \
$(CC) $(CFLAGS) -c $< -o [email protected]
$(CC) -o $@ [email protected] $(HOST_BPFOBJ) $(LDFLAGS)

+atropos: export RUSTFLAGS = -C link-args=-lzstd -C link-args=-lz -C link-args=-lelf -L $(BPFOBJ_DIR)
+atropos: export ATROPOS_CLANG = $(CLANG)
+atropos: export ATROPOS_BPF_CFLAGS = $(BPF_CFLAGS)
+atropos: $(INCLUDE_DIR)/vmlinux.h
+ cargo build --manifest-path=atropos/Cargo.toml $(CARGOFLAGS)
+
clean:
+ cargo clean --manifest-path=atropos/Cargo.toml
rm -rf $(SCRATCH_DIR) $(HOST_SCRATCH_DIR)
rm -f *.o *.bpf.o *.skel.h *.subskel.h
rm -f scx_example_dummy scx_example_qmap scx_example_central \
scx_example_pair scx_example_userland

-.PHONY: all clean
+.PHONY: all atropos clean

# delete failed targets
.DELETE_ON_ERROR:
diff --git a/tools/sched_ext/atropos/.gitignore b/tools/sched_ext/atropos/.gitignore
new file mode 100644
index 000000000000..186dba259ec2
--- /dev/null
+++ b/tools/sched_ext/atropos/.gitignore
@@ -0,0 +1,3 @@
+src/bpf/.output
+Cargo.lock
+target
diff --git a/tools/sched_ext/atropos/Cargo.toml b/tools/sched_ext/atropos/Cargo.toml
new file mode 100644
index 000000000000..fcfce056c741
--- /dev/null
+++ b/tools/sched_ext/atropos/Cargo.toml
@@ -0,0 +1,28 @@
+[package]
+name = "atropos-bin"
+version = "0.5.0"
+authors = ["Dan Schatzberg <[email protected]>", "Meta"]
+edition = "2021"
+description = "Userspace scheduling with BPF"
+license = "GPL-2.0-only"
+
+[dependencies]
+anyhow = "1.0.65"
+bitvec = { version = "1.0", features = ["serde"] }
+clap = { version = "3.2.17", features = ["derive", "env", "regex", "unicode", "wrap_help"] }
+ctrlc = { version = "3.1", features = ["termination"] }
+fb_procfs = { git = "https://github.com/facebookincubator/below.git", rev = "f305730"}
+hex = "0.4.3"
+libbpf-rs = "0.19.1"
+libbpf-sys = { version = "1.0.4", features = ["novendor", "static"] }
+libc = "0.2.137"
+slog = { version = "2.7", features = ["max_level_trace", "nested-values"] }
+slog-async = { version = "2.3", features = ["nested-values"] }
+slog-term = "2.8"
+
+[build-dependencies]
+bindgen = { version = "0.61.0", features = ["logging", "static"], default-features = false }
+libbpf-cargo = "0.13.0"
+
+[features]
+enable_backtrace = []
diff --git a/tools/sched_ext/atropos/build.rs b/tools/sched_ext/atropos/build.rs
new file mode 100644
index 000000000000..26e792c5e17e
--- /dev/null
+++ b/tools/sched_ext/atropos/build.rs
@@ -0,0 +1,70 @@
+// Copyright (c) Meta Platforms, Inc. and affiliates.
+
+// This software may be used and distributed according to the terms of the
+// GNU General Public License version 2.
+extern crate bindgen;
+
+use std::env;
+use std::fs::create_dir_all;
+use std::path::Path;
+use std::path::PathBuf;
+
+use libbpf_cargo::SkeletonBuilder;
+
+const HEADER_PATH: &str = "src/bpf/atropos.h";
+
+fn bindgen_atropos() {
+ // Tell cargo to invalidate the built crate whenever the wrapper changes
+ println!("cargo:rerun-if-changed={}", HEADER_PATH);
+
+ // The bindgen::Builder is the main entry point
+ // to bindgen, and lets you build up options for
+ // the resulting bindings.
+ let bindings = bindgen::Builder::default()
+ // The input header we would like to generate
+ // bindings for.
+ .header(HEADER_PATH)
+ // Tell cargo to invalidate the built crate whenever any of the
+ // included header files changed.
+ .parse_callbacks(Box::new(bindgen::CargoCallbacks))
+ // Finish the builder and generate the bindings.
+ .generate()
+ // Unwrap the Result and panic on failure.
+ .expect("Unable to generate bindings");
+
+ // Write the bindings to the $OUT_DIR/bindings.rs file.
+ let out_path = PathBuf::from(env::var("OUT_DIR").unwrap());
+ bindings
+ .write_to_file(out_path.join("atropos-sys.rs"))
+ .expect("Couldn't write bindings!");
+}
+
+fn gen_bpf_sched(name: &str) {
+ let bpf_cflags = env::var("ATROPOS_BPF_CFLAGS").unwrap();
+ let clang = env::var("ATROPOS_CLANG").unwrap();
+ eprintln!("{}", clang);
+ let outpath = format!("./src/bpf/.output/{}.skel.rs", name);
+ let skel = Path::new(&outpath);
+ let src = format!("./src/bpf/{}.bpf.c", name);
+ SkeletonBuilder::new()
+ .source(src.clone())
+ .clang(clang)
+ .clang_args(bpf_cflags)
+ .build_and_generate(&skel)
+ .unwrap();
+ println!("cargo:rerun-if-changed={}", src);
+}
+
+fn main() {
+ bindgen_atropos();
+ // It's unfortunate we cannot use `OUT_DIR` to store the generated skeleton.
+ // Reasons are because the generated skeleton contains compiler attributes
+ // that cannot be `include!()`ed via macro. And we cannot use the `#[path = "..."]`
+ // trick either because you cannot yet `concat!(env!("OUT_DIR"), "/skel.rs")` inside
+ // the path attribute either (see https://github.com/rust-lang/rust/pull/83366).
+ //
+ // However, there is hope! When the above feature stabilizes we can clean this
+ // all up.
+ create_dir_all("./src/bpf/.output").unwrap();
+ gen_bpf_sched("atropos");
+}
diff --git a/tools/sched_ext/atropos/rustfmt.toml b/tools/sched_ext/atropos/rustfmt.toml
new file mode 100644
index 000000000000..b7258ed0a8d8
--- /dev/null
+++ b/tools/sched_ext/atropos/rustfmt.toml
@@ -0,0 +1,8 @@
+# Get help on options with `rustfmt --help=config`
+# Please keep these in alphabetical order.
+edition = "2021"
+group_imports = "StdExternalCrate"
+imports_granularity = "Item"
+merge_derives = false
+use_field_init_shorthand = true
+version = "Two"
diff --git a/tools/sched_ext/atropos/src/bpf/atropos.bpf.c b/tools/sched_ext/atropos/src/bpf/atropos.bpf.c
new file mode 100644
index 000000000000..17b89d57e487
--- /dev/null
+++ b/tools/sched_ext/atropos/src/bpf/atropos.bpf.c
@@ -0,0 +1,632 @@
+// Copyright (c) Meta Platforms, Inc. and affiliates.
+
+// This software may be used and distributed according to the terms of the
+// GNU General Public License version 2.
+//
+// Atropos is a multi-domain BPF / userspace hybrid scheduler where the BPF
+// part does simple round robin in each domain and the userspace part
+// calculates the load factor of each domain and tells the BPF part how to load
+// balance the domains.
+//
+// Every task has an entry in the task_data map which lists which domain the
+// task belongs to. When a task first enters the system (atropos_prep_enable),
+// they are round-robined to a domain.
+//
+// atropos_select_cpu is the primary scheduling logic, invoked when a task
+// becomes runnable. The lb_data map is populated by userspace to inform the BPF
+// scheduler that a task should be migrated to a new domain. Otherwise, the task
+// is scheduled in priority order as follows:
+// * The current core if the task was woken up synchronously and there are idle
+// cpus in the system
+// * The previous core, if idle
+// * The pinned-to core if the task is pinned to a specific core
+// * Any idle cpu in the domain
+//
+// If none of the above conditions are met, then the task is enqueued to a
+// dispatch queue corresponding to the domain (atropos_enqueue).
+//
+// atropos_dispatch will attempt to consume a task from its domain's
+// corresponding dispatch queue (this occurs after scheduling any tasks directly
+// assigned to it due to the logic in atropos_select_cpu). If no task is found,
+// then greedy load stealing will attempt to find a task on another dispatch
+// queue to run.
+//
+// Load balancing is almost entirely handled by userspace. BPF populates the
+// task weight, dom mask and current dom in the task_data map and executes the
+// load balance based on userspace populating the lb_data map.
+#include "../../../scx_common.bpf.h"
+#include "atropos.h"
+
+#include <errno.h>
+#include <stdbool.h>
+#include <string.h>
+#include <bpf/bpf_core_read.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+
+/*
+ * const volatiles are set during initialization and treated as consts by the
+ * jit compiler.
+ */
+
+/*
+ * Domains and cpus
+ */
+const volatile __u32 nr_doms;
+const volatile __u32 nr_cpus;
+const volatile __u32 cpu_dom_id_map[MAX_CPUS];
+const volatile __u64 dom_cpumasks[MAX_DOMS][MAX_CPUS / 64];
+
+const volatile bool switch_all;
+const volatile __u64 greedy_threshold = (u64)-1;
+
+/* base slice duration */
+const volatile __u64 slice_us = 20000;
+
+/*
+ * Exit info
+ */
+int exit_type = SCX_EXIT_NONE;
+char exit_msg[SCX_EXIT_MSG_LEN];
+
+struct pcpu_ctx {
+ __u32 dom_rr_cur; /* used when scanning other doms */
+
+ /* libbpf-rs does not respect the alignment, so pad out the struct explicitly */
+ __u8 _padding[CACHELINE_SIZE - sizeof(u64)];
+} __attribute__((aligned(CACHELINE_SIZE)));
+
+struct pcpu_ctx pcpu_ctx[MAX_CPUS];
+
+struct dom_cpumask {
+ struct bpf_cpumask __kptr_ref *cpumask;
+};
+
+/*
+ * Domain cpumasks
+ */
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __type(key, u32);
+ __type(value, struct dom_cpumask);
+ __uint(max_entries, MAX_DOMS);
+ __uint(map_flags, 0);
+} dom_cpumask_map SEC(".maps");
+
+/*
+ * Statistics
+ */
+struct {
+ __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+ __uint(key_size, sizeof(u32));
+ __uint(value_size, sizeof(u64));
+ __uint(max_entries, ATROPOS_NR_STATS);
+} stats SEC(".maps");
+
+static inline void stat_add(enum stat_idx idx, u64 addend)
+{
+ u32 idx_v = idx;
+
+ u64 *cnt_p = bpf_map_lookup_elem(&stats, &idx_v);
+ if (cnt_p)
+ (*cnt_p) += addend;
+}
+
+// Map pid -> task_ctx
+struct {
+ __uint(type, BPF_MAP_TYPE_HASH);
+ __type(key, pid_t);
+ __type(value, struct task_ctx);
+ __uint(max_entries, 1000000);
+ __uint(map_flags, 0);
+} task_data SEC(".maps");
+
+// This is populated from userspace to indicate which pids should be reassigned
+// to new doms
+struct {
+ __uint(type, BPF_MAP_TYPE_HASH);
+ __type(key, pid_t);
+ __type(value, u32);
+ __uint(max_entries, 1000);
+ __uint(map_flags, 0);
+} lb_data SEC(".maps");
+
+struct refresh_task_cpumask_loop_ctx {
+ struct task_struct *p;
+ struct task_ctx *ctx;
+};
+
+static void task_set_dq(struct task_ctx *task_ctx, struct task_struct *p,
+ u32 dom_id)
+{
+ struct dom_cpumask *dom_cpumask;
+ struct bpf_cpumask *d_cpumask, *t_cpumask;
+
+ dom_cpumask = bpf_map_lookup_elem(&dom_cpumask_map, &dom_id);
+ if (!dom_cpumask) {
+ scx_bpf_error("Failed to look up domain %u cpumask", dom_id);
+ return;
+ }
+
+ d_cpumask = bpf_cpumask_kptr_get(&dom_cpumask->cpumask);
+ if (!d_cpumask) {
+ scx_bpf_error("Failed to get domain %u cpumask kptr", dom_id);
+ return;
+ }
+
+ t_cpumask = bpf_cpumask_kptr_get(&task_ctx->cpumask);
+ if (!t_cpumask) {
+ scx_bpf_error("Failed to look up task cpumask");
+ bpf_cpumask_release(d_cpumask);
+ return;
+ }
+
+ bpf_cpumask_and(t_cpumask, (const struct cpumask *)d_cpumask, p->cpus_ptr);
+ bpf_cpumask_release(d_cpumask);
+ bpf_cpumask_release(t_cpumask);
+}
+
+s32 BPF_STRUCT_OPS(atropos_select_cpu, struct task_struct *p, int prev_cpu,
+ u32 wake_flags)
+{
+ s32 cpu;
+ pid_t pid = p->pid;
+ struct task_ctx *task_ctx = bpf_map_lookup_elem(&task_data, &pid);
+ struct bpf_cpumask *p_cpumask;
+
+ if (!task_ctx) {
+ stat_add(ATROPOS_STAT_TASK_GET_ERR, 1);
+ return prev_cpu;
+ }
+
+ bool load_balanced = false;
+ u32 *new_dom = bpf_map_lookup_elem(&lb_data, &pid);
+ if (new_dom && *new_dom != task_ctx->dom_id) {
+ task_set_dq(task_ctx, p, *new_dom);
+ stat_add(ATROPOS_STAT_LOAD_BALANCE, 1);
+ load_balanced = true;
+ }
+
+ /*
+ * If WAKE_SYNC and the machine isn't fully saturated, wake up @p to the
+ * local dq of the waker.
+ */
+ if (p->nr_cpus_allowed > 1 && (wake_flags & SCX_WAKE_SYNC)) {
+ struct task_struct *current = (void *)bpf_get_current_task();
+
+ if (!(BPF_CORE_READ(current, flags) & PF_EXITING) &&
+ task_ctx->dom_id < MAX_DOMS) {
+ struct dom_cpumask *dmask_wrapper;
+ struct bpf_cpumask *d_cpumask;
+
+ dmask_wrapper = bpf_map_lookup_elem(&dom_cpumask_map, &task_ctx->dom_id);
+ if (!dmask_wrapper) {
+ scx_bpf_error("Failed to query for domain %u cpumask",
+ task_ctx->dom_id);
+ return prev_cpu;
+ }
+ d_cpumask = bpf_cpumask_kptr_get(&dmask_wrapper->cpumask);
+ if (!d_cpumask) {
+ scx_bpf_error("Failed to acquire domain %u cpumask kptr",
+ task_ctx->dom_id);
+ return prev_cpu;
+ }
+
+ cpu = scx_bpf_pick_idle_cpu(&d_cpumask->cpumask);
+ bpf_cpumask_release(d_cpumask);
+ if (bpf_cpumask_test_cpu(cpu, p->cpus_ptr)) {
+ stat_add(ATROPOS_STAT_WAKE_SYNC, 1);
+ goto local;
+ }
+ }
+ }
+
+ /* if the previous CPU is idle, dispatch directly to it */
+ if (!load_balanced) {
+ u8 prev_idle = scx_bpf_test_and_clear_cpu_idle(prev_cpu);
+ if (*(volatile u8 *)&prev_idle) {
+ stat_add(ATROPOS_STAT_PREV_IDLE, 1);
+ cpu = prev_cpu;
+ goto local;
+ }
+ }
+
+ /* If only one core is allowed, dispatch */
+ p_cpumask = bpf_cpumask_kptr_get(&task_ctx->cpumask);
+ if (!p_cpumask) {
+ scx_bpf_error("Failed to acquire task %s cpumask kptr",
+ p->comm);
+ return prev_cpu;
+ }
+ if (p->nr_cpus_allowed == 1) {
+ cpu = bpf_cpumask_first(p_cpumask);
+ bpf_cpumask_release(p_cpumask);
+ stat_add(ATROPOS_STAT_PINNED, 1);
+ goto local;
+ }
+
+ /* Find an idle cpu and just dispatch */
+ cpu = scx_bpf_pick_idle_cpu(p_cpumask);
+ bpf_cpumask_release(p_cpumask);
+ if (cpu >= 0) {
+ stat_add(ATROPOS_STAT_DIRECT_DISPATCH, 1);
+ goto local;
+ }
+
+ return prev_cpu;
+
+local:
+ task_ctx->dispatch_local = true;
+ return cpu;
+}
+
+void BPF_STRUCT_OPS(atropos_enqueue, struct task_struct *p, u32 enq_flags)
+{
+ p->scx.slice = slice_us * 1000;
+
+ pid_t pid = p->pid;
+ struct task_ctx *task_ctx = bpf_map_lookup_elem(&task_data, &pid);
+ if (!task_ctx) {
+ stat_add(ATROPOS_STAT_TASK_GET_ERR, 1);
+ scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
+ return;
+ }
+
+ if (task_ctx->dispatch_local) {
+ task_ctx->dispatch_local = false;
+ scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, enq_flags);
+ return;
+ }
+
+ scx_bpf_dispatch(p, task_ctx->dom_id, SCX_SLICE_DFL, enq_flags);
+}
+
+static u32 cpu_to_dom_id(s32 cpu)
+{
+ if (nr_doms <= 1)
+ return 0;
+
+ if (cpu >= 0 && cpu < MAX_CPUS) {
+ u32 dom_id;
+
+ /*
+ * XXX - idk why the verifier thinks cpu_dom_id_map[cpu] is not
+ * safe here.
+ */
+ bpf_probe_read_kernel(&dom_id, sizeof(dom_id),
+ (const void *)&cpu_dom_id_map[cpu]);
+ return dom_id;
+ } else {
+ return MAX_DOMS;
+ }
+}
+
+static bool is_cpu_in_dom(u32 cpu, u32 dom_id)
+{
+ u64 mask = 0;
+
+ /*
+ * XXX - derefing two dimensional array triggers the verifier, use
+ * probe_read instead.
+ */
+ bpf_probe_read_kernel(&mask, sizeof(mask),
+ (const void *)&dom_cpumasks[dom_id][cpu / 64]);
+ return mask & (1LLU << (cpu % 64));
+}
+
+struct cpumask_intersects_domain_loop_ctx {
+ const struct cpumask *cpumask;
+ u32 dom_id;
+ bool ret;
+};
+
+static int cpumask_intersects_domain_loopfn(u32 idx, void *data)
+{
+ struct cpumask_intersects_domain_loop_ctx *lctx = data;
+ const struct cpumask *cpumask = lctx->cpumask;
+
+ if (bpf_cpumask_test_cpu(idx, cpumask) &&
+ is_cpu_in_dom(idx, lctx->dom_id)) {
+ lctx->ret = true;
+ return 1;
+ }
+ return 0;
+}
+
+static bool cpumask_intersects_domain(const struct cpumask *cpumask, u32 dom_id)
+{
+ struct cpumask_intersects_domain_loop_ctx lctx = {
+ .cpumask = cpumask,
+ .dom_id = dom_id,
+ .ret = false,
+ };
+
+ bpf_loop(nr_cpus, cpumask_intersects_domain_loopfn, &lctx, 0);
+ return lctx.ret;
+}
+
+static u32 dom_rr_next(s32 cpu)
+{
+ if (cpu >= 0 && cpu < MAX_CPUS) {
+ struct pcpu_ctx *pcpuc = &pcpu_ctx[cpu];
+ u32 dom_id = (pcpuc->dom_rr_cur + 1) % nr_doms;
+
+ if (dom_id == cpu_to_dom_id(cpu))
+ dom_id = (dom_id + 1) % nr_doms;
+
+ pcpuc->dom_rr_cur = dom_id;
+ return dom_id;
+ }
+ return 0;
+}
+
+static int greedy_loopfn(s32 idx, void *data)
+{
+ u32 dom_id = dom_rr_next(*(s32 *)data);
+
+ if (scx_bpf_dsq_nr_queued(dom_id) > greedy_threshold &&
+ scx_bpf_consume(dom_id)) {
+ stat_add(ATROPOS_STAT_GREEDY, 1);
+ return 1;
+ }
+ return 0;
+}
+
+void BPF_STRUCT_OPS(atropos_dispatch, s32 cpu, struct task_struct *prev)
+{
+ u32 dom = cpu_to_dom_id(cpu);
+ if (scx_bpf_consume(dom)) {
+ stat_add(ATROPOS_STAT_DSQ_DISPATCH, 1);
+ return;
+ }
+
+ if (greedy_threshold != (u64)-1)
+ bpf_loop(nr_doms - 1, greedy_loopfn, &cpu, 0);
+}
+
+struct pick_task_domain_loop_ctx {
+ struct task_struct *p;
+ const struct cpumask *cpumask;
+ u64 dom_mask;
+ u32 dom_rr_base;
+ u32 dom_id;
+};
+
+static int pick_task_domain_loopfn(u32 idx, void *data)
+{
+ struct pick_task_domain_loop_ctx *lctx = data;
+ u32 dom_id = (lctx->dom_rr_base + idx) % nr_doms;
+
+ if (dom_id >= MAX_DOMS)
+ return 1;
+
+ if (cpumask_intersects_domain(lctx->cpumask, dom_id)) {
+ lctx->dom_mask |= 1LLU << dom_id;
+ if (lctx->dom_id == MAX_DOMS)
+ lctx->dom_id = dom_id;
+ }
+ return 0;
+}
+
+static u32 pick_task_domain(struct task_ctx *task_ctx, struct task_struct *p,
+ const struct cpumask *cpumask)
+{
+ struct pick_task_domain_loop_ctx lctx = {
+ .p = p,
+ .cpumask = cpumask,
+ .dom_id = MAX_DOMS,
+ };
+ s32 cpu = bpf_get_smp_processor_id();
+
+ if (cpu < 0 || cpu >= MAX_CPUS)
+ return MAX_DOMS;
+
+ lctx.dom_rr_base = ++(pcpu_ctx[cpu].dom_rr_cur);
+
+ bpf_loop(nr_doms, pick_task_domain_loopfn, &lctx, 0);
+ task_ctx->dom_mask = lctx.dom_mask;
+
+ return lctx.dom_id;
+}
+
+static void task_set_domain(struct task_ctx *task_ctx, struct task_struct *p,
+ const struct cpumask *cpumask)
+{
+ u32 dom_id = 0;
+
+ if (nr_doms > 1)
+ dom_id = pick_task_domain(task_ctx, p, cpumask);
+
+ task_set_dq(task_ctx, p, dom_id);
+}
+
+void BPF_STRUCT_OPS(atropos_set_cpumask, struct task_struct *p,
+ const struct cpumask *cpumask)
+{
+ pid_t pid = p->pid;
+ struct task_ctx *task_ctx = bpf_map_lookup_elem(&task_data, &pid);
+ if (!task_ctx) {
+ stat_add(ATROPOS_STAT_TASK_GET_ERR, 1);
+ return;
+ }
+
+ task_set_domain(task_ctx, p, cpumask);
+}
+
+s32 BPF_STRUCT_OPS(atropos_prep_enable, struct task_struct *p,
+ struct scx_enable_args *args)
+{
+ struct bpf_cpumask *cpumask;
+ struct task_ctx task_ctx, *map_value;
+ long ret;
+ pid_t pid;
+
+ memset(&task_ctx, 0, sizeof(task_ctx));
+ task_ctx.weight = p->scx.weight;
+
+ pid = p->pid;
+ ret = bpf_map_update_elem(&task_data, &pid, &task_ctx, BPF_NOEXIST);
+ if (ret) {
+ stat_add(ATROPOS_STAT_TASK_GET_ERR, 1);
+ return ret;
+ }
+
+ /*
+ * Read the entry from the map immediately so we can add the cpumask
+ * with bpf_kptr_xchg().
+ */
+ map_value = bpf_map_lookup_elem(&task_data, &pid);
+ if (!map_value)
+ /* Should never happen -- it was just inserted above. */
+ return -EINVAL;
+
+ cpumask = bpf_cpumask_create();
+ if (!cpumask) {
+ bpf_map_delete_elem(&task_data, &pid);
+ return -ENOMEM;
+ }
+
+ cpumask = bpf_kptr_xchg(&map_value->cpumask, cpumask);
+ if (cpumask) {
+ /* Should never happen as we just inserted it above. */
+ bpf_cpumask_release(cpumask);
+ bpf_map_delete_elem(&task_data, &pid);
+ return -EINVAL;
+ }
+
+ task_set_domain(map_value, p, p->cpus_ptr);
+
+ return 0;
+}
+
+void BPF_STRUCT_OPS(atropos_disable, struct task_struct *p)
+{
+ pid_t pid = p->pid;
+ long ret = bpf_map_delete_elem(&task_data, &pid);
+ if (ret) {
+ stat_add(ATROPOS_STAT_TASK_GET_ERR, 1);
+ return;
+ }
+}
+
+struct initialize_domain_loop_ctx {
+ struct bpf_cpumask *cpumask;
+ u32 dom_id;
+};
+
+static int set_cpumask_bit(u32 idx, void *data)
+{
+ struct initialize_domain_loop_ctx *lctx = data;
+ u32 dom_id = lctx->dom_id;
+ u64 mask = 1LLU << (idx % 64);
+ const volatile __u64 *dptr;
+
+ dptr = MEMBER_VPTR(dom_cpumasks, [dom_id][idx / 64]);
+ if (!dptr) {
+ scx_bpf_error("Failed to initialize cpu %u for dom %u", idx, dom_id);
+ return 1;
+ }
+
+ if ((*dptr & mask))
+ bpf_cpumask_set_cpu(idx, lctx->cpumask);
+ else
+ bpf_cpumask_clear_cpu(idx, lctx->cpumask);
+
+ return 0;
+}
+
+static int create_local_dsq(u32 idx, void *data)
+{
+ struct dom_cpumask entry, *v;
+ struct bpf_cpumask *cpumask;
+ struct initialize_domain_loop_ctx loop_ctx;
+ u32 dom_id = idx;
+ s64 ret;
+
+ ret = scx_bpf_create_dsq(dom_id, -1);
+ if (ret < 0) {
+ scx_bpf_error("Failed to create dsq %u (%d)", dom_id, ret);
+ return 1;
+ }
+
+ memset(&entry, 0, sizeof(entry));
+ ret = bpf_map_update_elem(&dom_cpumask_map, &dom_id, &entry, 0);
+ if (ret) {
+ scx_bpf_error("Failed to add dom_cpumask entry %u (%d)", dom_id, ret);
+ return 1;
+ }
+
+ v = bpf_map_lookup_elem(&dom_cpumask_map, &dom_id);
+ if (!v) {
+ /* Should never happen, we just inserted it above. */
+ scx_bpf_error("Failed to lookup dom element %u", dom_id);
+ return 1;
+ }
+
+ cpumask = bpf_cpumask_create();
+ if (!cpumask) {
+ scx_bpf_error("Failed to create BPF cpumask for domain %u", dom_id);
+ return 1;
+ }
+
+ loop_ctx.cpumask = cpumask;
+ loop_ctx.dom_id = dom_id;
+ if (bpf_loop(nr_cpus, set_cpumask_bit, &loop_ctx, 0) != nr_cpus) {
+ scx_bpf_error("Failed to initialize cpumask for domain %u", dom_id);
+ bpf_cpumask_release(cpumask);
+ return 1;
+ }
+
+ cpumask = bpf_kptr_xchg(&v->cpumask, cpumask);
+ if (cpumask) {
+ scx_bpf_error("Domain %u was already present", dom_id);
+ bpf_cpumask_release(cpumask);
+ return 1;
+ }
+
+ return 0;
+}
+
+int BPF_STRUCT_OPS_SLEEPABLE(atropos_init)
+{
+ u32 local_nr_doms = nr_doms;
+
+ bpf_printk("atropos init");
+
+ if (switch_all)
+ scx_bpf_switch_all();
+
+ // BPF verifier gets cranky if we don't bound this like so
+ if (local_nr_doms > MAX_DOMS)
+ local_nr_doms = MAX_DOMS;
+
+ bpf_loop(local_nr_doms, create_local_dsq, NULL, 0);
+
+ for (u32 i = 0; i < nr_cpus; ++i) {
+ pcpu_ctx[i].dom_rr_cur = i;
+ }
+
+ return 0;
+}
+
+void BPF_STRUCT_OPS(atropos_exit, struct scx_exit_info *ei)
+{
+ bpf_probe_read_kernel_str(exit_msg, sizeof(exit_msg), ei->msg);
+ exit_type = ei->type;
+}
+
+SEC(".struct_ops")
+struct sched_ext_ops atropos = {
+ .select_cpu = (void *)atropos_select_cpu,
+ .enqueue = (void *)atropos_enqueue,
+ .dispatch = (void *)atropos_dispatch,
+ .set_cpumask = (void *)atropos_set_cpumask,
+ .prep_enable = (void *)atropos_prep_enable,
+ .disable = (void *)atropos_disable,
+ .init = (void *)atropos_init,
+ .exit = (void *)atropos_exit,
+ .flags = 0,
+ .name = "atropos",
+};
diff --git a/tools/sched_ext/atropos/src/bpf/atropos.h b/tools/sched_ext/atropos/src/bpf/atropos.h
new file mode 100644
index 000000000000..921210ec2a3c
--- /dev/null
+++ b/tools/sched_ext/atropos/src/bpf/atropos.h
@@ -0,0 +1,44 @@
+// Copyright (c) Meta Platforms, Inc. and affiliates.
+
+// This software may be used and distributed according to the terms of the
+// GNU General Public License version 2.
+#ifndef __ATROPOS_H
+#define __ATROPOS_H
+
+#include <stdbool.h>
+#ifndef __kptr_ref
+#ifdef __KERNEL__
+#error "__kptr_ref not defined in the kernel"
+#endif
+#define __kptr_ref
+#endif
+
+#define MAX_CPUS 512
+#define MAX_DOMS 64 /* limited to avoid complex bitmask ops */
+#define CACHELINE_SIZE 64
+
+/* Statistics */
+enum stat_idx {
+ ATROPOS_STAT_TASK_GET_ERR,
+ ATROPOS_STAT_TASK_GET_ERR_ENABLE,
+ ATROPOS_STAT_CPUMASK_ERR,
+ ATROPOS_STAT_WAKE_SYNC,
+ ATROPOS_STAT_PREV_IDLE,
+ ATROPOS_STAT_PINNED,
+ ATROPOS_STAT_DIRECT_DISPATCH,
+ ATROPOS_STAT_DSQ_DISPATCH,
+ ATROPOS_STAT_GREEDY,
+ ATROPOS_STAT_LOAD_BALANCE,
+ ATROPOS_STAT_LAST_TASK,
+ ATROPOS_NR_STATS,
+};
+
+struct task_ctx {
+ unsigned long long dom_mask; /* the domains this task can run on */
+ struct bpf_cpumask __kptr_ref *cpumask;
+ unsigned int dom_id;
+ unsigned int weight;
+ bool dispatch_local;
+};
+
+#endif /* __ATROPOS_H */
diff --git a/tools/sched_ext/atropos/src/main.rs b/tools/sched_ext/atropos/src/main.rs
new file mode 100644
index 000000000000..b9ae312a562f
--- /dev/null
+++ b/tools/sched_ext/atropos/src/main.rs
@@ -0,0 +1,648 @@
+// Copyright (c) Meta Platforms, Inc. and affiliates.
+
+// This software may be used and distributed according to the terms of the
+// GNU General Public License version 2.
+#![deny(clippy::all)]
+use std::collections::BTreeMap;
+use std::ffi::CStr;
+use std::sync::atomic::AtomicBool;
+use std::sync::atomic::Ordering;
+use std::sync::Arc;
+
+use ::fb_procfs as procfs;
+use anyhow::anyhow;
+use anyhow::bail;
+use anyhow::Context;
+use bitvec::prelude::*;
+use clap::Parser;
+
+mod util;
+
+oss_shim!();
+
+/// Atropos is a multi-domain BPF / userspace hybrid scheduler where the BPF
+/// part does simple round robin in each domain and the userspace part
+/// calculates the load factor of each domain and tells the BPF part how to load
+/// balance the domains.
+
+/// This scheduler demonstrates dividing scheduling logic between BPF and
+/// userspace and using rust to build the userspace part. An earlier variant of
+/// this scheduler was used to balance across six domains, each representing a
+/// chiplet in a six-chiplet AMD processor, and could match the performance of
+/// production setup using CFS.
+#[derive(Debug, Parser)]
+struct Opt {
+ /// Set the log level for more or less verbose output. --log_level=debug
+ /// will output libbpf verbose details
+ #[clap(short, long, default_value = "info")]
+ log_level: String,
+ /// Set the cpumask for a domain, provide multiple --cpumasks, one for each
+ /// domain. E.g. --cpumasks 0xff_00ff --cpumasks 0xff00 will create two
+ /// domains with the corresponding CPUs belonging to each domain. Each CPU
+ /// must belong to precisely one domain.
+ #[clap(short, long, required = true, min_values = 1)]
+ cpumasks: Vec<String>,
+ /// Switch all tasks to sched_ext. If not specified, only tasks which
+ /// have their scheduling policy set to SCHED_EXT using
+ /// sched_setscheduler(2) are switched.
+ #[clap(short, long, default_value = "false")]
+ all: bool,
+ /// Enable load balancing. Periodically userspace will calculate the load
+ /// factor of each domain and instruct BPF which processes to move.
+ #[clap(short, long, default_value = "true")]
+ load_balance: bool,
+ /// Enable greedy task stealing. When a domain is idle, a cpu will attempt
+ /// to steal tasks from a domain with at least greedy_threshold tasks
+ /// enqueued. These tasks aren't permanently stolen from the domain.
+ #[clap(short, long)]
+ greedy_threshold: Option<u64>,
+}
+
+type CpusetDqPair = (Vec<BitVec<u64, Lsb0>>, Vec<i32>);
+
+// Returns Vec of cpuset for each dq and a vec of dq for each cpu
+fn parse_cpusets(cpumasks: &[String]) -> anyhow::Result<CpusetDqPair> {
+ if cpumasks.len() > atropos_sys::MAX_DOMS as usize {
+ bail!(
+ "Number of requested DSQs ({}) is greater than MAX_DOMS ({})",
+ cpumasks.len(),
+ atropos_sys::MAX_DOMS
+ );
+ }
+ let num_cpus = libbpf_rs::num_possible_cpus()?;
+ if num_cpus > atropos_sys::MAX_CPUS as usize {
+ bail!(
+ "num_cpus ({}) is greater than MAX_CPUS ({})",
+ num_cpus,
+ atropos_sys::MAX_CPUS,
+ );
+ }
+ let mut cpus = vec![-1i32; num_cpus];
+ let mut cpusets = vec![bitvec![u64, Lsb0; 0; atropos_sys::MAX_CPUS as usize]; cpumasks.len()];
+ for (dq, cpumask) in cpumasks.iter().enumerate() {
+ let hex_str = {
+ let mut tmp_str = cpumask
+ .strip_prefix("0x")
+ .unwrap_or(cpumask)
+ .replace('_', "");
+ if tmp_str.len() % 2 != 0 {
+ tmp_str = "0".to_string() + &tmp_str;
+ }
+ tmp_str
+ };
+ let byte_vec = hex::decode(&hex_str)
+ .with_context(|| format!("Failed to parse cpumask: {}", cpumask))?;
+
+ for (index, &val) in byte_vec.iter().rev().enumerate() {
+ let mut v = val;
+ while v != 0 {
+ let lsb = v.trailing_zeros() as usize;
+ v &= !(1 << lsb);
+ let cpu = index * 8 + lsb;
+ if cpu > num_cpus {
+ bail!(
+ concat!(
+ "Found cpu ({}) in cpumask ({}) which is larger",
+ " than the number of cpus on the machine ({})"
+ ),
+ cpu,
+ cpumask,
+ num_cpus
+ );
+ }
+ if cpus[cpu] != -1 {
+ bail!(
+ "Found cpu ({}) with dq ({}) but also in cpumask ({})",
+ cpu,
+ cpus[cpu],
+ cpumask
+ );
+ }
+ cpus[cpu] = dq as i32;
+ cpusets[dq].set(cpu, true);
+ }
+ }
+ cpusets[dq].set_uninitialized(false);
+ }
+
+ for (cpu, &dq) in cpus.iter().enumerate() {
+ if dq < 0 {
+ bail!(
+ "Cpu {} not assigned to any dq. Make sure it is covered by some --cpumasks argument.",
+ cpu
+ );
+ }
+ }
+
+ Ok((cpusets, cpus))
+}
+
+struct Sample {
+ total_cpu: procfs::CpuStat,
+}
+
+fn get_cpustats(reader: &mut procfs::ProcReader) -> anyhow::Result<Sample> {
+ let stat = reader.read_stat().context("Failed to read procfs")?;
+ Ok(Sample {
+ total_cpu: stat
+ .total_cpu
+ .ok_or_else(|| anyhow!("Could not read total cpu stat in proc"))?,
+ })
+}
+
+fn calculate_cpu_busy(prev: &procfs::CpuStat, next: &procfs::CpuStat) -> anyhow::Result<f64> {
+ match (prev, next) {
+ (
+ procfs::CpuStat {
+ user_usec: Some(prev_user),
+ nice_usec: Some(prev_nice),
+ system_usec: Some(prev_system),
+ idle_usec: Some(prev_idle),
+ iowait_usec: Some(prev_iowait),
+ irq_usec: Some(prev_irq),
+ softirq_usec: Some(prev_softirq),
+ stolen_usec: Some(prev_stolen),
+ guest_usec: _,
+ guest_nice_usec: _,
+ },
+ procfs::CpuStat {
+ user_usec: Some(curr_user),
+ nice_usec: Some(curr_nice),
+ system_usec: Some(curr_system),
+ idle_usec: Some(curr_idle),
+ iowait_usec: Some(curr_iowait),
+ irq_usec: Some(curr_irq),
+ softirq_usec: Some(curr_softirq),
+ stolen_usec: Some(curr_stolen),
+ guest_usec: _,
+ guest_nice_usec: _,
+ },
+ ) => {
+ let idle_usec = curr_idle - prev_idle;
+ let iowait_usec = curr_iowait - prev_iowait;
+ let user_usec = curr_user - prev_user;
+ let system_usec = curr_system - prev_system;
+ let nice_usec = curr_nice - prev_nice;
+ let irq_usec = curr_irq - prev_irq;
+ let softirq_usec = curr_softirq - prev_softirq;
+ let stolen_usec = curr_stolen - prev_stolen;
+
+ let busy_usec =
+ user_usec + system_usec + nice_usec + irq_usec + softirq_usec + stolen_usec;
+ let total_usec = idle_usec + busy_usec + iowait_usec;
+ Ok(busy_usec as f64 / total_usec as f64)
+ }
+ _ => {
+ bail!("Some procfs stats are not populated!");
+ }
+ }
+}
+
+fn calculate_pid_busy(
+ prev: &procfs::PidStat,
+ next: &procfs::PidStat,
+ dur: std::time::Duration,
+) -> anyhow::Result<f64> {
+ match (
+ (prev.user_usecs, prev.system_usecs),
+ (next.user_usecs, prev.system_usecs),
+ ) {
+ ((Some(prev_user), Some(prev_system)), (Some(next_user), Some(next_system))) => {
+ if (next_user >= prev_user) && (next_system >= prev_system) {
+ let busy_usec = next_user + next_system - prev_user - prev_system;
+ Ok(busy_usec as f64 / dur.as_micros() as f64)
+ } else {
+ bail!("Pid usage values look wrong");
+ }
+ }
+ _ => {
+ bail!("Some procfs stats are not populated!");
+ }
+ }
+}
+
+struct PidInfo {
+ pub pid: i32,
+ pub dom: u32,
+ pub dom_mask: u64,
+}
+
+struct LoadInfo {
+ pids_by_milliload: BTreeMap<u64, PidInfo>,
+ pid_stats: BTreeMap<i32, procfs::PidStat>,
+ global_load_sum: f64,
+ dom_load: Vec<f64>,
+}
+
+// We calculate the load for each task and then each dom by enumerating all the
+// tasks in task_data and calculating their CPU util from procfs.
+
+// Given procfs reader, task data map, and pidstat from previous calculation,
+// return:
+// * a sorted map from milliload -> pid_data,
+// * a map from pid -> pidstat
+// * a vec of per-dom looads
+fn calculate_load(
+ proc_reader: &procfs::ProcReader,
+ task_data: &libbpf_rs::Map,
+ interval: std::time::Duration,
+ prev_pid_stat: &BTreeMap<i32, procfs::PidStat>,
+ nr_doms: usize,
+) -> anyhow::Result<LoadInfo> {
+ let mut ret = LoadInfo {
+ pids_by_milliload: BTreeMap::new(),
+ pid_stats: BTreeMap::new(),
+ global_load_sum: 0f64,
+ dom_load: vec![0f64; nr_doms],
+ };
+ for key in task_data.keys() {
+ if let Some(task_ctx_vec) = task_data
+ .lookup(&key, libbpf_rs::MapFlags::ANY)
+ .context("Failed to lookup task_data")?
+ {
+ let task_ctx =
+ unsafe { &*(task_ctx_vec.as_slice().as_ptr() as *const atropos_sys::task_ctx) };
+ let pid = i32::from_ne_bytes(
+ key.as_slice()
+ .try_into()
+ .context("Invalid key length in task_data map")?,
+ );
+ match proc_reader.read_tid_stat(pid as u32) {
+ Ok(stat) => {
+ ret.pid_stats.insert(pid, stat);
+ }
+ Err(procfs::Error::IoError(_, ref e))
+ if e.raw_os_error()
+ .map_or(false, |ec| ec == 2 || ec == 3 /* ENOENT or ESRCH */) =>
+ {
+ continue;
+ }
+ Err(e) => {
+ bail!(e);
+ }
+ }
+ let pid_load = match (prev_pid_stat.get(&pid), ret.pid_stats.get(&pid)) {
+ (Some(prev_pid_stat), Some(next_pid_stat)) => {
+ calculate_pid_busy(prev_pid_stat, next_pid_stat, interval)?
+ }
+ // If we don't have any utilization #s for the process, just skip it
+ _ => {
+ continue;
+ }
+ } * task_ctx.weight as f64;
+ if !pid_load.is_finite() || pid_load <= 0.0 {
+ continue;
+ }
+ ret.global_load_sum += pid_load;
+ ret.dom_load[task_ctx.dom_id as usize] += pid_load;
+ // Only record pids that are eligible for load balancing
+ if task_ctx.dom_mask == (1u64 << task_ctx.dom_id) {
+ continue;
+ }
+ ret.pids_by_milliload.insert(
+ (pid_load * 1000.0) as u64,
+ PidInfo {
+ pid,
+ dom: task_ctx.dom_id,
+ dom_mask: task_ctx.dom_mask,
+ },
+ );
+ }
+ }
+ Ok(ret)
+}
+
+#[derive(Copy, Clone, Default)]
+struct DomLoadBalanceInfo {
+ load_to_pull: f64,
+ load_to_give: f64,
+}
+
+#[derive(Default)]
+struct LoadBalanceInfo {
+ doms: Vec<DomLoadBalanceInfo>,
+ doms_with_load_to_pull: BTreeMap<u32, f64>,
+ doms_with_load_to_give: BTreeMap<u32, f64>,
+}
+
+// To balance dom loads we identify doms with lower and higher load than average
+fn calculate_dom_load_balance(global_load_avg: f64, dom_load: &[f64]) -> LoadBalanceInfo {
+ let mut ret = LoadBalanceInfo::default();
+ ret.doms.resize(dom_load.len(), Default::default());
+
+ const LOAD_IMBAL_HIGH_PCT: f64 = 0.10;
+ const LOAD_IMBAL_MAX_ADJ_PCT: f64 = 0.10;
+ let high = global_load_avg * LOAD_IMBAL_HIGH_PCT;
+ let adj_max = global_load_avg * LOAD_IMBAL_MAX_ADJ_PCT;
+
+ for (dom, dom_load) in dom_load.iter().enumerate() {
+ let mut imbal = dom_load - global_load_avg;
+
+ let mut dom_load_to_pull = 0f64;
+ let mut dom_load_to_give = 0f64;
+ if imbal >= 0f64 {
+ dom_load_to_give = imbal;
+ } else {
+ imbal = -imbal;
+ if imbal > high {
+ dom_load_to_pull = f64::min(imbal, adj_max);
+ }
+ }
+ ret.doms[dom].load_to_pull = dom_load_to_pull;
+ ret.doms[dom].load_to_give = dom_load_to_give;
+ if dom_load_to_pull > 0f64 {
+ ret.doms_with_load_to_pull
+ .insert(dom as u32, dom_load_to_pull);
+ }
+ if dom_load_to_give > 0f64 {
+ ret.doms_with_load_to_give
+ .insert(dom as u32, dom_load_to_give);
+ }
+ }
+ ret
+}
+
+fn clear_map(map: &mut libbpf_rs::Map) {
+ // XXX: libbpf_rs has some design flaw that make it impossible to
+ // delete while iterating despite it being safe so we alias it here
+ let deleter: &mut libbpf_rs::Map = unsafe { &mut *(map as *mut _) };
+ for key in map.keys() {
+ let _ = deleter.delete(&key);
+ }
+}
+
+// Actually execute the load balancing. Concretely this writes pid -> dom
+// entries into the lb_data map for bpf side to consume.
+//
+// The logic here is simple, greedily balance the heaviest load processes until
+// either we have no doms with load to give or no doms with load to pull.
+fn load_balance(
+ global_load_avg: f64,
+ lb_data: &mut libbpf_rs::Map,
+ pids_by_milliload: &BTreeMap<u64, PidInfo>,
+ mut doms_with_load_to_pull: BTreeMap<u32, f64>,
+ mut doms_with_load_to_give: BTreeMap<u32, f64>,
+) -> anyhow::Result<()> {
+ clear_map(lb_data);
+ const LOAD_IMBAL_MIN_ADJ_PCT: f64 = 0.01;
+ let adj_min = global_load_avg * LOAD_IMBAL_MIN_ADJ_PCT;
+ for (pid_milliload, pidinfo) in pids_by_milliload.iter().rev() {
+ if doms_with_load_to_give.is_empty() || doms_with_load_to_pull.is_empty() {
+ break;
+ }
+
+ let pid_load = *pid_milliload as f64 / 1000f64;
+ let mut remove_to_give = None;
+ let mut remove_to_pull = None;
+ if let Some(dom_imbal) = doms_with_load_to_give.get_mut(&pidinfo.dom) {
+ if *dom_imbal < pid_load {
+ continue;
+ }
+
+ for (new_dom, new_dom_imbal) in doms_with_load_to_pull.iter_mut() {
+ if (pidinfo.dom_mask & (1 << new_dom)) == 0 || *new_dom_imbal < pid_load {
+ continue;
+ }
+
+ *dom_imbal -= pid_load;
+ if *dom_imbal <= adj_min {
+ remove_to_give = Some(pidinfo.dom);
+ }
+ *new_dom_imbal -= pid_load;
+ if *new_dom_imbal <= adj_min {
+ remove_to_pull = Some(pidinfo.dom);
+ }
+
+ lb_data
+ .update(
+ &(pidinfo.pid as libc::pid_t).to_ne_bytes(),
+ &new_dom.to_ne_bytes(),
+ libbpf_rs::MapFlags::NO_EXIST,
+ )
+ .context("Failed to update lb_data")?;
+ break;
+ }
+ }
+
+ remove_to_give.map(|dom| doms_with_load_to_give.remove(&dom));
+ remove_to_pull.map(|dom| doms_with_load_to_pull.remove(&dom));
+ }
+ Ok(())
+}
+
+fn print_stats(
+ logger: slog::Logger,
+ stats_map: &mut libbpf_rs::Map,
+ nr_doms: usize,
+ nr_cpus: usize,
+ cpu_busy: f64,
+ global_load_avg: f64,
+ dom_load: &[f64],
+ dom_lb_info: &[DomLoadBalanceInfo],
+) -> anyhow::Result<()> {
+ let stats = {
+ let mut stats: Vec<u64> = Vec::new();
+ let zero_vec = vec![vec![0u8; stats_map.value_size() as usize]; nr_cpus];
+ for stat in 0..atropos_sys::stat_idx_ATROPOS_NR_STATS {
+ let cpu_stat_vec = stats_map
+ .lookup_percpu(&(stat as u32).to_ne_bytes(), libbpf_rs::MapFlags::ANY)
+ .with_context(|| format!("Failed to lookup stat {}", stat))?
+ .expect("per-cpu stat should exist");
+ let sum = cpu_stat_vec
+ .iter()
+ .map(|val| {
+ u64::from_ne_bytes(
+ val.as_slice()
+ .try_into()
+ .expect("Invalid value length in stat map"),
+ )
+ })
+ .sum();
+ stats_map
+ .update_percpu(
+ &(stat as u32).to_ne_bytes(),
+ &zero_vec,
+ libbpf_rs::MapFlags::ANY,
+ )
+ .context("Failed to zero stat")?;
+ stats.push(sum);
+ }
+ stats
+ };
+ let mut total = 0;
+ total += stats[atropos_sys::stat_idx_ATROPOS_STAT_WAKE_SYNC as usize];
+ total += stats[atropos_sys::stat_idx_ATROPOS_STAT_PREV_IDLE as usize];
+ total += stats[atropos_sys::stat_idx_ATROPOS_STAT_PINNED as usize];
+ total += stats[atropos_sys::stat_idx_ATROPOS_STAT_DIRECT_DISPATCH as usize];
+ total += stats[atropos_sys::stat_idx_ATROPOS_STAT_DSQ_DISPATCH as usize];
+ total += stats[atropos_sys::stat_idx_ATROPOS_STAT_GREEDY as usize];
+ total += stats[atropos_sys::stat_idx_ATROPOS_STAT_LAST_TASK as usize];
+ slog::info!(logger, "cpu={:5.1}", cpu_busy * 100.0);
+ slog::info!(
+ logger,
+ "task_get_errs: {}, cpumask_errs: {}",
+ stats[atropos_sys::stat_idx_ATROPOS_STAT_TASK_GET_ERR as usize],
+ stats[atropos_sys::stat_idx_ATROPOS_STAT_CPUMASK_ERR as usize],
+ );
+ slog::info!(
+ logger,
+ "tot={:6} wake_sync={:4.1},prev_idle={:4.1},pinned={:4.1},direct={:4.1},dq={:4.1},greedy={:4.1}",
+ total,
+ stats[atropos_sys::stat_idx_ATROPOS_STAT_WAKE_SYNC as usize] as f64 / total as f64 * 100f64,
+ stats[atropos_sys::stat_idx_ATROPOS_STAT_PREV_IDLE as usize] as f64 / total as f64 * 100f64,
+ stats[atropos_sys::stat_idx_ATROPOS_STAT_PINNED as usize] as f64 / total as f64 * 100f64,
+ stats[atropos_sys::stat_idx_ATROPOS_STAT_DIRECT_DISPATCH as usize] as f64 / total as f64
+ * 100f64,
+ stats[atropos_sys::stat_idx_ATROPOS_STAT_DSQ_DISPATCH as usize] as f64 / total as f64
+ * 100f64,
+ stats[atropos_sys::stat_idx_ATROPOS_STAT_GREEDY as usize] as f64 / total as f64 * 100f64,
+ );
+
+ slog::info!(
+ logger,
+ "load_avg:{:.1}, load_balances={}",
+ global_load_avg,
+ stats[atropos_sys::stat_idx_ATROPOS_STAT_LOAD_BALANCE as usize]
+ );
+ for i in 0..nr_doms {
+ slog::info!(logger, "DOM[{:02}]", i);
+ slog::info!(
+ logger,
+ " load={:.1} to_pull={:.1},to_give={:.1}",
+ dom_load[i],
+ dom_lb_info[i].load_to_pull,
+ dom_lb_info[i].load_to_give,
+ );
+ }
+ Ok(())
+}
+
+pub fn run(
+ logger: slog::Logger,
+ debug: bool,
+ cpumasks: Vec<String>,
+ switch_all: bool,
+ balance_load: bool,
+ greedy_threshold: Option<u64>,
+) -> anyhow::Result<()> {
+ slog::info!(logger, "Atropos Scheduler Initialized");
+ let mut skel_builder = AtroposSkelBuilder::default();
+ skel_builder.obj_builder.debug(debug);
+ let mut skel = skel_builder.open().context("Failed to open BPF program")?;
+
+ let (cpusets, cpus) = parse_cpusets(&cpumasks)?;
+ let nr_doms = cpusets.len();
+ let nr_cpus = libbpf_rs::num_possible_cpus()?;
+ skel.rodata().nr_doms = nr_doms as u32;
+ skel.rodata().nr_cpus = nr_cpus as u32;
+
+ for (cpu, dom) in cpus.iter().enumerate() {
+ skel.rodata().cpu_dom_id_map[cpu] = *dom as u32;
+ }
+
+ for (dom, cpuset) in cpusets.iter().enumerate() {
+ let raw_cpuset_slice = cpuset.as_raw_slice();
+ let dom_cpumask_slice = &mut skel.rodata().dom_cpumasks[dom];
+ let (left, _) = dom_cpumask_slice.split_at_mut(raw_cpuset_slice.len());
+ left.clone_from_slice(cpuset.as_raw_slice());
+ slog::info!(logger, "dom {} cpumask {:X?}", dom, dom_cpumask_slice);
+ }
+
+ skel.rodata().switch_all = switch_all;
+
+ if let Some(greedy) = greedy_threshold {
+ skel.rodata().greedy_threshold = greedy;
+ }
+
+ let mut skel = skel.load().context("Failed to load BPF program")?;
+ skel.attach().context("Failed to attach BPF program")?;
+
+ let _structops = skel
+ .maps_mut()
+ .atropos()
+ .attach_struct_ops()
+ .context("Failed to attach atropos struct ops")?;
+ slog::info!(logger, "Atropos Scheduler Attached");
+ let shutdown = Arc::new(AtomicBool::new(false));
+ let shutdown_clone = shutdown.clone();
+ ctrlc::set_handler(move || {
+ shutdown_clone.store(true, Ordering::Relaxed);
+ })
+ .context("Error setting Ctrl-C handler")?;
+
+ let mut proc_reader = procfs::ProcReader::new();
+ let mut prev_sample = get_cpustats(&mut proc_reader)?;
+ let mut prev_pid_stat: BTreeMap<i32, procfs::PidStat> = BTreeMap::new();
+ while !shutdown.load(Ordering::Relaxed)
+ && unsafe { std::ptr::read_volatile(&skel.bss().exit_type as *const _) } == 0
+ {
+ let interval = std::time::Duration::from_secs(1);
+ std::thread::sleep(interval);
+ let now = std::time::SystemTime::now();
+ let next_sample = get_cpustats(&mut proc_reader)?;
+ let cpu_busy = calculate_cpu_busy(&prev_sample.total_cpu, &next_sample.total_cpu)?;
+ prev_sample = next_sample;
+ let load_info = calculate_load(
+ &proc_reader,
+ skel.maps().task_data(),
+ interval,
+ &prev_pid_stat,
+ nr_doms,
+ )?;
+ prev_pid_stat = load_info.pid_stats;
+
+ let global_load_avg = load_info.global_load_sum / nr_doms as f64;
+ let mut lb_info = calculate_dom_load_balance(global_load_avg, &load_info.dom_load);
+
+ let doms_with_load_to_pull = std::mem::take(&mut lb_info.doms_with_load_to_pull);
+ let doms_with_load_to_give = std::mem::take(&mut lb_info.doms_with_load_to_give);
+ if balance_load {
+ load_balance(
+ global_load_avg,
+ skel.maps_mut().lb_data(),
+ &load_info.pids_by_milliload,
+ doms_with_load_to_pull,
+ doms_with_load_to_give,
+ )?;
+ slog::info!(
+ logger,
+ "Load balancing took {:?}",
+ now.elapsed().context("Getting a duration failed")?
+ );
+ }
+ print_stats(
+ logger.clone(),
+ skel.maps_mut().stats(),
+ nr_doms,
+ nr_cpus,
+ cpu_busy,
+ global_load_avg,
+ &load_info.dom_load,
+ &lb_info.doms,
+ )?;
+ }
+ /* Report msg if EXT_OPS_EXIT_ERROR */
+ if skel.bss().exit_type == 2 {
+ let exit_msg_cstr = unsafe { CStr::from_ptr(skel.bss().exit_msg.as_ptr() as *const _) };
+ let exit_msg = exit_msg_cstr
+ .to_str()
+ .context("Failed to convert exit msg to string")?;
+ eprintln!("exit_type={} msg={}", skel.bss().exit_type, exit_msg);
+ }
+ Ok(())
+}
+
+fn main() -> anyhow::Result<()> {
+ let opts = Opt::parse();
+ let logger = setup_logger(&opts.log_level)?;
+ let debug = opts.log_level == "debug";
+
+ run(
+ logger,
+ debug,
+ opts.cpumasks,
+ opts.all,
+ opts.load_balance,
+ opts.greedy_threshold,
+ )
+}
diff --git a/tools/sched_ext/atropos/src/oss/atropos_sys.rs b/tools/sched_ext/atropos/src/oss/atropos_sys.rs
new file mode 100644
index 000000000000..bbeaf856d40e
--- /dev/null
+++ b/tools/sched_ext/atropos/src/oss/atropos_sys.rs
@@ -0,0 +1,10 @@
+// Copyright (c) Meta Platforms, Inc. and affiliates.
+
+// This software may be used and distributed according to the terms of the
+// GNU General Public License version 2.
+#![allow(non_upper_case_globals)]
+#![allow(non_camel_case_types)]
+#![allow(non_snake_case)]
+#![allow(dead_code)]
+
+include!(concat!(env!("OUT_DIR"), "/atropos-sys.rs"));
diff --git a/tools/sched_ext/atropos/src/oss/mod.rs b/tools/sched_ext/atropos/src/oss/mod.rs
new file mode 100644
index 000000000000..5afcf35f777d
--- /dev/null
+++ b/tools/sched_ext/atropos/src/oss/mod.rs
@@ -0,0 +1,29 @@
+// Copyright (c) Meta Platforms, Inc. and affiliates.
+
+// This software may be used and distributed according to the terms of the
+// GNU General Public License version 2.
+#[path = "../bpf/.output/atropos.skel.rs"]
+mod atropos;
+use std::str::FromStr;
+
+use anyhow::bail;
+pub use atropos::*;
+use slog::o;
+use slog::Drain;
+use slog::Level;
+
+pub mod atropos_sys;
+
+pub fn setup_logger(level: &str) -> anyhow::Result<slog::Logger> {
+ let log_level = match Level::from_str(level) {
+ Ok(l) => l,
+ Err(()) => bail!("Failed to parse \"{}\" as a log level", level),
+ };
+ let decorator = slog_term::TermDecorator::new().build();
+ let drain = slog_term::FullFormat::new(decorator).build().fuse();
+ let drain = slog_async::Async::new(drain)
+ .build()
+ .filter_level(log_level)
+ .fuse();
+ Ok(slog::Logger::root(drain, o!()))
+}
diff --git a/tools/sched_ext/atropos/src/util.rs b/tools/sched_ext/atropos/src/util.rs
new file mode 100644
index 000000000000..eae414c0919a
--- /dev/null
+++ b/tools/sched_ext/atropos/src/util.rs
@@ -0,0 +1,24 @@
+// Copyright (c) Meta Platforms, Inc. and affiliates.
+
+// This software may be used and distributed according to the terms of the
+// GNU General Public License version 2.
+
+// Shim between facebook types and open source types.
+//
+// The type interfaces and module hierarchy should be identical on
+// both "branches". And since we glob import, all the submodules in
+// this crate will inherit our name bindings and can use generic paths,
+// eg `crate::logging::setup(..)`.
+#[macro_export]
+macro_rules! oss_shim {
+ () => {
+ #[cfg(fbcode_build)]
+ mod facebook;
+ #[cfg(fbcode_build)]
+ use facebook::*;
+ #[cfg(not(fbcode_build))]
+ mod oss;
+ #[cfg(not(fbcode_build))]
+ use oss::*;
+ };
+}
--
2.39.1


2023-01-28 17:25:08

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 08/30] sched: Expose css_tg(), __setscheduler_prio() and SCHED_CHANGE_BLOCK()

Hi Tejun,

I love your patch! Perhaps something to improve:

[auto build test WARNING on linux/master]
[also build test WARNING on linus/master v6.2-rc5 next-20230127]
[cannot apply to tip/sched/core tj-cgroup/for-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Tejun-Heo/sched-Encapsulate-task-attribute-change-sequence-into-a-helper-macro/20230128-123001
patch link: https://lore.kernel.org/r/20230128001639.3510083-9-tj%40kernel.org
patch subject: [PATCH 08/30] sched: Expose css_tg(), __setscheduler_prio() and SCHED_CHANGE_BLOCK()
config: arc-defconfig (https://download.01.org/0day-ci/archive/20230129/[email protected]/config)
compiler: arc-elf-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/49fe6fbf20db85f7bd19f55cdcd7e957ae0f0983
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Tejun-Heo/sched-Encapsulate-task-attribute-change-sequence-into-a-helper-macro/20230128-123001
git checkout 49fe6fbf20db85f7bd19f55cdcd7e957ae0f0983
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=arc olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=arc SHELL=/bin/bash kernel/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>

All warnings (new ones prefixed by >>):

>> kernel/sched/core.c:6967:6: warning: no previous prototype for '__setscheduler_prio' [-Wmissing-prototypes]
6967 | void __setscheduler_prio(struct task_struct *p, int prio)
| ^~~~~~~~~~~~~~~~~~~


vim +/__setscheduler_prio +6967 kernel/sched/core.c

6966
> 6967 void __setscheduler_prio(struct task_struct *p, int prio)
6968 {
6969 if (dl_prio(prio))
6970 p->sched_class = &dl_sched_class;
6971 else if (rt_prio(prio))
6972 p->sched_class = &rt_sched_class;
6973 else
6974 p->sched_class = &fair_sched_class;
6975
6976 p->prio = prio;
6977 }
6978

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

2023-01-28 18:06:10

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 23/30] sched_ext: Add cgroup support

Hi Tejun,

I love your patch! Perhaps something to improve:

[auto build test WARNING on linux/master]
[also build test WARNING on linus/master v6.2-rc5]
[cannot apply to tip/sched/core tj-cgroup/for-next next-20230127]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Tejun-Heo/sched-Encapsulate-task-attribute-change-sequence-into-a-helper-macro/20230128-123001
patch link: https://lore.kernel.org/r/20230128001639.3510083-24-tj%40kernel.org
patch subject: [PATCH 23/30] sched_ext: Add cgroup support
config: arc-defconfig (https://download.01.org/0day-ci/archive/20230129/[email protected]/config)
compiler: arc-elf-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/6f861617cb846291559a9d0bfe60a9c1ca20f406
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Tejun-Heo/sched-Encapsulate-task-attribute-change-sequence-into-a-helper-macro/20230128-123001
git checkout 6f861617cb846291559a9d0bfe60a9c1ca20f406
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=arc olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=arc SHELL=/bin/bash kernel/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>

All warnings (new ones prefixed by >>):

In file included from kernel/sched/sched.h:3404,
from kernel/sched/core.c:85:
>> kernel/sched/ext.h:182:48: warning: 'struct cgroup_taskset' declared inside parameter list will not be visible outside of this definition or declaration
182 | static inline int scx_cgroup_can_attach(struct cgroup_taskset *tset) { return 0; }
| ^~~~~~~~~~~~~~
kernel/sched/ext.h:185:52: warning: 'struct cgroup_taskset' declared inside parameter list will not be visible outside of this definition or declaration
185 | static inline void scx_cgroup_cancel_attach(struct cgroup_taskset *tset) {}
| ^~~~~~~~~~~~~~
kernel/sched/core.c:7006:6: warning: no previous prototype for '__setscheduler_prio' [-Wmissing-prototypes]
7006 | void __setscheduler_prio(struct task_struct *p, int prio)
| ^~~~~~~~~~~~~~~~~~~
--
In file included from kernel/sched/sched.h:3404,
from kernel/sched/fair.c:55:
>> kernel/sched/ext.h:182:48: warning: 'struct cgroup_taskset' declared inside parameter list will not be visible outside of this definition or declaration
182 | static inline int scx_cgroup_can_attach(struct cgroup_taskset *tset) { return 0; }
| ^~~~~~~~~~~~~~
kernel/sched/ext.h:185:52: warning: 'struct cgroup_taskset' declared inside parameter list will not be visible outside of this definition or declaration
185 | static inline void scx_cgroup_cancel_attach(struct cgroup_taskset *tset) {}
| ^~~~~~~~~~~~~~
kernel/sched/fair.c:5937:6: warning: no previous prototype for 'init_cfs_bandwidth' [-Wmissing-prototypes]
5937 | void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
| ^~~~~~~~~~~~~~~~~~
kernel/sched/fair.c:12361:6: warning: no previous prototype for 'free_fair_sched_group' [-Wmissing-prototypes]
12361 | void free_fair_sched_group(struct task_group *tg) { }
| ^~~~~~~~~~~~~~~~~~~~~
kernel/sched/fair.c:12363:5: warning: no previous prototype for 'alloc_fair_sched_group' [-Wmissing-prototypes]
12363 | int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
| ^~~~~~~~~~~~~~~~~~~~~~
kernel/sched/fair.c:12368:6: warning: no previous prototype for 'online_fair_sched_group' [-Wmissing-prototypes]
12368 | void online_fair_sched_group(struct task_group *tg) { }
| ^~~~~~~~~~~~~~~~~~~~~~~
kernel/sched/fair.c:12370:6: warning: no previous prototype for 'unregister_fair_sched_group' [-Wmissing-prototypes]
12370 | void unregister_fair_sched_group(struct task_group *tg) { }
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~
--
In file included from kernel/sched/sched.h:3404,
from kernel/sched/build_policy.c:36:
>> kernel/sched/ext.h:182:48: warning: 'struct cgroup_taskset' declared inside parameter list will not be visible outside of this definition or declaration
182 | static inline int scx_cgroup_can_attach(struct cgroup_taskset *tset) { return 0; }
| ^~~~~~~~~~~~~~
kernel/sched/ext.h:185:52: warning: 'struct cgroup_taskset' declared inside parameter list will not be visible outside of this definition or declaration
185 | static inline void scx_cgroup_cancel_attach(struct cgroup_taskset *tset) {}
| ^~~~~~~~~~~~~~


vim +182 kernel/sched/ext.h

170
171 #ifdef CONFIG_EXT_GROUP_SCHED
172 int scx_tg_online(struct task_group *tg);
173 void scx_tg_offline(struct task_group *tg);
174 int scx_cgroup_can_attach(struct cgroup_taskset *tset);
175 void scx_move_task(struct task_struct *p);
176 void scx_cgroup_finish_attach(void);
177 void scx_cgroup_cancel_attach(struct cgroup_taskset *tset);
178 void scx_group_set_weight(struct task_group *tg, unsigned long cgrp_weight);
179 #else /* CONFIG_EXT_GROUP_SCHED */
180 static inline int scx_tg_online(struct task_group *tg) { return 0; }
181 static inline void scx_tg_offline(struct task_group *tg) {}
> 182 static inline int scx_cgroup_can_attach(struct cgroup_taskset *tset) { return 0; }

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

2023-01-28 19:09:03

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 27/30] sched_ext: Implement core-sched support

Hi Tejun,

I love your patch! Yet something to improve:

[auto build test ERROR on linux/master]
[also build test ERROR on linus/master v6.2-rc5]
[cannot apply to tip/sched/core tj-cgroup/for-next next-20230127]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Tejun-Heo/sched-Encapsulate-task-attribute-change-sequence-into-a-helper-macro/20230128-123001
patch link: https://lore.kernel.org/r/20230128001639.3510083-28-tj%40kernel.org
patch subject: [PATCH 27/30] sched_ext: Implement core-sched support
config: s390-allmodconfig (https://download.01.org/0day-ci/archive/20230129/[email protected]/config)
compiler: s390-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/750f973dd349cc5c3df29319b0ffae740738a9d2
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Tejun-Heo/sched-Encapsulate-task-attribute-change-sequence-into-a-helper-macro/20230128-123001
git checkout 750f973dd349cc5c3df29319b0ffae740738a9d2
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=s390 olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=s390 SHELL=/bin/bash kernel/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>

All error/warnings (new ones prefixed by >>):

In file included from kernel/sched/build_policy.c:58:
>> kernel/sched/ext.c:3349:25: error: initialization of 'int (*)(const struct btf_type *, const struct btf_member *)' from incompatible pointer type 'int (*)(const struct btf_type *, const struct btf_member *, const struct bpf_prog *)' [-Werror=incompatible-pointer-types]
3349 | .check_member = bpf_scx_check_member,
| ^~~~~~~~~~~~~~~~~~~~
kernel/sched/ext.c:3349:25: note: (near initialization for 'bpf_sched_ext_ops.check_member')
kernel/sched/ext.c: In function 'scx_bpf_error_bstr':
>> kernel/sched/ext.c:3906:16: error: variable 'bprintf_data' has initializer but incomplete type
3906 | struct bpf_bprintf_data bprintf_data = { .get_bin_args = true };
| ^~~~~~~~~~~~~~~~
>> kernel/sched/ext.c:3906:51: error: 'struct bpf_bprintf_data' has no member named 'get_bin_args'
3906 | struct bpf_bprintf_data bprintf_data = { .get_bin_args = true };
| ^~~~~~~~~~~~
>> kernel/sched/ext.c:3906:66: warning: excess elements in struct initializer
3906 | struct bpf_bprintf_data bprintf_data = { .get_bin_args = true };
| ^~~~
kernel/sched/ext.c:3906:66: note: (near initialization for 'bprintf_data')
>> kernel/sched/ext.c:3906:33: error: storage size of 'bprintf_data' isn't known
3906 | struct bpf_bprintf_data bprintf_data = { .get_bin_args = true };
| ^~~~~~~~~~~~
>> kernel/sched/ext.c:3927:71: warning: passing argument 4 of 'bpf_bprintf_prepare' makes pointer from integer without a cast [-Wint-conversion]
3927 | ret = bpf_bprintf_prepare(fmt, UINT_MAX, bufs->data, data__sz / 8,
| ~~~~~~~~~^~~
| |
| u32 {aka unsigned int}
In file included from include/linux/bpf_verifier.h:7,
from kernel/sched/ext.c:3193:
include/linux/bpf.h:2800:31: note: expected 'u32 **' {aka 'unsigned int **'} but argument is of type 'u32' {aka 'unsigned int'}
2800 | u32 **bin_buf, u32 num_args);
| ~~~~~~^~~~~~~
>> kernel/sched/ext.c:3936:9: error: too many arguments to function 'bpf_bprintf_cleanup'
3936 | bpf_bprintf_cleanup(&bprintf_data);
| ^~~~~~~~~~~~~~~~~~~
include/linux/bpf.h:2801:6: note: declared here
2801 | void bpf_bprintf_cleanup(void);
| ^~~~~~~~~~~~~~~~~~~
kernel/sched/ext.c:3906:33: warning: unused variable 'bprintf_data' [-Wunused-variable]
3906 | struct bpf_bprintf_data bprintf_data = { .get_bin_args = true };
| ^~~~~~~~~~~~
cc1: some warnings being treated as errors


vim +3349 kernel/sched/ext.c

4c016c2bafb66b Tejun Heo 2023-01-27 3344
4c016c2bafb66b Tejun Heo 2023-01-27 3345 struct bpf_struct_ops bpf_sched_ext_ops = {
4c016c2bafb66b Tejun Heo 2023-01-27 3346 .verifier_ops = &bpf_scx_verifier_ops,
4c016c2bafb66b Tejun Heo 2023-01-27 3347 .reg = bpf_scx_reg,
4c016c2bafb66b Tejun Heo 2023-01-27 3348 .unreg = bpf_scx_unreg,
4c016c2bafb66b Tejun Heo 2023-01-27 @3349 .check_member = bpf_scx_check_member,
4c016c2bafb66b Tejun Heo 2023-01-27 3350 .init_member = bpf_scx_init_member,
4c016c2bafb66b Tejun Heo 2023-01-27 3351 .init = bpf_scx_init,
4c016c2bafb66b Tejun Heo 2023-01-27 3352 .name = "sched_ext_ops",
4c016c2bafb66b Tejun Heo 2023-01-27 3353 };
4c016c2bafb66b Tejun Heo 2023-01-27 3354

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

2023-01-30 21:39:02

by Josh Don

[permalink] [raw]
Subject: Re: [PATCH 27/30] sched_ext: Implement core-sched support

Hi Tejun,

On Fri, Jan 27, 2023 at 4:17 PM Tejun Heo <[email protected]> wrote:
>
> The core-sched support is composed of the following parts:

Thanks, this looks pretty reasonable overall.

One meta comment is that I think we can shortcircuit from
touch_core_sched when we have sched_core_disabled().

Reviewed-by: Josh Don <[email protected]>

> + /*
> + * While core-scheduling, rq lock is shared among
> + * siblings but the debug annotations and rq clock
> + * aren't. Do pinning dance to transfer the ownership.
> + */
> + WARN_ON_ONCE(__rq_lockp(rq) != __rq_lockp(srq));
> + rq_unpin_lock(rq, rf);
> + rq_pin_lock(srq, &srf);
> +
> + update_rq_clock(srq);

Unfortunate that we have to do this superfluous update; maybe we can
save/restore the clock flags from before the pinning shenanigans?

> +static struct task_struct *pick_task_scx(struct rq *rq)
> +{
> + struct task_struct *curr = rq->curr;
> + struct task_struct *first = first_local_task(rq);
> +
> + if (curr->scx.flags & SCX_TASK_QUEUED) {
> + /* is curr the only runnable task? */
> + if (!first)
> + return curr;
> +
> + /*
> + * Does curr trump first? We can always go by core_sched_at for
> + * this comparison as it represents global FIFO ordering when
> + * the default core-sched ordering is in used and local-DSQ FIFO
> + * ordering otherwise.
> + */
> + if (curr->scx.slice && time_before64(curr->scx.core_sched_at,
> + first->scx.core_sched_at))
> + return curr;

So is this to handle the case where we have something running on 'rq'
to match the cookie of our sibling (which had priority), but now we
want to switch to running the first thing in the local queue, which
has a different cookie (and is now the highest priority entity)? Maybe
being slightly more specific in the comment would help :)

2023-01-30 23:41:15

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 23/30] sched_ext: Add cgroup support

On Sun, Jan 29, 2023 at 02:05:35AM +0800, kernel test robot wrote:
> Hi Tejun,
>
> I love your patch! Perhaps something to improve:

Fixed the build bugs from the two lkp reports. Will include the right base
tag from the next posting.

Thanks.

--
tejun

2023-01-31 00:26:27

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 27/30] sched_ext: Implement core-sched support

Hello,

On Mon, Jan 30, 2023 at 01:38:15PM -0800, Josh Don wrote:
> > The core-sched support is composed of the following parts:
>
> Thanks, this looks pretty reasonable overall.
>
> One meta comment is that I think we can shortcircuit from
> touch_core_sched when we have sched_core_disabled().

Yeah, touch_core_sched() is really cheap (it's just an assignment from an rq
field to a task field) but sched_core_disabled() is also just a static
branch. Will update.

> Reviewed-by: Josh Don <[email protected]>
>
> > + /*
> > + * While core-scheduling, rq lock is shared among
> > + * siblings but the debug annotations and rq clock
> > + * aren't. Do pinning dance to transfer the ownership.
> > + */
> > + WARN_ON_ONCE(__rq_lockp(rq) != __rq_lockp(srq));
> > + rq_unpin_lock(rq, rf);
> > + rq_pin_lock(srq, &srf);
> > +
> > + update_rq_clock(srq);
>
> Unfortunate that we have to do this superfluous update; maybe we can
> save/restore the clock flags from before the pinning shenanigans?

So, this one isn't really superflous. There are two rq's involved - self and
sibling. self's rq clock is saved and restored through rq_unpin_lock() and
rq_repin_lock(). We're transferring the lock owner ship from self to sibling
without actually unlocking and relocking the lock as they should be sharing
the same lock; however, that doesn't mean that the two queues share rq
clocks, so the sibling needs to update its rq clock upon getting the lock
transferred to it. It might make sense to make the siblings share the rq
clock when core-sched is enabled but that's probably for some other time.

> > +static struct task_struct *pick_task_scx(struct rq *rq)
> > +{
> > + struct task_struct *curr = rq->curr;
> > + struct task_struct *first = first_local_task(rq);
> > +
> > + if (curr->scx.flags & SCX_TASK_QUEUED) {
> > + /* is curr the only runnable task? */
> > + if (!first)
> > + return curr;
> > +
> > + /*
> > + * Does curr trump first? We can always go by core_sched_at for
> > + * this comparison as it represents global FIFO ordering when
> > + * the default core-sched ordering is in used and local-DSQ FIFO
> > + * ordering otherwise.
> > + */
> > + if (curr->scx.slice && time_before64(curr->scx.core_sched_at,
> > + first->scx.core_sched_at))
> > + return curr;
>
> So is this to handle the case where we have something running on 'rq'
> to match the cookie of our sibling (which had priority), but now we
> want to switch to running the first thing in the local queue, which
> has a different cookie (and is now the highest priority entity)? Maybe
> being slightly more specific in the comment would help :)

pick_task_scx() is to pick the next best candidate for the rq. The
candidates then compete to determine the next cookie. Here, as long as only
one task gets dispatched at any given time, the condition check shouldn't
really trigger - ie. if curr has slice remaining, balance_one() wouldn't
have populated the local DSQ anyway and first would be NULL.

However, the BPF scheduler is free to dispatch whatever tasks anytime (e.g.
scx_example_central), so it's possible that a task with an earlier timestamp
has been dispatched to the local DSQ since curr started executing, in which
case we likely want to return the first on DSQ as the CPU's candidate.

IOW, the time_before() is there mostly to cover unusual cases. Most should
either trigger !first before or fail the curr->scx.slice test. Will update
the comment to clarify.

Thanks.

--
tejun

2023-01-31 00:36:35

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 27/30] sched_ext: Implement core-sched support

On Mon, Jan 30, 2023 at 02:26:20PM -1000, Tejun Heo wrote:
> However, the BPF scheduler is free to dispatch whatever tasks anytime (e.g.
> scx_example_central), so it's possible that a task with an earlier timestamp
> has been dispatched to the local DSQ since curr started executing, in which
> case we likely want to return the first on DSQ as the CPU's candidate.

Okay, a more common case would be when a CPU is forced to run a task which
isn't current by its sibling winning a different cookie and then the BPF
scheduler putting that task right back on the local DSQ. For the CPU then,
the right candidate would be the first task on DSQ not the current running
one which is dragged forward because the sibling trumping us.

Thanks.

--
tejun

2023-01-31 01:46:13

by Josh Don

[permalink] [raw]
Subject: Re: [PATCH 27/30] sched_ext: Implement core-sched support

On Mon, Jan 30, 2023 at 4:26 PM Tejun Heo <[email protected]> wrote:
>
> Hello,
>
> On Mon, Jan 30, 2023 at 01:38:15PM -0800, Josh Don wrote:
> > > The core-sched support is composed of the following parts:
> >
> > Thanks, this looks pretty reasonable overall.
> >
> > One meta comment is that I think we can shortcircuit from
> > touch_core_sched when we have sched_core_disabled().
>
> Yeah, touch_core_sched() is really cheap (it's just an assignment from an rq
> field to a task field) but sched_core_disabled() is also just a static
> branch. Will update.

Yep, true, I was just going through and reasoning about whether
anything needed to be done in the !sched_core_disabled() case.

> > Reviewed-by: Josh Don <[email protected]>
> >
> > > + /*
> > > + * While core-scheduling, rq lock is shared among
> > > + * siblings but the debug annotations and rq clock
> > > + * aren't. Do pinning dance to transfer the ownership.
> > > + */
> > > + WARN_ON_ONCE(__rq_lockp(rq) != __rq_lockp(srq));
> > > + rq_unpin_lock(rq, rf);
> > > + rq_pin_lock(srq, &srf);
> > > +
> > > + update_rq_clock(srq);
> >
> > Unfortunate that we have to do this superfluous update; maybe we can
> > save/restore the clock flags from before the pinning shenanigans?
>
> So, this one isn't really superflous. There are two rq's involved - self and
> sibling. self's rq clock is saved and restored through rq_unpin_lock() and
> rq_repin_lock(). We're transferring the lock owner ship from self to sibling
> without actually unlocking and relocking the lock as they should be sharing
> the same lock; however, that doesn't mean that the two queues share rq
> clocks, so the sibling needs to update its rq clock upon getting the lock
> transferred to it. It might make sense to make the siblings share the rq
> clock when core-sched is enabled but that's probably for some other time.

Yep, whoops, I forgot that part didn't make it.

2023-02-08 21:55:50

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET v2] sched: Implement BPF extensible scheduler class

Hello,

Ping. Peter, do you have any comments on the updated parts? Also, I'd really
appreciate if other scheduler folks can share their thoughts.

Thank you so much.

--
tejun