Hi all,
Based on our investigation of the MIPS scheduler context switch, we
wish to share our findings about the kernel scheduler in the form of
kernel documentation. The investigation was done mostly by stepping
through the code with GDB and by code inspection. The aim of the
patch set is to provide a brief overview of the kernel scheduler,
starting with a short history, an overview of the kernel structs used
by the scheduler, scheduler invocation, and context switching. We
have also added a small section on scheduler state modelling
possibilities. To accommodate these subjects we have restructured the
existing scheduler documentation into suitable sections. We hope the
new structure will make the scheduler documentation easy to extend.
Patch 1 creates placeholders and a new structure for the scheduler documentation.
The main sections are:
- Scheduler overview: Overview of the scheduler.
- CFS: A section dedicated to the CFS scheduler.
- Process context switching: Context switching overview.
- Scheduler features: Most of the existing documentation is moved here.
- Architecture Specific Scheduler Implementation Differences: A section for
  each architecture, intended for future updates.
- Scheduler Debugging Interface: For scheduler diagnostics and utilities.
- Scheduler related functions: Scheduler API reference.
Patch 2 adds documentation for the placeholders of the Scheduler overview,
Scheduler State Transition and CFS sections.
Patch 3 adds documentation for the placeholder of the Process context
switching section and adds two new sections for the x86 and MIPS context
switch.
John Mathew (3):
docs: scheduler: Restructure scheduler documentation.
docs: scheduler: Add scheduler overview documentation
docs: scheduler: Add introduction to scheduler context-switch
Documentation/scheduler/arch-specific.rst | 14 +
Documentation/scheduler/cfs-data-structs.rst | 208 ++++++++++++++
Documentation/scheduler/cfs-overview.rst | 46 ++++
.../scheduler/cfs-sched-overview.rst | 17 ++
Documentation/scheduler/context-switching.rst | 71 +++++
Documentation/scheduler/index.rst | 31 ++-
.../scheduler/mips-context-switch.rst | 78 ++++++
Documentation/scheduler/overview.rst | 260 ++++++++++++++++++
Documentation/scheduler/sched-debugging.rst | 14 +
Documentation/scheduler/sched-features.rst | 20 ++
Documentation/scheduler/scheduler-api.rst | 34 +++
.../scheduler/x86-context-switch.rst | 59 ++++
kernel/sched/core.c | 36 ++-
13 files changed, 867 insertions(+), 21 deletions(-)
create mode 100644 Documentation/scheduler/arch-specific.rst
create mode 100644 Documentation/scheduler/cfs-data-structs.rst
create mode 100644 Documentation/scheduler/cfs-overview.rst
create mode 100644 Documentation/scheduler/cfs-sched-overview.rst
create mode 100644 Documentation/scheduler/context-switching.rst
create mode 100644 Documentation/scheduler/mips-context-switch.rst
create mode 100644 Documentation/scheduler/overview.rst
create mode 100644 Documentation/scheduler/sched-debugging.rst
create mode 100644 Documentation/scheduler/sched-features.rst
create mode 100644 Documentation/scheduler/scheduler-api.rst
create mode 100644 Documentation/scheduler/x86-context-switch.rst
--
2.17.1
Add new sections to enable addition of new documentation on
the scheduler. Existing documentation is moved under the related
new sections. The sections are:
- overview
- cfs-sched-overview
- sched-features
- arch-specific
- sched-debugging
Suggested-by: Lukas Bulwahn <[email protected]>
Signed-off-by: John Mathew <[email protected]>
---
Documentation/scheduler/arch-specific.rst | 11 +++++++
.../scheduler/cfs-sched-overview.rst | 15 ++++++++++
Documentation/scheduler/index.rst | 29 ++++++++++---------
Documentation/scheduler/overview.rst | 5 ++++
Documentation/scheduler/sched-debugging.rst | 14 +++++++++
Documentation/scheduler/sched-features.rst | 20 +++++++++++++
6 files changed, 81 insertions(+), 13 deletions(-)
create mode 100644 Documentation/scheduler/arch-specific.rst
create mode 100644 Documentation/scheduler/cfs-sched-overview.rst
create mode 100644 Documentation/scheduler/overview.rst
create mode 100644 Documentation/scheduler/sched-debugging.rst
create mode 100644 Documentation/scheduler/sched-features.rst
diff --git a/Documentation/scheduler/arch-specific.rst b/Documentation/scheduler/arch-specific.rst
new file mode 100644
index 000000000000..c9c34863d994
--- /dev/null
+++ b/Documentation/scheduler/arch-specific.rst
@@ -0,0 +1,11 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+Architecture Specific Scheduler Implementation Differences
+==========================================================
+
+.. class:: toc-title
+
+ Table of contents
+
+.. toctree::
+ :maxdepth: 2
diff --git a/Documentation/scheduler/cfs-sched-overview.rst b/Documentation/scheduler/cfs-sched-overview.rst
new file mode 100644
index 000000000000..44dac92d9462
--- /dev/null
+++ b/Documentation/scheduler/cfs-sched-overview.rst
@@ -0,0 +1,15 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+CFS
+====
+
+This documentation describes the Linux CFS scheduler.
+
+.. class:: toc-title
+
+ Table of contents
+
+.. toctree::
+ :maxdepth: 2
+
+ sched-design-CFS
diff --git a/Documentation/scheduler/index.rst b/Documentation/scheduler/index.rst
index 69074e5de9c4..83c718d05b9d 100644
--- a/Documentation/scheduler/index.rst
+++ b/Documentation/scheduler/index.rst
@@ -1,23 +1,26 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
===============
Linux Scheduler
===============
-.. toctree::
- :maxdepth: 1
+This documentation outlines the Linux kernel scheduler: its concepts, the
+scheduler design and its data structures, and architecture-specific
+implementation differences.
+
+.. class:: toc-title
- completion
- sched-arch
- sched-bwc
- sched-deadline
- sched-design-CFS
- sched-domains
- sched-energy
- sched-nice-design
- sched-rt-group
- sched-stats
+ Table of contents
+
+.. toctree::
+ :maxdepth: 2
- text_files
+ overview
+ cfs-sched-overview
+ sched-features
+   arch-specific
+   sched-debugging
.. only:: subproject and html
diff --git a/Documentation/scheduler/overview.rst b/Documentation/scheduler/overview.rst
new file mode 100644
index 000000000000..aee16feefc61
--- /dev/null
+++ b/Documentation/scheduler/overview.rst
@@ -0,0 +1,5 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+====================
+Scheduler overview
+====================
diff --git a/Documentation/scheduler/sched-debugging.rst b/Documentation/scheduler/sched-debugging.rst
new file mode 100644
index 000000000000..e332069f99d6
--- /dev/null
+++ b/Documentation/scheduler/sched-debugging.rst
@@ -0,0 +1,14 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+Scheduler Debugging Interface
+==============================
+
+.. class:: toc-title
+
+ Table of contents
+
+.. toctree::
+ :maxdepth: 2
+
+ sched-stats
+ text_files
diff --git a/Documentation/scheduler/sched-features.rst b/Documentation/scheduler/sched-features.rst
new file mode 100644
index 000000000000..1afbd9cc8d52
--- /dev/null
+++ b/Documentation/scheduler/sched-features.rst
@@ -0,0 +1,20 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+Scheduler Features
+=====================
+
+.. class:: toc-title
+
+ Table of contents
+
+.. toctree::
+ :maxdepth: 2
+
+ sched-arch
+ sched-bwc
+ sched-deadline
+ sched-domains
+ sched-energy
+ sched-nice-design
+ sched-rt-group
+ completion
--
2.17.1
Add introductory documentation for
- context switch
- x86 context switch
- MIPS context switch
Suggested-by: Lukas Bulwahn <[email protected]>
Co-developed-by: Mostafa Chamanara <[email protected]>
Signed-off-by: Mostafa Chamanara <[email protected]>
Signed-off-by: John Mathew <[email protected]>
---
Documentation/scheduler/arch-specific.rst | 3 +
Documentation/scheduler/context-switching.rst | 71 +++++++++++++++++
Documentation/scheduler/index.rst | 1 +
.../scheduler/mips-context-switch.rst | 78 +++++++++++++++++++
.../scheduler/x86-context-switch.rst | 59 ++++++++++++++
5 files changed, 212 insertions(+)
create mode 100644 Documentation/scheduler/context-switching.rst
create mode 100644 Documentation/scheduler/mips-context-switch.rst
create mode 100644 Documentation/scheduler/x86-context-switch.rst
diff --git a/Documentation/scheduler/arch-specific.rst b/Documentation/scheduler/arch-specific.rst
index c9c34863d994..65dc393b605f 100644
--- a/Documentation/scheduler/arch-specific.rst
+++ b/Documentation/scheduler/arch-specific.rst
@@ -9,3 +9,6 @@ Architecture Specific Scheduler Implementation Differences
.. toctree::
:maxdepth: 2
+
+ x86-context-switch
+ mips-context-switch
diff --git a/Documentation/scheduler/context-switching.rst b/Documentation/scheduler/context-switching.rst
new file mode 100644
index 000000000000..dd4ff63b1e97
--- /dev/null
+++ b/Documentation/scheduler/context-switching.rst
@@ -0,0 +1,71 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+==========================
+Process context switching
+==========================
+
+Context Switching
+-----------------
+
+Context switching, the act of switching from one running task to another, is
+handled by the :c:func:`context_switch()` function defined in
+kernel/sched/core.c. It is called by __schedule() when a new process has been
+selected to run.
+
+The execution flow is as follows:
+
+* Calls prepare_task_switch() to prepare both the previous and the next task
+  by storing or changing some values in their task_struct.
+
+
+* Calls the :c:macro:`arch_start_context_switch()` macro, a facility that
+  batches the reload of page tables and other process state with the actual
+  context switch code for paravirtualized guests. By convention, only one of
+  the batched update (lazy) modes (CPU, MMU) should be active at any given
+  time, entry should never be nested, and entry and exit should always be
+  paired. This keeps the kernel code sane to maintain and reason about. Here
+  the exit (the end of the context switch) is in architecture-specific code,
+  so it needs no generic definition.
+
+
+* The next few steps consist of handling the transfer of real and anonymous
+  address spaces between the switching tasks. The four possible context
+  switches are:
+
+  - a kernel task switching to another kernel task,
+  - a user task switching to a kernel task,
+  - a kernel task switching to a user task,
+  - a user task switching to a user task.
+
+For a kernel task switching to a kernel task, :c:func:`enter_lazy_tlb()` is
+called; this architecture-specific function handles a context without an mm,
+and architectures implement lazy tricks here to minimize TLB flushes. The
+active address space of the previous task is then borrowed (transferred) by
+the next task, and the active address space of the previous task is set to
+NULL.
+
+For a user task switching to a kernel task, the user task has a real address
+space. This address space is pinned by calling :c:func:`mmgrab()`, which
+makes sure that it will not be freed even after the previous task exits.
+
+For a user task switching to a user task, the architecture-specific
+:c:func:`switch_mm_irqs_off()` or :c:func:`switch_mm()` function is called.
+Its main job is to switch the address space between the user space processes,
+which includes switching the page table pointers and ensuring that the new
+address space has a valid ASID.
+
+For a kernel task switching to a user task, the active address space of the
+kernel task is transferred to the user task and the active address space of the
+kernel task is set to NULL.
+
+* Next the :c:func:`prepare_lock_switch()` function is called for a lockdep
+ release of the runqueue lock to handle the special case of the scheduler in which
+ the runqueue lock will be released by the next task.
+
+* Then the architecture specific implementation of the :c:func:`switch_to()`
+ function is called to switch the register state and the stack. This involves
+ saving and restoring stack information and the processor registers and any other
+ architecture-specific state that must be managed and restored on a per-process
+ basis.
+
+* Calls finish_task_switch(), which must be paired with the
+  prepare_task_switch() call made before the context switch. It reconciles
+  the locking set up by prepare_task_switch() and performs any other
+  architecture-specific cleanup actions.
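The four address-space handoff cases above can be sketched in plain C. This is a hedged illustration, not kernel code: the struct fields mirror task_struct's ``mm``/``active_mm``, while ``switch_mm_handoff()`` and the ``users`` counter are hypothetical stand-ins for the real handoff logic and mmgrab()'s reference count.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structs. */
struct mm_struct { int users; };
struct task_struct {
    struct mm_struct *mm;        /* real address space, NULL for kernel tasks */
    struct mm_struct *active_mm; /* address space actually in use */
};

/* Sketch of the four address-space handoff cases listed above. */
static void switch_mm_handoff(struct task_struct *prev, struct task_struct *next)
{
    if (!next->mm) {
        /* next is a kernel task: borrow prev's active address space */
        next->active_mm = prev->active_mm;
        if (prev->mm)
            prev->mm->users++;   /* user -> kernel: pin, as mmgrab() does */
    } else {
        /* next is a user task: switch to its own mm (switch_mm()) */
        next->active_mm = next->mm;
    }
    if (!prev->mm)
        prev->active_mm = NULL;  /* kernel task drops the borrowed mm */
}
```

Walking a user task into a kernel task and back shows the mm being borrowed, pinned, and returned.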
diff --git a/Documentation/scheduler/index.rst b/Documentation/scheduler/index.rst
index 9772cf81fd96..289e06a68764 100644
--- a/Documentation/scheduler/index.rst
+++ b/Documentation/scheduler/index.rst
@@ -18,6 +18,7 @@ specific implementation differences.
overview
cfs-sched-overview
+ context-switching
sched-features
   arch-specific
   sched-debugging
diff --git a/Documentation/scheduler/mips-context-switch.rst b/Documentation/scheduler/mips-context-switch.rst
new file mode 100644
index 000000000000..e917bbe1c104
--- /dev/null
+++ b/Documentation/scheduler/mips-context-switch.rst
@@ -0,0 +1,78 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+==============================================
+MIPS Architecture And Scheduler Implementation
+==============================================
+
+Multi-threading in MIPS CPUs
+-----------------------------
+The MIPS architecture defines four coprocessors.
+
+- CP0: supports virtual memory system and exception handling.
+- CP1: reserved for the floating point coprocessor, the FPU.
+- CP2: available for specific implementations.
+- CP3: reserved for floating point operations in the release 1 implementation
+ of MIPS64.
+
+MIPS32 and MIPS64 architectures provide support for optional components known
+as Modules or Application Specific Extensions. The MT module enables the
+architecture to support multi-threaded implementations, including support for
+virtual processors and lightweight thread contexts. Implementation of MT
+features depends on the individual MIPS cores. A virtual processing element
+(VPE) maintains a complete copy of the processor state as seen by the
+software system, including interrupts, register set, and MMU. This enables a
+single processor with two VPEs to appear to an SMP operating system as two
+separate cores. For example, two separate operating systems, such as Linux
+and an RTOS, can run on the two VPEs.
+
+A lighter version of the VPE, the Thread Context (TC), enables threading at
+the user/application software level. A TC is the hardware state necessary to
+support a thread of execution, including a set of general purpose registers
+(GPRs), a program counter (PC), and some multiplier and coprocessor state.
+TCs share a common execution unit, and the MIPS ISA provides instructions to
+utilize them.
+
+The Quality of Service (QoS) block of the MT module allows the allocation of
+processor cycles to threads and sets relative thread priorities. This enables
+two thread prioritization mechanisms: the user can prioritize one thread over
+another, and can also allocate a specific ratio of the cycles to specific
+threads. These mechanisms help to allocate bandwidth to a set of threads
+effectively. The QoS block improves system-level determinism and
+predictability, and can be replaced by more application-specific blocks.
+
+MIPS Context Switch
+-------------------
+
+Context switch behavior specific to MIPS begins with the way the
+:c:macro:`switch_to()` macro is implemented. The main steps in the MIPS
+implementation of the macro are:
+
+* Handle the FPU affinity management feature, enabled at build time by
+  :c:macro:`CONFIG_MIPS_MT_FPAFF`. The macro checks whether the FPU was used
+  in the most recent time slice; if it was not, the restriction of having to
+  run on a CPU with an FPU is removed.
+* For the previous task, disable the FPU and clear the bit indicating that
+  the FPU was used in this quantum for the task.
+* If the FPU is enabled for the next task, check the FCSR for any pending
+  unmasked exceptions, clear them, and send a signal.
+* If the MIPS DSP module is enabled, save the DSP context of the previous
+  task and restore the DSP context of the next task.
+* If coprocessor 2 is present, set its access allowed field for the next
+  task.
+* If the coprocessor 2 access allowed field was set for the previous task,
+  clear it.
+* Clear the LLbit on MIPS Release 6 so that the eretnc instruction can be
+  used unconditionally when returning to userland in entry.S. The LLbit is
+  used by instructions that provide an atomic read-modify-write: it is set
+  when a linked load occurs and is tested by the conditional store. It is
+  cleared when, during other CPU operation, a store to the location would no
+  longer be atomic; in particular, it is cleared by exception return
+  instructions. The eretnc instruction makes it possible to return from an
+  interrupt, exception, or error trap without clearing the LLbit.
+* Clear the global variable ll_bit used by the MIPS exception handler.
+* Write the thread pointer to the MIPS UserLocal register if the CPU
+  supports this feature. This register is not interpreted by hardware and
+  can be used to share data between privileged and unprivileged software.
+* If the hardware watchpoint feature is enabled at build time, restore the
+  watchpoint registers of the next task.
+* Finally, the MIPS processor specific implementation of the
+  :c:func:`resume()` function is called. It restores the registers of the
+  next task, including the stack pointer. The implementation is in
+  assembly::
+
+ arch/mips/kernel/r4k_switch.S
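The LLbit semantics described above can be modelled in a few lines of C. This is a sketch under stated assumptions: ``llbit_cpu``, ``ll()``, ``sc()`` and ``other_store()`` are hypothetical stand-ins for the hardware state and the linked-load/conditional-store instructions.

```c
#include <assert.h>

/* Hypothetical model of the LLbit: set by a linked load, tested by the
 * conditional store, cleared when an intervening store breaks atomicity. */
struct llbit_cpu {
    int llbit;
    long mem;
};

static long ll(struct llbit_cpu *cpu)
{
    cpu->llbit = 1;   /* linked load sets the LLbit */
    return cpu->mem;
}

static int sc(struct llbit_cpu *cpu, long val)
{
    if (!cpu->llbit)
        return 0;     /* conditional store fails: atomicity was lost */
    cpu->mem = val;
    cpu->llbit = 0;
    return 1;
}

static void other_store(struct llbit_cpu *cpu, long val)
{
    cpu->mem = val;
    cpu->llbit = 0;   /* any intervening store clears the LLbit */
}
```

An ll/sc pair succeeds when undisturbed and fails after an intervening store, which is exactly the property that clearing the LLbit at context switch relies on.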
diff --git a/Documentation/scheduler/x86-context-switch.rst b/Documentation/scheduler/x86-context-switch.rst
new file mode 100644
index 000000000000..ae7b2e09453a
--- /dev/null
+++ b/Documentation/scheduler/x86-context-switch.rst
@@ -0,0 +1,59 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+X86 Context Switch
+------------------
+
+The x86 context switching logic is as follows. After the mm switch in the
+scheduler's :c:func:`context_switch()`, the x86 implementation of
+:c:macro:`switch_to()` is called. For the x86 architecture it is located at ::
+
+ arch/x86/include/asm/switch_to.h
+
+Since kernel 4.9, switch_to() has been broken into two parts: a
+:c:func:`prepare_switch_to()` macro and an inline assembly portion, which has
+been moved to an actual assembly file ::
+
+ arch/x86/entry/entry_64.S
+
+* There is still a C portion of the switch, reached via a jump in the middle
+  of the assembly code. Its source has been located in
+  arch/x86/kernel/process_64.c since 2.6.24.
+
+The main purpose of prepare_switch_to() is to handle the case when the stack
+uses virtual memory, which is configured at build time and is enabled in most
+modern distributions. It touches the new stack to prevent a double fault:
+switching to a stack whose top-level paging entry is not present in the
+current mm would cause a page fault, which would be promoted to a double
+fault, resulting in a panic. Probing the stack now lets vmalloc_fault fix up
+the page tables in a controlled way.
+
+The main steps of the inline assembly function __switch_to_asm() are:
+
+* Store the callee-saved registers on the old stack, which will be switched
+  away from.
+* Swap the stack pointers between the old and the new task.
+* Copy the new task's stack canary value into the CPU's per-CPU storage.
+* If the return trampoline mitigation is enabled, overwrite all entries in
+  the RSB on exiting a guest, to prevent malicious branch target predictions
+  from affecting the host kernel.
+* Restore the callee-saved registers from the new stack, in the reverse of
+  the order in which they were pushed.
+
+The main steps of the C function :c:func:`__switch_to()`, to which the
+assembly code jumps, are as follows:
+
+* Retrieve the thread (:c:type:`struct thread_struct <thread_struct>`) and
+  fpu (:c:type:`struct fpu <fpu>`) structs from the next and previous tasks.
+* Get the current CPU's TSS (:c:type:`struct tss_struct <tss_struct>`).
+* Save the current FPU state while on the old task.
+* Store the FS and GS segment registers before changing the thread local
+  storage.
+* Reload the GDT for the new task's TLS.
+* Save the ES and DS segments of the previous task and load the same from
+  the next task.
+* Load the FS and GS segment registers.
+* Update the current task of the CPU.
+* Update the top-of-stack pointer for the CPU for the entry trampoline.
+* Initialize the FPU state for the next task.
+* Set sp0 to point to the entry trampoline stack.
+* Call :c:func:`_switch_to_xtra()` to handle debug registers, I/O bitmaps
+  and speculation mitigations.
+* Write the task's CLOSid/RMID to IA32_PQR_MSR.
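The push/swap/pop ordering of __switch_to_asm() can be modelled in plain C. This is a sketch only: the real code is assembly in arch/x86/entry/entry_64.S operating on the hardware stack, while ``fake_task`` and ``switch_stacks()`` here are hypothetical.

```c
#include <assert.h>

#define NREGS 6 /* rbx, rbp, r12..r15: the x86-64 callee-saved set */

struct fake_task {
    unsigned long stack[32];
    int sp;                   /* grows downward, like the real stack */
    unsigned long regs[NREGS];
};

/* Model of the sequence above: save prev's callee-saved registers on
 * prev's stack, then restore next's registers in reverse order. The
 * stack-pointer swap in between is what actually changes tasks. */
static void switch_stacks(struct fake_task *prev, struct fake_task *next)
{
    for (int i = 0; i < NREGS; i++)
        prev->stack[--prev->sp] = prev->regs[i];   /* push */
    for (int i = NREGS - 1; i >= 0; i--)
        next->regs[i] = next->stack[next->sp++];   /* pop in reverse */
}
```

Switching away from a task and later back to it restores its callee-saved registers exactly, which is the invariant the real assembly maintains.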
--
2.17.1
Add documentation for
- scheduler overview
- scheduler state transition
- CFS overview
- CFS data structs
Add rst for scheduler APIs and modify sched/core.c
to add kernel-doc comments.
Suggested-by: Lukas Bulwahn <[email protected]>
Co-developed-by: Mostafa Chamanara <[email protected]>
Signed-off-by: Mostafa Chamanara <[email protected]>
Signed-off-by: John Mathew <[email protected]>
---
Documentation/scheduler/cfs-data-structs.rst | 208 ++++++++++++++
Documentation/scheduler/cfs-overview.rst | 46 ++++
.../scheduler/cfs-sched-overview.rst | 2 +
Documentation/scheduler/index.rst | 1 +
Documentation/scheduler/overview.rst | 255 ++++++++++++++++++
Documentation/scheduler/scheduler-api.rst | 34 +++
kernel/sched/core.c | 36 ++-
7 files changed, 574 insertions(+), 8 deletions(-)
create mode 100644 Documentation/scheduler/cfs-data-structs.rst
create mode 100644 Documentation/scheduler/cfs-overview.rst
create mode 100644 Documentation/scheduler/scheduler-api.rst
diff --git a/Documentation/scheduler/cfs-data-structs.rst b/Documentation/scheduler/cfs-data-structs.rst
new file mode 100644
index 000000000000..53b978e9676d
--- /dev/null
+++ b/Documentation/scheduler/cfs-data-structs.rst
@@ -0,0 +1,208 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+====================
+CFS Data Structures
+====================
+
+The main parts of the Linux scheduler are:
+
+**Running Queue:** This is the central data structure of process scheduling. It
+manages tasks that are in a runnable state waiting for a processor. Each CPU has
+its own run queue. The scheduler picks a task from the queue and assigns it to
+the CPU core. The main members of the :c:type:`struct rq <rq>` are:
+
+:c:member:`nr_running`
+ Total number of tasks on the runqueue.
+
+:c:member:`nr_switches`
+ Number of context switches.
+
+:c:member:`cfs`
+ Running queue structure for the CFS scheduler.
+
+:c:member:`rt`
+ Running queue structure for the RT scheduler.
+
+:c:member:`next_balance`
+ Timestamp of the next load balance check.
+
+:c:member:`curr`
+ Points to the currently running task of this running queue.
+
+:c:member:`idle`
+ Points to the idle task of this running queue.
+
+:c:member:`lock`
+ Spin lock of the running queue. task_rq_lock() and task_rq_unlock() can be
+ used to lock the running queue on which a specific task runs.
+
+:c:member:`clock`
+ Clock value of the queue, set to the current time.
+
+Each rq struct points to a cfs_rq struct, which holds the red-black tree of
+runnable entities. The main members of the :c:type:`struct cfs_rq <cfs_rq>`
+are:
+
+:c:member:`nr_running`
+ Number of runnable tasks on this queue.
+
+:c:member:`min_vruntime`
+ Smallest vruntime of the queue.
+
+:c:member:`curr`
+ Scheduling entity of the current process.
+
+:c:member:`next`
+ Scheduling entity of the next process.
+
+:c:member:`load`
+ Cumulative load_weight of the tasks, used for load balancing.
+ load_weight is the encoding of the task's priority and vruntime. The load
+ of a task is the metric indicating the number of CPUs needed to make
+ satisfactory progress on its job. The load of a task influences the time a
+ task spends on the CPU and also helps to estimate the overall CPU load,
+ which is needed for load balancing. The priority of a task is not enough
+ for the scheduler to estimate the vruntime of a process, so the priority
+ value is mapped to the capacity of a standard CPU in the array
+ :c:type:`sched_prio_to_weight[]`. The array contains mappings for the nice
+ values from -20 to 19; nice value 0 is mapped to 1024, and adjacent entries
+ differ by a factor of ~1.25, meaning that for every increment in nice value
+ the task gets ~10% less CPU time, and vice versa. The derived load_weight
+ is stored in a :c:type:`struct load_weight <load_weight>` which contains
+ both the value and its inverse; the inverse speeds up arithmetic by turning
+ divisions into multiplications. The cfs_rq stores the cumulative
+ load_weight of all the tasks in the runqueue.
+
+**Scheduling entity:** The scheduler uses scheduling entities, which contain
+sufficient information to accomplish the scheduling job for a task or a task
+group; a scheduling entity may be a group of tasks or a single task. Every
+task is associated with a sched_entity structure. CFS adds support for
+nesting of tasks and task groups. The main members of the
+:c:type:`struct sched_entity <sched_entity>` are:
+
+:c:member:`load`
+ load_weight of the scheduling entity. This is different from the cfs_rq
+ load, and is calculated differently for group and task entities. If group
+ scheduling is enabled, the sched_entity load is calculated in
+ :c:func:`calc_group_shares`; otherwise it is the maximum allowed load for
+ the task group.
+
+:c:member:`run_node`
+ Node of the CFS RB tree.
+
+:c:member:`on_rq`
+ If entity is currently on a runqueue.
+
+:c:member:`exec_start`
+ Timestamp of a task when it starts running.
+
+:c:member:`sum_exec_runtime`
+ To store the time a task has been running in total.
+
+:c:member:`vruntime`
+ vruntime of the task explained below.
+
+A few members are added when group scheduling is enabled:
+
+:c:member:`parent`
+ Parent of this scheduling entity; enables a hierarchy of scheduling
+ entities.
+
+:c:member:`cfs_rq`
+ runqueue on which this entity is to be queued.
+
+:c:member:`my_q`
+ runqueue owned by this entity/group.
+
+Each scheduling entity may be run from its parent's runqueue. The scheduler
+traverses the sched_entity hierarchy to pick the next task to run on the
+CPU. The entity gets picked from the cfs_rq on which it is queued, and its
+time slice is divided among all the tasks on its my_q.
+
+vruntime is the value by which tasks are ordered on the red-black tree.
+Tasks are arranged in increasing order of vruntime, which is the amount of
+time a task has spent running on the CPU. The vruntime of a task is updated
+periodically based on the :c:func:`scheduler_tick` function.
+scheduler_tick() calls the task_tick() hook of CFS. This hook calls
+:c:func:`task_tick_fair`, which internally calls :c:func:`entity_tick`.
+:c:func:`entity_tick` does two main steps: first it updates the runtime
+statistics of the currently running task, then it checks if the current task
+needs to be preempted. Within :c:func:`entity_tick`, :c:func:`update_curr`
+is responsible for updating the current task's runtime statistics, including
+the vruntime. The function first gets the scheduled task from the runqueue
+and the clock value of the main runqueue (struct rq). The difference between
+the start time of the task and the clock value is calculated and stored in a
+variable. Next the vruntime of the task is calculated in the
+calc_delta_fair() function, which calls __calc_delta() to compute the
+vruntime of the task based on the formula ::
+
+ vruntime += delta_exec * (NICE_0_LOAD/curr->load.weight);
+
+where:
+
+* delta_exec is the time spent by the task since the last time vruntime was
+ updated.
+* NICE_0_LOAD is the load of a task with normal priority.
+* curr is the sched_entity instance of the cfs_rq struct of the currently
+  running task.
+* load.weight is the sched_entity load_weight, described above.
+
+vruntime progresses slowly for tasks of higher priority. update_curr() then
+calls update_min_vruntime() to update the min_vruntime of the queue. In
+:c:func:`update_min_vruntime` the kernel gets the vruntimes of the leftmost
+element in the tree (*cfs_rq->rb_leftmost*), if it exists, and of the
+scheduled process, and chooses the smaller of the two. The maximum of the
+current min_vruntime and the chosen vruntime is taken as the min_vruntime of
+the queue, ensuring that min_vruntime keeps increasing and never decreases.
+min_vruntime maintains the time of the task which has run the least on the
+CPU. This value is used to compare against all the tasks in a runqueue; a
+task with the least difference between its vruntime and min_vruntime will
+get the CPU sooner.
+
+After returning from update_curr(), entity_tick() calls
+:c:func:`check_preempt_tick` to ensure fairness of scheduling. The vruntime
+of the current process is checked against the leftmost task in the RB tree
+to decide whether a task switch is necessary.
+
+**Scheduling class:** An extensible hierarchy of scheduler modules. The
+modules encapsulate scheduling policy details and are called from the
+policy-independent core code. Scheduling classes are implemented through the
+sched_class structure; fair_sched_class and rt_sched_class are
+implementations of this class. The main members of the
+:c:type:`struct sched_class <sched_class>` are hooks, which for the
+fair_sched_class (implemented as <function name>_fair) do the following:
+
+:c:member:`enqueue_task`
+ Updates the fair scheduling stats, puts the scheduling entity into the
+ rb tree and increments the nr_running variable.
+
+:c:member:`dequeue_task`
+ Moves the entity out of the rb tree when the entity is no longer runnable
+ and decrements the nr_running variable. Also updates the fair scheduling
+ stats.
+
+:c:member:`yield_task`
+ Uses the buddy mechanism to skip onto the next highest priority se at
+ every level in the CFS tree, unless doing so would introduce gross
+ unfairness in CPU time distribution.
+
+:c:member:`check_preempt_curr`
+ Checks whether the task that woke up should preempt the running task.
+
+:c:member:`pick_next_task`
+ Picks the next eligible task. This may not be the leftmost task in the
+ rbtree; instead a buddy system is used, which provides the benefits of
+ cache locality and group scheduling.
+
+:c:member:`task_tick`
+ Called from scheduler_tick(). Updates the runtime statistics of the
+ currently running task and checks if this task needs to be preempted.
+
+:c:member:`task_fork`
+ Scheduler setup for a newly forked task.
+
+:c:member:`task_dead`
+ A task struct has one reference for its use as "current". If a task dies,
+ it sets TASK_DEAD in tsk->state and calls schedule one last time. The
+ schedule call will never return, and the scheduled task must drop that
+ reference.
+
+The kernel forwards tasks to each class based on the scheduling policy
+assigned to each task. Tasks assigned SCHED_NORMAL, SCHED_IDLE and
+SCHED_BATCH go to fair_sched_class, and tasks assigned SCHED_RR and
+SCHED_FIFO go to rt_sched_class.
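The weight mapping and vruntime arithmetic described above can be sketched as standalone C. The table excerpt follows the kernel's sched_prio_to_weight values, but the helper names are hypothetical and the plain division is a simplification (the real __calc_delta() uses a precomputed inverse weight).

```c
#include <assert.h>

#define NICE_0_LOAD 1024ULL

/* Excerpt of the kernel's sched_prio_to_weight table for nice -5..5;
 * nice 0 maps to 1024 and adjacent entries differ by roughly 1.25x. */
static const unsigned long long nice_to_weight[] = {
    /* -5 .. -1 */ 3121, 2501, 1991, 1586, 1277,
    /*  0 ..  5 */ 1024,  820,  655,  526,  423, 335,
};

static unsigned long long weight_of_nice(int nice)
{
    return nice_to_weight[nice + 5];
}

/* The formula above: vruntime advances by delta * NICE_0_LOAD / weight. */
static unsigned long long calc_delta_fair_sketch(unsigned long long delta_exec,
                                                 unsigned long long weight)
{
    return delta_exec * NICE_0_LOAD / weight;
}

/* min_vruntime update: min of (curr, leftmost), but never decreasing. */
static unsigned long long update_min_vruntime_sketch(unsigned long long min_vruntime,
                                                     unsigned long long curr_v,
                                                     unsigned long long leftmost_v)
{
    unsigned long long cand = curr_v < leftmost_v ? curr_v : leftmost_v;
    return cand > min_vruntime ? cand : min_vruntime;
}
```

A nice 0 task's vruntime advances at wall-clock rate, a high-priority (negative nice) task's advances more slowly, and a low-priority task's advances faster, which is how the leftmost-first pick realizes the priority.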
diff --git a/Documentation/scheduler/cfs-overview.rst b/Documentation/scheduler/cfs-overview.rst
new file mode 100644
index 000000000000..10a15aa50fcd
--- /dev/null
+++ b/Documentation/scheduler/cfs-overview.rst
@@ -0,0 +1,46 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+=============
+CFS Overview
+=============
+
+History
+-------
+
+Linux 2.6.23 introduced a modular scheduler core and the Completely Fair
+Scheduler (CFS), implemented as a scheduling module. The scheduler has been
+improving since kernel version 2.4. In kernel 2.4 there was a single run queue
+for all processes. During every schedule the queue was locked and the
+time-slice of every task was updated. This implementation caused interactivity
+issues. A rewrite in kernel version 2.5 assigned a run queue to each processor
+and achieved O(1) run time irrespective of the number of tasks in the system.
+This scheduler performed well until the Rotating Staircase Deadline (RSDL)
+scheduler was introduced. The RSDL scheduler attempted to reduce the scheduler
+complexity, and the current CFS scheduler originated as a scheduler inspired
+by RSDL. The improvements were not only in performance and interactivity but
+also in the simplicity of the scheduling logic and the modularity of the
+scheduler code. Since kernel 2.6.23, the CFS scheduler has seen improvements
+in the areas of optimization, load balancing and group scheduling features.
+
+The RSDL scheduler removed the interactivity estimation code of the previous
+Linux scheduler. RSDL was fair by giving equal time slices to all processes,
+with no classification into I/O-bound or CPU-bound processes. CFS adopts this
+concept of fairness. CFS also takes into account the length of the sleep time
+of interactive processes, so processes which sleep less will get more CPU
+time.
+
+CFS uses a time-ordered red-black tree for each CPU. The red-black tree is a
+type of self-balancing binary search tree. Every runnable process has a node
+in the red-black tree. The process at the left-most position of the red-black
+tree is the one to be scheduled next. The red-black tree is complex, but it
+has a good worst-case running time for its operations and is efficient in
+practice: it can search, insert, and delete in O(log n) time, where n is the
+number of elements in the tree. The leaf nodes are not relevant and do not
+contain data. The red-black tree is always balanced. Because the red-black
+tree is a binary tree, the time complexities of lookup operations are
+logarithmic. However, a non-left-most lookup is hardly ever done, and the
+left-most node pointer is always cached. A red-black tree can be implemented
+with internal storage; that is, no external allocations are needed to
+maintain the data structure.
+
+The implementation of the rb tree is at ::
+
+ include/linux/rbtree.h
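
A minimal user-space sketch of the time ordering described above, using a plain (unbalanced) binary search tree in place of the kernel's red-black tree; the rebalancing is omitted, only the vruntime ordering and the leftmost-node lookup are illustrated:

```c
#include <stddef.h>

/* A toy model of the CFS run queue: nodes keyed on vruntime. The kernel
 * uses the red-black tree from include/linux/rbtree.h instead. */
struct vnode {
	unsigned long long vruntime;
	int pid;
	struct vnode *left, *right;
};

static struct vnode *rq_insert(struct vnode *root, struct vnode *n)
{
	if (!root)
		return n;
	if (n->vruntime < root->vruntime)
		root->left = rq_insert(root->left, n);
	else
		root->right = rq_insert(root->right, n);
	return root;
}

/* The task to schedule next is the leftmost node, i.e. the one with the
 * smallest vruntime. */
static struct vnode *rq_pick_next(struct vnode *root)
{
	if (!root)
		return NULL;
	while (root->left)
		root = root->left;
	return root;
}
```

In the kernel the leftmost node pointer is cached, as noted above, rather than walked on every pick.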
diff --git a/Documentation/scheduler/cfs-sched-overview.rst b/Documentation/scheduler/cfs-sched-overview.rst
index 44dac92d9462..c6eef7372d4e 100644
--- a/Documentation/scheduler/cfs-sched-overview.rst
+++ b/Documentation/scheduler/cfs-sched-overview.rst
@@ -12,4 +12,6 @@ This documentation describes the linux CFS scheduler.
.. toctree::
:maxdepth: 2
+ cfs-overview
+ cfs-data-structs
sched-design-CFS
diff --git a/Documentation/scheduler/index.rst b/Documentation/scheduler/index.rst
index 83c718d05b9d..9772cf81fd96 100644
--- a/Documentation/scheduler/index.rst
+++ b/Documentation/scheduler/index.rst
@@ -21,6 +21,7 @@ specific implementation differences.
sched-features
arch-specific.rst
sched-debugging.rst
+ scheduler-api.rst
.. only:: subproject and html
diff --git a/Documentation/scheduler/overview.rst b/Documentation/scheduler/overview.rst
index aee16feefc61..5ead8885250e 100644
--- a/Documentation/scheduler/overview.rst
+++ b/Documentation/scheduler/overview.rst
@@ -3,3 +3,258 @@
====================
Scheduler overview
====================
+
+The Linux kernel implements priority-based scheduling. More than one process
+is allowed to run at any given time, and each process is allowed to run as if
+it were the only process on the system. The process scheduler coordinates
+which process runs when. In that context, it has the following tasks:
+
+- share CPU cores equally among all currently running processes
+- pick appropriate process to run next if required, considering scheduling
+ class/policy and process priorities
+- balance processes between multiple cores in SMP systems
+
+
+The scheduler attempts to be responsive for I/O-bound processes and efficient
+for CPU-bound processes. The scheduler also applies different scheduling
+policies for real-time and normal processes, based on their respective
+priorities. Higher priorities in the kernel have a numerically smaller value.
+Real-time priorities range from 1 (highest) to 99, whereas normal priorities
+range from 100 to 139 (lowest).
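
The mapping between nice values and normal priorities can be written down as a pair of macros. These mirror the kernel's convention (see include/linux/sched/prio.h); the values are restated here purely for illustration:

```c
/* Real-time priorities occupy 0..99; normal priorities follow at 100..139. */
#define MAX_RT_PRIO	100
#define DEFAULT_PRIO	(MAX_RT_PRIO + 20)		/* nice 0 -> prio 120 */

/* nice -20..19 maps onto prio 100..139. */
#define NICE_TO_PRIO(nice)	((nice) + DEFAULT_PRIO)
#define PRIO_TO_NICE(prio)	((prio) - DEFAULT_PRIO)
```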
+
+
+Process
+=======
+
+Processes are the most fundamental abstraction in a Unix system, after files.
+A process consists of object code, data, resources, and state.
+
+Each process in the system is represented by :c:type:`struct task_struct <task_struct>`.
+Since the task_struct data type must be able to capture all information of a
+process, it is relatively large. When a process/thread is created, the kernel
+allocates a new task_struct for it. The kernel then stores this task_struct
+in a circular linked list called the task list. The next_task and prev_task
+macros allow a process to obtain its next task and its previous task
+respectively. The frequently used fields of the task_struct are:
+
+*state:* The running state of the task. The possible states are:
+
+- TASK_RUNNING: The task is currently running or in a run queue waiting to run.
+- TASK_INTERRUPTIBLE: The task is sleeping waiting for some event to occur. This
+ task can be interrupted by signals. On waking up the task transitions to
+ TASK_RUNNING.
+- TASK_UNINTERRUPTIBLE: Similar to TASK_INTERRUPTIBLE but does not wake up on
+ signals. Needs an explicit wake-up call to be woken up. Contributes to loadavg.
+- __TASK_TRACED: Task is being traced by another task like a debugger.
+- __TASK_STOPPED: Task execution has stopped and the task is not eligible to
+  run. Signals such as SIGSTOP and SIGTSTP cause this state. The task can be
+  continued by the signal SIGCONT.
+- TASK_PARKED: state to support kthread parking/unparking.
+- TASK_DEAD: If a task dies, then it sets TASK_DEAD in tsk->state and calls
+ schedule one last time. The schedule call will never return.
+- TASK_WAKEKILL: It works like TASK_UNINTERRUPTIBLE with the bonus that it
+ can respond to fatal signals.
+- TASK_WAKING: To handle concurrent waking of the same task for SMP. Indicates that
+ someone is already waking the task.
+- TASK_NOLOAD: To be used along with TASK_UNINTERRUPTIBLE to indicate an idle task
+ which does not contribute to loadavg.
+- TASK_NEW: Set during fork() to guarantee that no one will run the task: a
+  signal or any other wake event cannot wake it up and insert it on the
+  runqueue.
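
The state values are bit flags, and some commonly used states are compositions of them. The numeric values below mirror include/linux/sched.h at the time of writing and are shown only for illustration; treat the kernel header as authoritative:

```c
/* A subset of the task state bits (see include/linux/sched.h). */
#define TASK_RUNNING		0x0000
#define TASK_INTERRUPTIBLE	0x0001
#define TASK_UNINTERRUPTIBLE	0x0002
#define TASK_WAKEKILL		0x0100
#define TASK_NOLOAD		0x0400

/* Composite states are built by OR-ing the bits together: */
#define TASK_KILLABLE	(TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)
#define TASK_IDLE	(TASK_UNINTERRUPTIBLE | TASK_NOLOAD)
```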
+
+*exit_state:* The exit state of the task. The possible states are:
+
+- EXIT_ZOMBIE: The task is terminated and is waiting for its parent to
+  collect its exit information.
+- EXIT_DEAD: After the exit information is collected the task is put in this
+  state and removed from the system.
+
+*static_prio:* Static priority of a task, derived from its nice value. The
+ value of this field does not change unless the nice value is changed. Nice
+ values range from -20 to 19 and are mapped onto this field for use in the
+ scheduler.
+
+*prio:* Dynamic priority of a task. Previously a function of the static
+ priority and the task's interactivity. The value is not used by the CFS
+ scheduler but is used by the rt scheduler. It might be boosted by
+ interactivity modifiers, and it changes upon fork, setprio syscalls, and
+ whenever the interactivity estimator recalculates.
+
+*normal_prio:* Expected priority of a task. The values of static_prio and
+ normal_prio are the same for non-real-time processes. For real-time
+ processes the value of prio is used instead.
+
+*rt_priority:* Field used by real-time tasks. Real-time tasks are prioritized
+ based on this value.
+
+*sched_class:* Pointer to the sched_class structure of the scheduling class
+ the task belongs to.
+
+*sched_entity:* The per-task sched_entity structure used by the CFS scheduler.
+
+*policy:* Value of the scheduling policy. The possible values are:
+
+* SCHED_NORMAL: Regular tasks use this policy.
+
+* SCHED_BATCH: Tasks which need to run longer without pre-emption use this
+ policy. Suitable for batch jobs.
+* SCHED_IDLE: Policy used by background tasks.
+
+* SCHED_FIFO & SCHED_RR: These policies are for real-time tasks, handled by
+  the real-time scheduler.
+
+*nr_cpus_allowed:* Number of CPUs on which the task is allowed to run. The
+ task's CPU affinity is set using the sched_setaffinity() system call.
+
+
+Thread
+=======
+
+From a process context, a thread is an execution context or flow of execution
+in a process. Every process consists of at least one thread. In a
+multiprocessor system, multiple threads in a process enable parallelism.
+Each thread has its own task_struct. Threads are created like normal tasks,
+but the clone() call is provided with flags that enable the sharing of
+resources such as the address space ::
+
+ clone(CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND, 0);
+
+The scheduler schedules task_structs, so Linux doesn't differentiate between
+threads and processes.
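
To contrast with the thread case: a child created with fork() (i.e. without CLONE_VM) gets its own copy of the address space, so a write in the child is not visible to the parent. A minimal sketch, where fork_demo() is an illustrative helper:

```c
#include <sys/wait.h>
#include <unistd.h>

static int glob = 1;	/* copied into the child's separate address space */

/* Returns the parent's view of glob after the child modified its own copy. */
static int fork_demo(void)
{
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		glob = 99;	/* modifies only the child's copy */
		_exit(0);
	}
	waitpid(pid, NULL, 0);
	return glob;		/* still 1: the write was not shared */
}
```

With CLONE_VM, as in the clone() call above, both tasks would see the same memory and the write would be shared.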
+
+The Scheduler Entry Point
+=========================
+
+The main scheduler entry point is the architecture-independent schedule()
+function defined in kernel/sched/core.c. It implements the scheduler: its
+objective is to find a process in the runqueue list and then assign the CPU
+to it. It is invoked, directly or in a lazy (deferred) way, from many
+different places in the kernel. A lazy invocation does not call the function
+by its name, but gives the kernel a hint (by setting a flag) that the
+scheduler needs to be called soon. The need_resched flag is a message to the
+kernel that the scheduler should be invoked as soon as possible because
+another process deserves to run.
+
+The following are some of the places that notify the kernel to schedule:
+
+* scheduler_tick() : This function is called on every timer interrupt with HZ
+  frequency and calls the scheduler on any task that has used up its quantum
+  of CPU time.
+
+* Running task goes to sleep state : Right before a task goes to sleep, it
+  changes its state to either TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE, and
+  schedule() is called to pick the next task to run. For instance,
+  prepare_to_wait() is one of the functions used to begin putting the task to
+  sleep.
+
+* try_to_wake_up() : This function awakens a sleeping process by setting the
+  task's state to TASK_RUNNING and putting it back on the run queue of the
+  local CPU. For example, the function is invoked to wake up processes
+  included in a wait queue or to resume execution of processes waiting for a
+  signal.
+
+* yield() : A process voluntarily yields the CPU by calling this function.
+  It directly calls schedule(), but its use is strongly discouraged.
+
+* wait_event() : The process is put to sleep (TASK_UNINTERRUPTIBLE)
+ until the condition evaluates to true. The condition is checked each time the
+ wait-queue wq is woken up.
+
+* _cond_resched() : It gives the scheduler a chance to run a
+ higher-priority process.
+
+* __cond_resched_lock() : if a reschedule is pending, drop the given
+ lock, call schedule, and on return reacquire the lock.
+
+* do_task_dead() : Changes the task state to TASK_DEAD and calls schedule()
+  to pick the next task to run.
+
+* preempt_schedule() : The function checks whether local interrupts are
+ enabled and the preempt_count field of current is zero; if both conditions are
+ true, it invokes schedule() to select another process to run.
+
+* preempt_schedule_irq() : It sets the PREEMPT_ACTIVE flag in the
+  preempt_count field, temporarily sets the big kernel lock counter to -1,
+  enables the local interrupts, and invokes schedule() to select another
+  process to run. When the former process resumes, preempt_schedule_irq()
+  restores the previous value of the big kernel lock counter, clears the
+  PREEMPT_ACTIVE flag, and disables local interrupts. The schedule() function
+  will continue to be invoked as long as the TIF_NEED_RESCHED flag of the
+  current process is set.
+
+Calling the functions mentioned above leads to a call to __schedule(). Note
+that preemption must be disabled before __schedule() is called and enabled
+after the call, using the preempt_disable() and preempt_enable() family of
+functions.
+
+
+The steps during invocation
+---------------------------
+
+1. Disables pre-emption to avoid another task pre-empting the scheduling
+   thread, as the Linux kernel is pre-emptive.
+2. Retrieves the run queue of the current processor and obtains the lock of
+   the current rq, to allow only one thread to modify the runqueue at a time.
+3. Examines the state of the previously executed task from when schedule()
+   was called. If it is not runnable and has not been pre-empted in kernel
+   mode, then it should be removed from the runqueue. However, if it has
+   non-blocked pending signals, its state is set to TASK_RUNNING and it is
+   left in the runqueue.
+4. The next action is to check if any runnable tasks exist in the CPU's runqueue.
+ If not, idle_balance() is called to get some runnable tasks from other CPUs.
+5. Next, the corresponding class is asked to pick the next suitable task to
+   be scheduled on the CPU by calling the hook pick_next_task(). This is
+   followed by clearing the need_resched flag which might have been set
+   previously to invoke the schedule() function call in the first place.
+   pick_next_task() is also implemented in core.c. It iterates through the
+   list of scheduling classes to find the class with the highest priority
+   that has a runnable task. If the class is found, the scheduling class hook
+   is called. Since most tasks are handled by the sched_fair class, a
+   shortcut to this class is implemented at the beginning of the function.
+6. schedule() checks if pick_next_task() found a new task or if it picked the same
+ task again that was running before. If the latter is the case, no task switch
+ is performed and the current task just keeps running. If a new task is found,
+ which is the more likely case, the actual task switch is executed by calling
+ context_switch(). Internally, context_switch() switches to the new task's
+ memory map and swaps register state and stack.
+7. To finish up, the runqueue is unlocked and pre-emption is re-enabled. In case
+ pre-emption was requested during the time in which it was disabled, schedule()
+ is run again right away.
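
Step 5 above, iterating the scheduling classes in priority order, can be sketched in user space as follows. pick_next_task_model() and the class table are simplified illustrations, not kernel code; the class names merely mirror the kernel's:

```c
#include <stddef.h>

struct task { int pid; };

struct sched_class_model {
	const char *name;
	struct task *(*pick_next_task)(void);
};

static struct task fair_task = { 42 };

static struct task *rt_pick(void)   { return NULL; }	/* no RT work queued */
static struct task *fair_pick(void) { return &fair_task; }
static struct task *idle_pick(void) { static struct task idle; return &idle; }

/* Classes ordered from highest to lowest priority; the idle class always
 * returns a task, so the loop always terminates with a choice. */
static const struct sched_class_model classes[] = {
	{ "rt_sched_class",   rt_pick },
	{ "fair_sched_class", fair_pick },
	{ "idle_sched_class", idle_pick },
};

static struct task *pick_next_task_model(void)
{
	size_t i;

	for (i = 0; i < sizeof(classes) / sizeof(classes[0]); i++) {
		struct task *t = classes[i].pick_next_task();

		if (t)
			return t;
	}
	return NULL;
}
```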
+
+Scheduler State Transition
+==========================
+
+A very high level scheduler state transition flow with a few states can be
+depicted as follows.
+
+.. kernel-render:: DOT
+ :alt: DOT digraph of Scheduler state transition
+ :caption: Scheduler state transition
+
+ digraph sched_transition {
+ node [shape = point, label="existing task\n calls fork()"]; fork
+ node [shape = box, label="TASK_NEW\n(Ready to run)"] tsk_new;
+ node [shape = box, label="TASK_RUNNING\n(Ready to run)"] tsk_ready_run;
+ node [shape = box, label="TASK_RUNNING\n(Running)"] tsk_running;
+ node [shape = box, label="TASK_DEAD\nEXIT_ZOMBIE"] exit_zombie;
+ node [shape = box, label="TASK_INTERRUPTIBLE\nTASK_UNINTERRUPTIBLE\nTASK_WAKEKILL"] tsk_int;
+ fork -> tsk_new [ label = "task\nforks" ];
+ tsk_new -> tsk_ready_run;
+ tsk_ready_run -> tsk_running [ label = "schedule() calls context_switch()" ];
+ tsk_running -> tsk_ready_run [ label = "task is pre-empted" ];
+ subgraph int {
+ tsk_running -> tsk_int [ label = "task needs to wait for event" ];
+ tsk_int -> tsk_ready_run [ label = "event occurred" ];
+ }
+ tsk_int -> exit_zombie [ label = "task exits via do_exit()" ];
+ }
+
+The scheduler provides tracepoints for all major events of the scheduler.
+The tracepoints are defined in ::
+
+ include/trace/events/sched.h
+
+Using these tracepoints it is possible to model the scheduler state
+transitions in an automata model. The following conference paper discusses
+such modeling:
+
+https://www.researchgate.net/publication/332440267_Modeling_the_Behavior_of_Threads_in_the_PREEMPT_RT_Linux_Kernel_Using_Automata
+
+To model the scheduler efficiently, the system was divided into generators
+and specifications. Some of the generators used were "need_resched",
+"sleepable", "runnable", "thread_context" and "scheduling context". The
+specifications are the necessary and sufficient conditions to call the
+scheduler. New trace events were added to specify the generators and
+specifications. In case a kernel event referred to more than one automaton
+event, extra fields of the kernel event were used to distinguish between the
+automaton events. The final model was the parallel composition of all
+generators and specifications, composed of 15 events, 7 generators and 10
+specifications. This resulted in 149 states and 327 transitions.
diff --git a/Documentation/scheduler/scheduler-api.rst b/Documentation/scheduler/scheduler-api.rst
new file mode 100644
index 000000000000..5fea0eb0f1ff
--- /dev/null
+++ b/Documentation/scheduler/scheduler-api.rst
@@ -0,0 +1,34 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+=============================
+Scheduler related functions
+=============================
+
+
+.. kernel-doc:: kernel/sched/core.c
+ :functions: __schedule
+
+.. kernel-doc:: kernel/sched/core.c
+ :functions: scheduler_tick
+
+
+.. kernel-doc:: kernel/sched/core.c
+ :functions: try_to_wake_up
+
+.. kernel-doc:: kernel/sched/core.c
+ :functions: __cond_resched_lock
+
+.. kernel-doc:: kernel/sched/core.c
+ :functions: do_task_dead
+
+.. kernel-doc:: kernel/sched/core.c
+ :functions: preempt_schedule
+
+.. kernel-doc:: kernel/sched/core.c
+ :functions: preempt_schedule_irq
+
+.. kernel-doc:: kernel/sched/core.c
+ :functions: prepare_task_switch
+
+.. kernel-doc:: kernel/sched/core.c
+ :functions: finish_task_switch
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1a9983da4408..ccefc820557f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3578,8 +3578,12 @@ unsigned long long task_sched_runtime(struct task_struct *p)
return ns;
}
-/*
- * This function gets called by the timer code, with HZ frequency.
+/**
+ * scheduler_tick - called by the timer code with HZ frequency
+ *
+ * This function is called on every timer interrupt with HZ frequency and
+ * calls scheduler on any task that has used up its quantum of CPU time.
+ *
* We call it with interrupts disabled.
*/
void scheduler_tick(void)
@@ -3958,8 +3962,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
BUG();
}
-/*
- * __schedule() is the main scheduler function.
+/**
+ * __schedule() - The main scheduler function.
*
* The main means of driving the scheduler and thus entering this function are:
*
@@ -4086,6 +4090,12 @@ static void __sched notrace __schedule(bool preempt)
balance_callback(rq);
}
+/**
+ * do_task_dead - Final step of task exit
+ *
+ * Changes the task state to TASK_DEAD and calls schedule() to pick the
+ * next task to run.
+ */
void __noreturn do_task_dead(void)
{
/* Causes final put_task_struct in finish_task_switch(): */
@@ -4244,7 +4254,9 @@ static void __sched notrace preempt_schedule_common(void)
}
#ifdef CONFIG_PREEMPTION
-/*
+/**
+ * preempt_schedule - schedule() entry point from in-kernel preemption
+ *
* This is the entry point to schedule() from in-kernel preemption
* off of preempt_enable.
*/
@@ -4316,7 +4328,9 @@ EXPORT_SYMBOL_GPL(preempt_schedule_notrace);
#endif /* CONFIG_PREEMPTION */
-/*
+/**
+ * preempt_schedule_irq - schedule() entry point from irq-context preemption
+ *
* This is the entry point to schedule() from kernel preemption
* off of irq context.
* Note, that this is called and return with irqs disabled. This will
@@ -5614,6 +5628,11 @@ SYSCALL_DEFINE0(sched_yield)
}
#ifndef CONFIG_PREEMPTION
+/**
+ * _cond_resched - give the scheduler a chance to run a higher-priority process
+ *
+ * gives the scheduler a chance to run a higher-priority process
+ */
int __sched _cond_resched(void)
{
if (should_resched(0)) {
@@ -5626,9 +5645,10 @@ int __sched _cond_resched(void)
EXPORT_SYMBOL(_cond_resched);
#endif
-/*
- * __cond_resched_lock() - if a reschedule is pending, drop the given lock,
+/**
+ * __cond_resched_lock - if a reschedule is pending, drop the given lock,
* call schedule, and on return reacquire the lock.
+ * @lock: target lock
*
* This works OK both with and without CONFIG_PREEMPTION. We do strange low-level
* operations here to prevent schedule() from being called twice (once via
--
2.17.1
On Wed, Apr 01, 2020 at 01:00:28PM +0300, John Mathew wrote:
I despise RST, it's an unreadable mess, but I did skim the document and
felt I should comment on this:
> +* _cond_resched() : It gives the scheduler a chance to run a
> + higher-priority process.
> +
> +* __cond_resched_lock() : if a reschedule is pending, drop the given
> + lock, call schedule, and on return reacquire the lock.
Those are not functions anybody should be using; the normal entry points
are: cond_resched() and cond_resched_lock().
> +Scheduler State Transition
> +==========================
> +
> +A very high level scheduler state transition flow with a few states can be
> +depicted as follows.
> +
> +.. kernel-render:: DOT
> + :alt: DOT digraph of Scheduler state transition
> + :caption: Scheduler state transition
> +
> + digraph sched_transition {
> + node [shape = point, label="exisiting task\n calls fork()"]; fork
> + node [shape = box, label="TASK_NEW\n(Ready to run)"] tsk_new;
> + node [shape = box, label="TASK_RUNNING\n(Ready to run)"] tsk_ready_run;
> + node [shape = box, label="TASK_RUNNING\n(Running)"] tsk_running;
> + node [shape = box, label="TASK_DEAD\nEXIT_ZOMBIE"] exit_zombie;
> + node [shape = box, label="TASK_INTERRUPTIBLE\nTASK_UNINTERRUPTIBLE\nTASK_WAKEKILL"] tsk_int;
> + fork -> tsk_new [ label = "task\nforks" ];
> + tsk_new -> tsk_ready_run;
> + tsk_ready_run -> tsk_running [ label = "schedule() calls context_switch()" ];
> + tsk_running -> tsk_ready_run [ label = "task is pre-empted" ];
> + subgraph int {
> + tsk_running -> tsk_int [ label = "task needs to wait for event" ];
> + tsk_int -> tsk_ready_run [ label = "event occurred" ];
> + }
> + tsk_int -> exit_zombie [ label = "task exits via do_exit()" ];
> + }
And that is a prime example of why I hates RST, it pretty much mandates
you view this with something other than a text editor.
Also, Daniel, you modeled all this, is the above anywhere close?
> +Scheduler provides trace points tracing all major events of the scheduler.
> +The tracepoints are defined in ::
> +
> + include/trace/events/sched.h
> +
> +Using these treacepoints it is possible to model the scheduler state transition
> +in an automata model. The following conference paper discusses such modeling.
> +
> +https://www.researchgate.net/publication/332440267_Modeling_the_Behavior_of_Threads_in_the_PREEMPT_RT_Linux_Kernel_Using_Automata
Ah, you've found Daniel ;-)
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 1a9983da4408..ccefc820557f 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3578,8 +3578,12 @@ unsigned long long task_sched_runtime(struct task_struct *p)
> return ns;
> }
>
> -/*
> - * This function gets called by the timer code, with HZ frequency.
> +/**
> + * scheduler_tick -
> + *
> + * This function is called on every timer interrupt with HZ frequency and
> + * calls scheduler on any task that has used up its quantum of CPU time.
> + *
> * We call it with interrupts disabled.
> */
> void scheduler_tick(void)
> @@ -3958,8 +3962,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> BUG();
> }
>
> -/*
> - * __schedule() is the main scheduler function.
> +/**
> + * __schedule() - The main scheduler function.
> *
> * The main means of driving the scheduler and thus entering this function are:
> *
> @@ -4086,6 +4090,12 @@ static void __sched notrace __schedule(bool preempt)
> balance_callback(rq);
> }
>
> +/**
> + * do_task_dead - Final step of task exit
> + *
> + * Changes the the task state to TASK_DEAD and calls schedule to pick next
> + * task to run.
> + */
That has whitespace damage.
> void __noreturn do_task_dead(void)
> {
> /* Causes final put_task_struct in finish_task_switch(): */
> @@ -4244,7 +4254,9 @@ static void __sched notrace preempt_schedule_common(void)
> }
>
> #ifdef CONFIG_PREEMPTION
> -/*
> +/**
> + * preempt_schedule -
> + *
> * This is the entry point to schedule() from in-kernel preemption
> * off of preempt_enable.
> */
> @@ -4316,7 +4328,9 @@ EXPORT_SYMBOL_GPL(preempt_schedule_notrace);
>
> #endif /* CONFIG_PREEMPTION */
>
> -/*
> +/**
> + * preempt_schedule_irq -
> + *
> * This is the entry point to schedule() from kernel preemption
> * off of irq context.
> * Note, that this is called and return with irqs disabled. This will
> @@ -5614,6 +5628,11 @@ SYSCALL_DEFINE0(sched_yield)
> }
>
> #ifndef CONFIG_PREEMPTION
> +/**
> + * _cond_resched -
> + *
> + * gives the scheduler a chance to run a higher-priority process
> + */
> int __sched _cond_resched(void)
> {
> if (should_resched(0)) {
> @@ -5626,9 +5645,10 @@ int __sched _cond_resched(void)
> EXPORT_SYMBOL(_cond_resched);
> #endif
>
> -/*
> - * __cond_resched_lock() - if a reschedule is pending, drop the given lock,
> +/**
> + * __cond_resched_lock - if a reschedule is pending, drop the given lock,
> * call schedule, and on return reacquire the lock.
> + * @lock: target lock
> *
> * This works OK both with and without CONFIG_PREEMPTION. We do strange low-level
> * operations here to prevent schedule() from being called twice (once via
Just know that the first time someone comes and whines about how a
scheduler comment doesn't build or generates bad output, I remove the
/** kerneldoc thing.
Also, like I said above, _cond_resched() and __cond_resched_lock()
really should not be exposed like this, they're not API.
On Wed, Apr 01, 2020 at 01:00:29PM +0300, John Mathew wrote:
> +Context switching, the switching from a running task to another, is handled by
> +the :c:func:`context_switch()` function defined in kernel/sched.c . It is called
Per my insistent complaints, :c:func: is now completely redundant,
anything with '()' on it is now automagically treated as a function name.
If it's not readable in a text editor, it's broken.
> +by __schedule() when a new process has been selected to run.
> +
> + The execution flow is as follows:
> +
> +* Calls prepare_task_switch() to prepare both previous and new task by
> + storing or changing some values in their task_struct.
This is wildly inaccurate. Take for instance perf_event_task_sched_out(),
that doesn't just store or change a few values.
> +* Calls macro :c:macro:`arch_start_context_switch()`
> + A facility to provide batching of the reload of page tables and other process
> + state with the actual context switch code for paravirtualized guests. By
> + convention, only one of the batched update (lazy) modes (CPU, MMU) should be
> + active at any given time, entry should never be nested, and entry and exits
> + should always be paired. This is for sanity of maintaining and reasoning about
> + the kernel code. In this case, the exit (end of the context switch) is in
> + architecture-specific code, and so doesn't need a generic definition.
> +
> +
> +* The next few steps consists of handling the transfer of real and anonymous
> + address spaces between the switching tasks. Four possible context switches are
> +
> + - kernel task switching to another kernel task.
> + - user task switching to a kernel task.
> + - kernel task switching to user task,
> + - user task switching to user task.
> +
> +For a kernel task switching to kernel task :c:func:`enter_lazy_tlb()` is called
> +which is an architecture specific implementation to handle a context without an
> +mm. Architectures implement lazy tricks to minimize tlb flushes here.
> +Then the active address space from the previous task is borrowed (transferred)
> +to the next task. The active address space of the previous task is set to NULL.
> +
> +For a user task switching to kernel task it will have a real address space. This
> +address space is pinned by calling :c:func:`mmgrab()` . This makes sure that the
> +address space will not get freed even after the previous task exits.
> +
> +For a user task switching to a user task, the architecture specific
> +:c:func:`switch_mm_irqs_off()` or :c:func:`switch_mm()` functions are called.
> +The main functionality of these calls is to switch the address space between
> +the user space processes. This includes switching the page table pointers
> +and ensuring that the new address space has a valid ASID.
That's worded a bit odd; you need an ASID allocated to a process in
order for it to have an active address space.
> +For a kernel task switching to a user task, the active address space of the
> +kernel task is transferred to the user task and the active address space of the
> +kernel task is set to NULL.
That reads odd. The kernel to user transition switches away from
whatever address space the kernel task borrowed and to the address space
of the user task (of course hoping they're the same).
But we don't transfer anything to the user task, particularly kernel
threads don't have an address space to give.
> +
> +* Next the :c:func:`prepare_lock_switch()` function is called for a lockdep
> + release of the runqueue lock to handle the special case of the scheduler in which
> + the runqueue lock will be released by the next task.
> +
> +* Then the architecture specific implementation of the :c:func:`switch_to()`
> + function is called to switch the register state and the stack. This involves
> + saving and restoring stack information and the processor registers and any other
> + architecture-specific state that must be managed and restored on a per-process
> + basis.
> +
> +* finish_task_switch() must be called after the context switch, paired with
> +  a prepare_task_switch() call before the context switch. It will reconcile
> +  locking set up by prepare_task_switch, and do any other
> +  architecture-specific cleanup actions.
You spend a lot of words on the prepare, but then ignore most of the
finish_task_switch() magic. That seems unbalanced.
> diff --git a/Documentation/scheduler/index.rst b/Documentation/scheduler/index.rst
> index 9772cf81fd96..289e06a68764 100644
> --- a/Documentation/scheduler/index.rst
> +++ b/Documentation/scheduler/index.rst
> +MIPS Context Switch
> +-------------------
> +
> +Context switch behavior specific to MIPS begins in the way :c:macro:`switch_to()`
> +macro is implemented. The main steps in the MIPS implementation of the macro are:
> +
> +* Handle the FPU affinity management feature. This feature is enabled by
> +  :c:macro:`CONFIG_MIPS_MT_FPAFF` at build time. The macro checks if the FPU
> +  was used in the most recent time slice. In case the FPU was not used, the
> +  restriction of having to run on a cpu with an FPU is removed.
Last time I looked at that gunk it was broken, presumably it still is.
In particular it is racy against userspace changing task affinity
itself.
> +* For the previous task, disable the fpu and clear the bit indicating the FPU was
> + used in this quantum for the task.
> +* If fpu is enabled in the next task, check FCSR for any unmasked exceptions
> + pending, clear them and send a signal.
> +* If the MIPS DSP module is enabled, save the dsp context of the previous
> +  task and restore the dsp context of the next task.
> +* If coprocessor 2 is present, set the access allowed field of coprocessor 2.
> +* If the coprocessor 2 access allowed field was set in the previous task,
> +  clear it.
> +* Clear the access allowed field of coprocessor 2.
> +* clear the llbit on MIPS release 6 such that instruction eretnc can be used
> + unconditionally when returning to userland in entry.S. LLbit is used to specify
> + operation for instructions that provide atomic read-modify-write. LLbit is set
> + when a linked load occurs and is tested by the conditional store. It is cleared,
> + during other CPU operation, when a store to the location would no longer be
> + atomic. In particular, it is cleared by exception return instructions.
> + eretnc instruction enables to return from interrupt, exception, or error trap
> + without clearing the LLbit.
> +* clear the global variable ll_bit used by mips exception handler.
> +* write the thread pointer to the mips userlocal register if the cpu supports
> + this feature. This register is not interpreted by hardware and can be used to
> + share data between privileged and unprivileged software.
> +* if hardware watchpoint feature is enabled during build the watchpoint registers
> + are restored from the next task.
> +* Finally the mips processor specific implementation of the :c:func:`resume()`
> + function is called. It restores the registers of the next task including the
> + stack pointer. The implementation is in assembly::
> +
> + arch/mips/kernel/r4k_switch.S
There's also octeon_switch.S r2300_switch.S
> diff --git a/Documentation/scheduler/x86-context-switch.rst b/Documentation/scheduler/x86-context-switch.rst
> new file mode 100644
> index 000000000000..ae7b2e09453a
> --- /dev/null
> +++ b/Documentation/scheduler/x86-context-switch.rst
> @@ -0,0 +1,59 @@
> +.. SPDX-License-Identifier: GPL-2.0+
> +
> +X86 Context Switch
> +------------------
> +
> +The x86 context switching logic is as follows.
> +After the MM switch in the scheduler's :c:func:`context_switch()`, a call
> +to the x86 implementation of :c:macro:`switch_to()`
> +is made. For the x86 arch it is located at ::
> +
> + arch/x86/include/asm/switch_to.h
> +
> +Since 4.9, switch_to() has been broken into two parts: a :c:func:`prepare_switch_to()`
> +macro, and the inline assembly portion, which has been moved to an actual
> +assembly file ::
> +
> + arch/x86/entry/entry_64.S
and entry_32.S
Although I suspect it will soon move to another file... it has no
business being in .entry.text
> +
> +* There is still a C portion of the switch which occurs via a jump in the middle
> + of the assembly code. The source is located in arch/x86/kernel/process_64.c
> + since 2.6.24
Uhm what? afaict there is no jmp in the middle. There is a jump at the
end, which is a tail-call optimization.
> +The main function of prepare_switch_to() is to handle the case when the stack
> +uses virtual memory. This is configured at build time and is enabled in most
> +modern distributions. The function accesses the stack pointer to prevent a
> +double fault. Switching to a stack whose top-level paging entry is not
> +present in the current MM will result in a page fault, which will be promoted
> +to a double fault, and the result is a panic. So it is necessary to probe the
> +stack now so that vmalloc_fault can fix the page tables.
> +
> +The main steps of the inline assembly function __switch_to_asm() are:
> +
> +* store the callee-saved registers on the old stack, which will be switched
> +  away from.
> +* swap the stack pointers between the old and the new task.
> +* move the stack canary value to the current CPU's interrupt stack.
> +* if the return trampoline is enabled, overwrite all entries in the RSB on
> +  exiting a guest, to prevent malicious branch target predictions from
> +  affecting the host kernel.
> +* restore all registers from the new stack, in the reverse of the order in
> +  which they were pushed.
> +
> +The main steps of the C function :c:func:`__switch_to()`, which the assembly
> +code jumps to, are as follows:
Note that this is effectively the new task running.
> +* retrieve the thread :c:type:`struct thread_struct <thread_struct>` and fpu
> + :c:type:`struct fpu <fpu>` structs from the next and previous tasks.
> +* get the current CPU's TSS :c:type:`struct tss_struct <tss_struct>`
> +* save the current FPU state while on the old task.
> +* store the FS and GS segment registers before changing the thread local storage.
> +* reload the GDT for the new task's TLS.
You mentioned arch_start_context_switch() previously; this is where
arch_end_context_switch() lives.
> +* save the ES and DS segments of the previous task and load the same from the
> +  next task.
> +* load the FS and GS segment registers.
> +* update the current task of the CPU.
> +* update the top of stack pointer for the CPU for the entry trampoline.
> +* initialize FPU state for next task.
> +* set sp0 to point to the entry trampoline stack.
> +* call :c:func:`__switch_to_xtra()` to handle debug registers, I/O bitmaps and
> +  speculation mitigations.
> +* write the task's CLOSid/RMID to IA32_PQR_MSR.
On 4/1/20 12:35 PM, Peter Zijlstra wrote:
>> +Scheduler State Transition
>> +==========================
>> +
>> +A very high level scheduler state transition flow with a few states can be
>> +depicted as follows.
>> +
>> +.. kernel-render:: DOT
>> + :alt: DOT digraph of Scheduler state transition
>> + :caption: Scheduler state transition
>> +
>> + digraph sched_transition {
>> + node [shape = point, label="existing task\n calls fork()"]; fork
>> + node [shape = box, label="TASK_NEW\n(Ready to run)"] tsk_new;
>> + node [shape = box, label="TASK_RUNNING\n(Ready to run)"] tsk_ready_run;
>> + node [shape = box, label="TASK_RUNNING\n(Running)"] tsk_running;
>> + node [shape = box, label="TASK_DEAD\nEXIT_ZOMBIE"] exit_zombie;
>> + node [shape = box, label="TASK_INTERRUPTIBLE\nTASK_UNINTERRUPTIBLE\nTASK_WAKEKILL"] tsk_int;
>> + fork -> tsk_new [ label = "task\nforks" ];
>> + tsk_new -> tsk_ready_run;
>> + tsk_ready_run -> tsk_running [ label = "schedule() calls context_switch()" ];
>> + tsk_running -> tsk_ready_run [ label = "task is pre-empted" ];
>> + subgraph int {
>> + tsk_running -> tsk_int [ label = "task needs to wait for event" ];
>> + tsk_int -> tsk_ready_run [ label = "event occurred" ];
>> + }
>> + tsk_int -> exit_zombie [ label = "task exits via do_exit()" ];
>> + }
> And that is a prime example of why I hates RST, it pretty much mandates
> you view this with something other than a text editor.
The good thing about the dot format is that we can convert it to many other
formats, including text:
[bristot@x1 ~]$ cat sched_transition.dot | graph-easy
*
|
| task
| forks
v
+------------------------------------+
| TASK_NEW |
| (Ready to run) |
+------------------------------------+
|
|
v
+ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -+
' int '
' '
' +------------------------------------+ '
' | TASK_RUNNING | '
' +--------------> | (Ready to run) | <--+ '
' | +------------------------------------+ | '
' | | | '
' | | schedule() calls context_switch() | task is pre-empted '
' | v | '
' | +------------------------------------+ | '
' | | TASK_RUNNING | | '
' | | (Running) | ---+ '
' | event occurred +------------------------------------+ '
' | | '
' | | - - - - - - - - - - - -+
' | | '
' | | task needs to wait for event '
' | v '
' | +------------------------------------+ '
' | | TASK_INTERRUPTIBLE | '
' | | TASK_UNINTERRUPTIBLE | '
' +--------------- | TASK_WAKEKILL | '
' +------------------------------------+ '
' '
+ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - +
|
| task exits via do_exit()
v
+------------------------------------+
| TASK_DEAD |
| EXIT_ZOMBIE |
+------------------------------------+
Is there a way to also add this representation, while hiding it
when using a graphical reader?
PS: I know nothing about rst, only about rts - real-time systems...
yeah, it is a bad joke haha.
> Also, Daniel, you modeled all this, is the above anywhere close?
hum... let's say that we modeled things with a different goal. Here the
idea is to explain, document... not to create an explicit model.
For instance, there is the context switch _in_, but not the _out_ in the
above example.
But, well, if the goal is to document, it is nice to have graphical
representations.
>> +The scheduler provides tracepoints tracing all major events of the scheduler.
>> +The tracepoints are defined in ::
>> +
>> + include/trace/events/sched.h
>> +
>> +Using these tracepoints it is possible to model the scheduler state transition
>> +in an automata model. The following conference paper discusses such modeling.
this is a workshop paper.
>> +
>> +https://www.researchgate.net/publication/332440267_Modeling_the_Behavior_of_Threads_in_the_PREEMPT_RT_Linux_Kernel_Using_Automata
> Ah, you've found Daniel ;-)
:-)
This is a better reference (a journal paper).
Daniel B. de Oliveira, Rômulo S. de Oliveira, Tommaso Cucinotta, **A thread
synchronization model for the PREEMPT_RT Linux kernel**, *Journal of Systems
Architecture*, Volume 107, 2020, 101729, ISSN 1383-7621,
https://doi.org/10.1016/j.sysarc.2020.101729.
-- Daniel
On Wed, Apr 01, 2020 at 01:00:28PM +0300, John Mathew wrote:
> +====================
> +CFS Data Structures
> +====================
> +
> +Main parts of the Linux scheduler are:
The main parts ...
> +**Running Queue:** This is the central data structure of process scheduling. It
I've never heard anyone call this 'Running Queue'. It's always called
'run queue'.
> +the CPU core. The main members of the :c:type:`struct rq <rq>` are:
> +
> +:c:member:`nr_running`
> + Total number of tasks on the runqueue.
This seems like it should be kernel-doc so the documentation is with the
code ... meaning it might possibly get updated when the code changes as
opposed to languishing over here.
> +Each rq struct points to a cfs_rq struct which represents the rb tree. The main
How does a cfs_rq struct represent an rb tree? I suspect what you intended
to say is that the cfs_rq structs are stored in an rb tree (I'm not familiar
with the details of the scheduler implementation).
More generally, you're making a mistake that a lot of documentation
writers make which is to overdescribe the current implementation.
The important thing to document is that they're stored in a tree; the
type of tree used is an irrelevant detail. If that changes, we shouldn't
need to update this documentation.
> +Each scheduling entity may be run from its parent's runqueue. Scheduler traverses
The scheduler ...
also, please format your line lengths to more like 75 characters; don't go
all the way to 80. Just use 'fmt'; its defaults work fine.
> +vruntime is the value by which tasks are ordered on the red-black tree. Tasks are
> +arranged in increasing order of vruntime which is the amount of time a task has
> +spent running on the cpu. vruntime of a task is updated periodically based on the
> +:c:func:`scheduler_tick` function.
This is a backwards explanation.
vruntime is the amount of time a task has spent running on the cpu. It is
updated periodically by scheduler_tick(). Tasks are stored in the
scheduler's tree sorted by vruntime.
> +History
> +-------
> +
> +Linux 2.6.23 introduced a modular scheduler core and a Completely Fair Scheduler
> +(CFS) implemented as a scheduling module. Scheduler has been improving since
> +kernel version 2.4. In kernel 2.4 there was one running queue for all processes.
I would drop 'Scheduler has been improving since kernel version 2.4'. I
just assume that Linux has been improving.
> +CFS uses a time ordered red-black tree for each CPU. The red-black tree is a type
> +of self-balancing binary search tree. Every running process, has a node in the
Don't explain what an rbtree is, just link to the rbtree documentation.
Which, admittedly, hasn't been converted to rst format yet, but link to
rbtree.txt and someone else can fix that up when they do the conversion.
Hi,
On 01/04/20 13:00, John Mathew wrote:
> Add documentation for
> -scheduler overview
> -scheduler state transtion
> -CFS overview
> -CFS data structs
>
> Add rst for scheduler APIs and modify sched/core.c
> to add kernel-doc comments.
>
> Suggested-by: Lukas Bulwahn <[email protected]>
> Co-developed-by: Mostafa Chamanara <[email protected]>
> Signed-off-by: Mostafa Chamanara <[email protected]>
> Signed-off-by: John Mathew <[email protected]>
> ---
[...]
> +The kernel forwards tasks to each class based on the scheduling policy assigned
> +to each task. Tasks assigned SCHED_NORMAL, SCHED_IDLE and SCHED_BATCH
> +go to fair_sched_class, and tasks assigned SCHED_RR and SCHED_FIFO go to
> +rt_sched_class.
I think you also need to mention that SCHED_DEADLINE goes to
dl_sched_class.
[...]
> +*policy:* Value for the scheduling policy. The possible values are:
> +
> +* SCHED_NORMAL: Regular tasks use this policy.
> +
> +* SCHED_BATCH: Tasks which need to run longer without pre-emption use this
> + policy. Suitable for batch jobs.
> +* SCHED_IDLE: Policy used by background tasks.
> +
> +* SCHED_FIFO & SCHED_RR: These policies are for real-time tasks, handled by
> +  the real-time scheduler.
Here as well. Maybe also add a pointer to the related documentation?
Documentation/scheduler/sched-deadline.txt
Thanks,
Juri
On Wed, Apr 01, 2020 at 01:47:04PM +0200, Daniel Bristot de Oliveira wrote:
> On 4/1/20 12:35 PM, Peter Zijlstra wrote:
> >> +Scheduler State Transition
> >> +==========================
> >> [...]
> > And that is a prime example of why I hates RST, it pretty much mandates
> > you view this with something other than a text editor.
>
> The good thing about the dot format is that we can convert it to many other
> formats, including text:
Oh, I know and love dot files, I generate them occasionally. But they
stink as end-result, which is what it is here.
If you can't read a document (or worse comment) in a code editor it's
broken (and yes, I know some subsystems have a different opinion here).
On Wed, Apr 01 2020, Peter Zijlstra wrote:
> On Wed, Apr 01, 2020 at 01:47:04PM +0200, Daniel Bristot de Oliveira wrote:
>> On 4/1/20 12:35 PM, Peter Zijlstra wrote:
>> >> +Scheduler State Transition
>> >> +==========================
>> >> [...]
>> > And that is a prime example of why I hates RST, it pretty much mandates
>> > you view this with something other than a text editor.
>>
>> The good thing about the dot format is that we can convert it to many other
>> formats, including text:
>
> Oh, I know and love dot files, I generate them occasionally. But they
> stink as end-result, which is what it is here.
>
> If you can't read a document (or worse comment) in a code editor it's
> broken (and yes, I know some subsystems have a different opinion here).
Agreed. Feel free to use dot to draw stuff, but please include the textual
version in the doc. If you really insist on having a fancier image in the
web output, have a look at things like ditaa (or whatever equivalent is
supported in rst).
On Wed, 1 Apr 2020 04:54:03 -0700
Matthew Wilcox <[email protected]> wrote:
> > +the CPU core. The main members of the :c:type:`struct rq <rq>` are:
> > +
> > +:c:member:`nr_running`
> > + Total number of tasks on the runqueue.
>
> This seems like it should be kernel-doc so the documentation is with the
> code ... meaning it might possibly get updated when the code changes as
> opposed to languishing over here.
That was my first impression as well.
Thanks for working on the docs; I'll take a closer look soon.
jon
On 01/04/20 11:00, John Mathew wrote:
> +**Scheduling class:** It is an extensible hierarchy of scheduler modules. The
> +modules encapsulate scheduling policy details.
> +They are called from the policy-independent core code. Scheduling classes are
> +implemented through the sched_class structure.
> +fair_sched_class and rt_sched_class are implementations of this class. The
> +main members of the :c:type:`struct sched_class <sched_class>` are:
> +
> +For fair_sched_class the hooks (implemented as <function name>_fair)
> +do the following:
> +
> +:c:member:`enqueue_task`
> +  Updates the fair scheduling stats, puts the scheduling entity into the
> +  rb tree and increments the nr_running variable.
> +
> +:c:member:`dequeue_task`
> +  Moves the entity out of the rb tree when the entity is no longer runnable
> +  and decrements the nr_running variable. Also updates the fair scheduling
> +  stats.
> +
> +:c:member:`yield_task`
> + Use the buddy mechanism to skip onto the next highest priority se at
> + every level in the CFS tree, unless doing so would introduce gross unfairness
> + in CPU time distribution.
> +
> +:c:member:`check_preempt_curr`
> + Check whether the task that woke up should pre-empt the
> + running task.
> +
> +:c:member:`pick_next_task`
> + Pick the next eligible task. This may not be the left most task
> + in the rbtree. Instead a buddy system is used which provides benefits of
> + cache locality and group scheduling.
> +
> +:c:member:`task_tick`
> + Called from scheduler_tick(). Updates the runtime statistics of the
> + currently running task and checks if this task needs to be pre-empted.
> +
> +:c:member:`task_fork`
> +  Scheduler setup for a newly forked task.
> +
> +:c:member:`task_dead`
> + A task struct has one reference for the use as "current". If a task
> + dies, then it sets TASK_DEAD in tsk->state and calls schedule one last time.
> + The schedule call will never return, and the scheduled task must drop that
> + reference.
> +
I tend to agree with Matthew in that this is too much info on the current
implem. What would be useful however is some sort of documentation for the
sched_class fields themselves; as you say those are (mainly) called from
core code, so IMO what's interesting is when/why the core code calls them.
For instance highlighting the "change" cycle would be a good start, see
e.g. do_set_cpus_allowed() and what it does with {en,de}queue_task() &
{set_next,put_prev}_task().
On Wed, Apr 1, 2020 at 1:02 PM John Mathew <[email protected]> wrote:
>
> [...]
>
> John Mathew (3):
> docs: scheduler: Restructure scheduler documentation.
> docs: scheduler: Add scheduler overview documentation
> docs: scheduler: Add introduction to scheduler context-switch
>
> Documentation/scheduler/arch-specific.rst | 14 +
> Documentation/scheduler/cfs-data-structs.rst | 208 ++++++++++++++
> Documentation/scheduler/cfs-overview.rst | 46 ++++
> .../scheduler/cfs-sched-overview.rst | 17 ++
> Documentation/scheduler/context-switching.rst | 71 +++++
> Documentation/scheduler/index.rst | 31 ++-
> .../scheduler/mips-context-switch.rst | 78 ++++++
> Documentation/scheduler/overview.rst | 260 ++++++++++++++++++
> Documentation/scheduler/sched-debugging.rst | 14 +
> Documentation/scheduler/sched-features.rst | 20 ++
> Documentation/scheduler/scheduler-api.rst | 34 +++
> .../scheduler/x86-context-switch.rst | 59 ++++
> kernel/sched/core.c | 36 ++-
> 13 files changed, 867 insertions(+), 21 deletions(-)
> create mode 100644 Documentation/scheduler/arch-specific.rst
> create mode 100644 Documentation/scheduler/cfs-data-structs.rst
> create mode 100644 Documentation/scheduler/cfs-overview.rst
> create mode 100644 Documentation/scheduler/cfs-sched-overview.rst
> create mode 100644 Documentation/scheduler/context-switching.rst
> create mode 100644 Documentation/scheduler/mips-context-switch.rst
> create mode 100644 Documentation/scheduler/overview.rst
> create mode 100644 Documentation/scheduler/sched-debugging.rst
> create mode 100644 Documentation/scheduler/sched-features.rst
> create mode 100644 Documentation/scheduler/scheduler-api.rst
> create mode 100644 Documentation/scheduler/x86-context-switch.rst
>
> --
> 2.17.1
>
Thank you Peter, Daniel, Matthew, Juri, Valentin and Jonathan for taking
time to review this patchset.
Please find the summary of feedback:
Formatting and Representation
---------------------------------------
1. Fix white space damage in the kernel-doc comments added to
the scheduler function (Peter)
2. Find a format that can be displayed properly and edited in a text editor
and also looks good in the HTML view (Daniel)
3. Provide graphical representation for scheduler automata models.
4. Limit document width to 75 (Matthew)
5. Add text representation of dot format images and check if ditaa
tool can be used for graphical output (Valentin)
Referencing (of code parts)
---------------------------------------
1. Remove :c:func: directive as it is redundant. (Peter)
2. Add reference to all the arch specific implementations of resume() (Peter).
3. Add reference to the 32 bit version of assembly implementation of
__switch_to_asm (Peter)
4. Remove references to the _cond_resched() and __cond_resched_lock()
from the scheduler API reference list added in the patch. (Peter)
5. Describe rq struct member as kernel-doc comments (Matthew)
6. Add reference to scheduler journal paper about scheduler
automata model (Daniel)
Style of Writing and Wording
---------------------------------------
1. Rename running queue to runqueue (Matthew)
2. Describe cfs_rq correctly (Matthew)
3. Describe vruntime correctly (Matthew)
4. Rephrase sentences about scheduler history (Matthew)
Documentation Scope, Focus and Depth
----------------------------------------------------
1. Describe finish_task_switch() correctly (Peter)
2. Do not over-describe current implementation (Matthew)
3. Drop explanation and add reference to existing rbtree
documentation (Matthew)
4. Do not add too much info on current implementation (Valentin)
5. Add documentation for sched_class fields themselves, when/why the
core code calls them (Valentin).
6. Highlight the change cycle (Valentin)
7. Mention SCHED_DEADLINE (Juri)
8. Add pointer to sched-deadline.txt. (Juri)
Technical Understanding and Description
-----------------------------------------------------
1. Describe prepare_task_switch() correctly (Peter)
2. Describe ASID allocation correctly (Peter)
3. Describe kernel task to user space task transition correctly (Peter)
4. Check FPU affinity management feature (Peter)
We will work on all this feedback and provide an RFC PATCH v2
-John
On Wed, 1 Apr 2020 13:47:04 +0200
Daniel Bristot de Oliveira <[email protected]> wrote:
> > And that is a prime example of why I hates RST, it pretty much mandates
> > you view this with something other than a text editor.
>
> The good thing about the dot format is that we can convert it to many other
> formats, including text:
>
> [bristot@x1 ~]$ cat sched_transition.dot | graph-easy
>
> [...]
>
>
> Is there a way to also add this representation, while hiding it
> when using a graphical reader?
Better, honestly, to just put the ascii art into the doc as a literal
block. I don't see any real reason to embed Dot stuff unless there's
really no alternative.
Thanks,
jon
On 4/7/20 9:40 PM, Jonathan Corbet wrote:
> On Wed, 1 Apr 2020 13:47:04 +0200
> Daniel Bristot de Oliveira <[email protected]> wrote:
>
>>> And that is a prime example of why I hates RST, it pretty much mandates
>>> you view this with something other than a text editor.
>> The good thing about the dot format is that we can convert it to many other
>> formats, including text:
>>
>> [bristot@x1 ~]$ cat sched_transition.dot | graph-easy
>>
>> [...]
>>
>>
>> Is there a way to also add this representation, while hiding it
>> when using a graphical reader?
> Better, honestly, to just put the ascii art into the doc as a literal
> block. I don't see any real reason to embed Dot stuff unless there's
> really no alternative.
I agree.
I think that their idea was focused on a medium that could translate the
"source-code in .dot" into a graphical representation, which is good. But that
is not the case for this file and its audience.
But, maybe, it would be nice to have the .dot somewhere (not in the document, I
agree) as a "source code" for future updates.
-- Daniel