Hello,
The deadline server [1] allows RT tasks to run on a system safely, while not
wasting CPU that RT tasks may not get on an idle system due to RT throttling.
Here are patches that are mostly fixes that we found while testing out the
deadline server [1] for ChromeOS. It snowballed from 10 to 15 patches as I
found that my unit test was breaking, and then we also saw some crashes in the
field related to the dl_timer! All of that is fixed.
There is also a fix to core scheduling among several other fixes.
Appreciate a thorough review. I kept all the patches on top of Daniel's and
Peter's patches because I will let them squash it with appropriate attribution
to the contributors.
These patches are based on Daniel's preview branch for v6:
https://git.kernel.org/pub/scm/linux/kernel/git/bristot/linux.git/?h=dl_server_v6
Daniel mentioned he is working on fixing the fair server interface issues [2].
These patches apply cleanly on his preview version.
[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/[email protected]/
Joel Fernandes (Google) (12):
sched/core: Add clearing of ->dl_server in put_prev_task_balance()
sched/core: Fix priority checking for DL server picks
sched/core: Fix picking of tasks for core scheduling with DL server
sched/debug: Use unsigned long for cpu variable to prevent cast errors
selftests/sched: Add a test to verify that DL server works with core
scheduling
selftests/sched: Migrate cs_prctl_test to kselfttest
admin-guide/hw-vuln: Correct prctl() argument description
sched: Fix build error in "sched/rt: Remove default bandwidth control"
sched/deadline: Mark DL server as unthrottled before enqueue
sched/deadline: Make start_dl_timer callers more robust
sched/deadline: Do not restart the DL server on replenish from timer
sched/deadline: Always start a new period if CFS exceeded DL runtime
Suleiman Souhlal (2):
sched: server: Don't start hrtick for DL server tasks
sched/deadline: Reverse args to dl_time_before in replenish
Youssef Esmat (1):
sched/core: Clear prev->dl_server in CFS pick fast path
.../admin-guide/hw-vuln/core-scheduling.rst | 4 +-
include/linux/sched.h | 3 +-
kernel/sched/core.c | 46 +++-
kernel/sched/deadline.c | 87 ++++--
kernel/sched/debug.c | 4 +-
kernel/sched/fair.c | 22 +-
kernel/sched/rt.c | 2 +
kernel/sched/sched.h | 3 +-
tools/testing/selftests/sched/Makefile | 17 +-
tools/testing/selftests/sched/common.c | 24 ++
tools/testing/selftests/sched/common.h | 8 +
.../selftests/sched/cs_dlserver_test.c | 254 ++++++++++++++++++
tools/testing/selftests/sched/cs_prctl_test.c | 74 ++---
13 files changed, 456 insertions(+), 92 deletions(-)
create mode 100644 tools/testing/selftests/sched/common.c
create mode 100644 tools/testing/selftests/sched/common.h
create mode 100644 tools/testing/selftests/sched/cs_dlserver_test.c
--
2.34.1
Paths using put_prev_task_balance() need to do a pick shortly after. Make sure
they also clear the ->dl_server on prev as a part of that.
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
kernel/sched/core.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 973fd610d089..7f3a2596c1ed 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5998,6 +5998,14 @@ static void put_prev_task_balance(struct rq *rq, struct task_struct *prev,
#endif
put_prev_task(rq, prev);
+
+ /*
+ * We've updated @prev and no longer need the server link, clear it.
+ * Must be done before ->pick_next_task() because that can (re)set
+ * ->dl_server.
+ */
+ if (prev->dl_server)
+ prev->dl_server = NULL;
}
/*
@@ -6041,14 +6049,6 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
restart:
put_prev_task_balance(rq, prev, rf);
- /*
- * We've updated @prev and no longer need the server link, clear it.
- * Must be done before ->pick_next_task() because that can (re)set
- * ->dl_server.
- */
- if (prev->dl_server)
- prev->dl_server = NULL;
-
for_each_class(class) {
p = class->pick_next_task(rq);
if (p)
--
2.34.1
From: Youssef Esmat <[email protected]>
In case the previous pick was a DL server pick, ->dl_server might be
set. Clear it in the fast path as well.
Signed-off-by: Youssef Esmat <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
kernel/sched/core.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7f3a2596c1ed..63f41453b79e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6036,6 +6036,13 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
p = pick_next_task_idle(rq);
}
+ /*
+ * This is a normal CFS pick, but the previous could be a DL pick.
+ * Clear it as previous is no longer picked.
+ */
+ if (prev->dl_server)
+ prev->dl_server = NULL;
+
/*
* This is the fast path; it cannot be a DL server pick;
* therefore even if @p == @prev, ->dl_server must be NULL.
--
2.34.1
In core scheduling, a DL server pick (which is CFS task) should be given higher
priority than tasks in other classes.
Not doing so causes CFS starvation. A kselftest is added later to demonstrate
this. A CFS task that is competing with RT tasks can be completely starved
without this and the DL server's boosting completely ignored.
Fix these problems.
Reviewed-by: Vineeth Pillai <[email protected]>
Reported-by: Suleiman Souhlal <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
kernel/sched/core.c | 23 +++++++++++++++++++++--
1 file changed, 21 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 63f41453b79e..0a05caf9d3d7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -162,6 +162,9 @@ static inline int __task_prio(const struct task_struct *p)
if (p->sched_class == &stop_sched_class) /* trumps deadline */
return -2;
+ if (p->dl_server)
+ return -1; /* deadline */
+
if (rt_prio(p->prio)) /* includes deadline */
return p->prio; /* [-1, 99] */
@@ -191,8 +194,24 @@ static inline bool prio_less(const struct task_struct *a,
if (-pb < -pa)
return false;
- if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
- return !dl_time_before(a->dl.deadline, b->dl.deadline);
+ if (pa == -1) { /* dl_prio() doesn't work because of stop_class above */
+ const struct sched_dl_entity *a_dl, *b_dl;
+
+ a_dl = &a->dl;
+ /*
+ * Since,'a' and 'b' can be CFS tasks served by DL server,
+ * __task_prio() can return -1 (for DL) even for those. In that
+ * case, get to the dl_server's DL entity.
+ */
+ if (a->dl_server)
+ a_dl = a->dl_server;
+
+ b_dl = &b->dl;
+ if (b->dl_server)
+ b_dl = b->dl_server;
+
+ return !dl_time_before(a_dl->deadline, b_dl->deadline);
+ }
if (pa == MAX_RT_PRIO + MAX_NICE) /* fair */
return cfs_prio_less(a, b, in_fi);
--
2.34.1
* Use simple CFS pick_task for DL pick_task
DL server's pick_task calls CFS's pick_next_task_fair(), this is wrong
because core scheduling's pick_task only calls CFS's pick_task() for
evaluation / checking of the CFS task (comparing across CPUs), not for
actually affirmatively picking the next task. This causes RB tree corruption
issues in CFS that were found by syzbot.
* Make pick_task_fair clear DL server
A DL task pick might set ->dl_server, but it is possible the task will
never run (say the other HT has a stop task). If the CFS task is picked
in the future directly (say without DL server), ->dl_server will be
set. So clear it in pick_task_fair().
This fixes the KASAN issue reported by syzbot in set_next_entity().
(DL refactoring suggestions by Vineeth Pillai).
Reviewed-by: Vineeth Pillai <[email protected]>
Reported-by: Suleiman Souhlal <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
include/linux/sched.h | 3 ++-
kernel/sched/deadline.c | 27 ++++++++++++++++++++++-----
kernel/sched/fair.c | 22 ++++++++++++++++++++--
kernel/sched/sched.h | 3 ++-
4 files changed, 46 insertions(+), 9 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a1f918674383..e5ad1f232b35 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -672,7 +672,8 @@ struct sched_dl_entity {
*/
struct rq *rq;
dl_server_has_tasks_f server_has_tasks;
- dl_server_pick_f server_pick;
+ dl_server_pick_f server_pick_next;
+ dl_server_pick_f server_pick_task;
#ifdef CONFIG_RT_MUTEXES
/*
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index f5aaa3adac94..8fafe3f8b59c 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1567,11 +1567,13 @@ void dl_server_stop(struct sched_dl_entity *dl_se)
void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
dl_server_has_tasks_f has_tasks,
- dl_server_pick_f pick)
+ dl_server_pick_f pick_next,
+ dl_server_pick_f pick_task)
{
dl_se->rq = rq;
dl_se->server_has_tasks = has_tasks;
- dl_se->server_pick = pick;
+ dl_se->server_pick_next = pick_next;
+ dl_se->server_pick_task = pick_task;
}
int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init)
@@ -2271,7 +2273,12 @@ static struct sched_dl_entity *pick_next_dl_entity(struct dl_rq *dl_rq)
return __node_2_dle(left);
}
-static struct task_struct *pick_task_dl(struct rq *rq)
+/*
+ * __pick_next_task_dl - Helper to pick the next -deadline task to run.
+ * @rq: The runqueue to pick the next task from.
+ * @peek: If true, just peek at the next task. Only relevant for dlserver.
+ */
+static struct task_struct *__pick_next_task_dl(struct rq *rq, bool peek)
{
struct sched_dl_entity *dl_se;
struct dl_rq *dl_rq = &rq->dl;
@@ -2285,7 +2292,10 @@ static struct task_struct *pick_task_dl(struct rq *rq)
WARN_ON_ONCE(!dl_se);
if (dl_server(dl_se)) {
- p = dl_se->server_pick(dl_se);
+ if (IS_ENABLED(CONFIG_SMP) && peek)
+ p = dl_se->server_pick_task(dl_se);
+ else
+ p = dl_se->server_pick_next(dl_se);
if (!p) {
WARN_ON_ONCE(1);
dl_se->dl_yielded = 1;
@@ -2300,11 +2310,18 @@ static struct task_struct *pick_task_dl(struct rq *rq)
return p;
}
+#ifdef CONFIG_SMP
+static struct task_struct *pick_task_dl(struct rq *rq)
+{
+ return __pick_next_task_dl(rq, true);
+}
+#endif
+
static struct task_struct *pick_next_task_dl(struct rq *rq)
{
struct task_struct *p;
- p = pick_task_dl(rq);
+ p = __pick_next_task_dl(rq, false);
if (!p)
return p;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b48287629610..9cc528a14001 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8392,6 +8392,14 @@ static struct task_struct *pick_task_fair(struct rq *rq)
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);
+ /*
+ * This can be called from directly from CFS's ->pick_task() or indirectly
+ * from DL's ->pick_task when fair server is enabled. In the indirect case,
+ * DL will set ->dl_server just after this function is called, so its Ok to
+ * clear. In the direct case, we are picking directly so we must clear it.
+ */
+ task_of(se)->dl_server = NULL;
+
return task_of(se);
}
#endif
@@ -8551,7 +8559,16 @@ static bool fair_server_has_tasks(struct sched_dl_entity *dl_se)
return !!dl_se->rq->cfs.nr_running;
}
-static struct task_struct *fair_server_pick(struct sched_dl_entity *dl_se)
+static struct task_struct *fair_server_pick_task(struct sched_dl_entity *dl_se)
+{
+#ifdef CONFIG_SMP
+ return pick_task_fair(dl_se->rq);
+#else
+ return NULL;
+#endif
+}
+
+static struct task_struct *fair_server_pick_next(struct sched_dl_entity *dl_se)
{
return pick_next_task_fair(dl_se->rq, NULL, NULL);
}
@@ -8561,7 +8578,8 @@ void fair_server_init(struct rq *rq)
struct sched_dl_entity *dl_se = &rq->fair_server;
init_dl_entity(dl_se);
- dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick);
+ dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick_next,
+ fair_server_pick_task);
}
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4d2c216049cb..bfb15037489c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -338,7 +338,8 @@ extern void dl_server_start(struct sched_dl_entity *dl_se);
extern void dl_server_stop(struct sched_dl_entity *dl_se);
extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
dl_server_has_tasks_f has_tasks,
- dl_server_pick_f pick);
+ dl_server_pick_f pick_next,
+ dl_server_pick_f pick_task);
extern void fair_server_init(struct rq *);
extern int dl_server_apply_params(struct sched_dl_entity *dl_se,
--
2.34.1
This avoids compiler errors seen with clang:
In file included from kernel/sched/build_utility.c:72:
kernel/sched/debug.c:532:47: error: cast to 'void *' from smaller integer
type 'unsigned int' [-Werror,-Wint-to-void-pointer-cast]
debugfs_create_file("runtime", 0644, d_cpu, (void *) cpu, &fair_server_runtime_fops);
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
kernel/sched/debug.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index c6c0120ff105..2590041696bc 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -516,7 +516,7 @@ static struct dentry *debugfs_sched;
static void debugfs_fair_server_init(void)
{
struct dentry *d_fair;
- unsigned int cpu;
+ unsigned long cpu;
d_fair = debugfs_create_dir("fair_server", debugfs_sched);
if (!d_fair)
@@ -526,7 +526,7 @@ static void debugfs_fair_server_init(void)
struct dentry *d_cpu;
char buf[32];
- snprintf(buf, sizeof(buf), "cpu%d", cpu);
+ snprintf(buf, sizeof(buf), "cpu%lu", cpu);
d_cpu = debugfs_create_dir(buf, d_fair);
debugfs_create_file("runtime", 0644, d_cpu, (void *) cpu, &fair_server_runtime_fops);
--
2.34.1
From: Suleiman Souhlal <[email protected]>
Otherwise, we might start it even for tasks in a sched class that should
have it off.
Signed-off-by: Suleiman Souhlal <[email protected]>
---
kernel/sched/deadline.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 8fafe3f8b59c..5adfc15803c3 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2325,11 +2325,12 @@ static struct task_struct *pick_next_task_dl(struct rq *rq)
if (!p)
return p;
- if (!p->dl_server)
+ if (!p->dl_server) {
set_next_task_dl(rq, p, true);
- if (hrtick_enabled(rq))
- start_hrtick_dl(rq, &p->dl);
+ if (hrtick_enabled(rq))
+ start_hrtick_dl(rq, &p->dl);
+ }
return p;
}
--
2.34.1
This test verifies that DL server infrastructure gives CFS tasks a fixed
bandwidth even when RT tasks are being "core scheduled" on a core.
Verify that they are getting the expected bandwidth (and thus not being
starved).
Also verified that not having core scheduling fixes makes the test fail
as the CFS task gets no bandwidth.
Sample output:
# Runtime of PID 97 is 4.440000 seconds
# Runtime of PID 98 is 4.560000 seconds
# Runtime of PID 99 is 4.550000 seconds
ok 1 PASS
Notes about test that generated the sample output:
The test runs for 12 seconds. We check the runtimes at 9 seconds. We
expect the CFS task (PID 7) to get ~50% of the 9 seconds. The DL server
is configured for 50% bandwidth.
The RT tasks (PID 98, 99) each get 50% as well, because they run
concurrently on 2 hyperthreads of a core.
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
tools/testing/selftests/sched/Makefile | 13 +-
tools/testing/selftests/sched/common.c | 24 ++
tools/testing/selftests/sched/common.h | 8 +
.../selftests/sched/cs_dlserver_test.c | 254 ++++++++++++++++++
4 files changed, 290 insertions(+), 9 deletions(-)
create mode 100644 tools/testing/selftests/sched/common.c
create mode 100644 tools/testing/selftests/sched/common.h
create mode 100644 tools/testing/selftests/sched/cs_dlserver_test.c
diff --git a/tools/testing/selftests/sched/Makefile b/tools/testing/selftests/sched/Makefile
index 099ee9213557..f491d741cb45 100644
--- a/tools/testing/selftests/sched/Makefile
+++ b/tools/testing/selftests/sched/Makefile
@@ -1,14 +1,9 @@
# SPDX-License-Identifier: GPL-2.0+
+TEST_GEN_PROGS := cs_dlserver_test
-ifneq ($(shell $(CC) --version 2>&1 | head -n 1 | grep clang),)
-CLANG_FLAGS += -no-integrated-as
-endif
+cs_dlserver_test: cs_dlserver_test.c common.c
-CFLAGS += -O2 -Wall -g -I./ $(KHDR_INCLUDES) -Wl,-rpath=./ \
- $(CLANG_FLAGS)
-LDLIBS += -lpthread
-
-TEST_GEN_FILES := cs_prctl_test
-TEST_PROGS := cs_prctl_test
+CFLAGS += $(KHDR_INCLUDES)
+CFLAGS += -Wall
include ../lib.mk
diff --git a/tools/testing/selftests/sched/common.c b/tools/testing/selftests/sched/common.c
new file mode 100644
index 000000000000..5cf0022acc8d
--- /dev/null
+++ b/tools/testing/selftests/sched/common.c
@@ -0,0 +1,24 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "common.h"
+
+bool hyperthreading_enabled(void)
+{
+ FILE *file = fopen("/sys/devices/system/cpu/smt/active", "r");
+ char smt_active[2];
+
+ if (file == NULL) {
+ ksft_print_msg("Could not determine if hyperthreading is enabled\n");
+ return false;
+ }
+
+ if (fgets(smt_active, sizeof(smt_active), file) == NULL) {
+ perror("Failed to read smt_active");
+ return false;
+ }
+ fclose(file);
+
+ if (smt_active[0] != '1')
+ return false;
+ return true;
+}
diff --git a/tools/testing/selftests/sched/common.h b/tools/testing/selftests/sched/common.h
new file mode 100644
index 000000000000..7bcedbd0ed99
--- /dev/null
+++ b/tools/testing/selftests/sched/common.h
@@ -0,0 +1,8 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include <stdio.h>
+#include <stdbool.h>
+#include <dirent.h>
+#include "../kselftest.h"
+
+bool hyperthreading_enabled(void);
diff --git a/tools/testing/selftests/sched/cs_dlserver_test.c b/tools/testing/selftests/sched/cs_dlserver_test.c
new file mode 100644
index 000000000000..9f2a74a25686
--- /dev/null
+++ b/tools/testing/selftests/sched/cs_dlserver_test.c
@@ -0,0 +1,254 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Use the DL server infrastructure to give CFS tasks a fixed bandwidth
+ * even when RT tasks are being "core scheduled" on a core. Verify that
+ * they are getting the expected bandwidth (and thus not being starved).
+ *
+ * Copyright (c) 2024 Google.
+ * Author: Joel Fernandes <[email protected]>
+ *
+ * This library is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This library is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License
+ * for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public License
+ * along with this library; if not, see <http://www.gnu.org/licenses>.
+ */
+
+#define _GNU_SOURCE
+
+#include <stdlib.h>
+#include <unistd.h>
+#include <sched.h>
+#include <time.h>
+#include <sys/wait.h>
+#include <sys/types.h>
+#include <sys/prctl.h>
+#include <fcntl.h>
+#include <string.h>
+
+#include "common.h"
+
+enum pid_type {PIDTYPE_PID = 0, PIDTYPE_TGID, PIDTYPE_PGID};
+
+#define RUN_TIME 12 // Running time of the test in seconds
+#define CORE_ID 0 // Assuming we're pinning processes to the first core
+#define DL_SERVER_DEBUGFS "/sys/kernel/debug/sched/fair_server"
+
+void write_server_debugfs(char *file, char *type, unsigned long value)
+{
+ char path[1024], buf[1024];
+ int fd, n;
+
+ snprintf(path, sizeof(path), "%s/%s/%s", DL_SERVER_DEBUGFS, file, type);
+ fd = open(path, O_WRONLY);
+ if (fd == -1) {
+ perror("Failed to open file for writing");
+ return;
+ }
+ n = snprintf(buf, sizeof(buf), "%lu\n", value);
+ n = write(fd, buf, n);
+ if (n == -1)
+ perror("Failed to write file");
+
+ close(fd);
+}
+
+void write_dl_server_params(void)
+{
+ DIR *dir;
+ struct dirent *entry;
+
+ if (access(DL_SERVER_DEBUGFS, F_OK) == -1) {
+ perror("DL server debugfs not found, cannot set DL parameters.");
+ exit(EXIT_FAILURE);
+ }
+
+ dir = opendir(DL_SERVER_DEBUGFS);
+ if (dir == NULL) {
+ perror("Failed to open directory");
+ exit(EXIT_FAILURE);
+ }
+
+ while ((entry = readdir(dir)) != NULL) {
+ if (strcmp(entry->d_name, ".") == 0 || strcmp(entry->d_name, "..") == 0)
+ continue;
+
+ write_server_debugfs(entry->d_name, "period", 100000000);
+ write_server_debugfs(entry->d_name, "runtime", 50000000);
+ }
+ closedir(dir);
+}
+
+void process_func(void)
+{
+ unsigned long long count = 0;
+ time_t end;
+
+ // Busy loop for RUN_TIME seconds
+ end = time(NULL) + RUN_TIME;
+ while (time(NULL) < end) {
+ count++; // Just a dummy operation
+ }
+}
+
+void set_affinity(int cpu_id)
+{
+ cpu_set_t cpuset;
+
+ CPU_ZERO(&cpuset);
+ CPU_SET(cpu_id, &cpuset);
+ CPU_SET(cpu_id + 1, &cpuset);
+
+ if (sched_setaffinity(0, sizeof(cpu_set_t), &cpuset) != 0) {
+ perror("sched_setaffinity");
+ exit(EXIT_FAILURE);
+ }
+}
+
+void set_sched(int policy, int priority)
+{
+ struct sched_param param;
+
+ param.sched_priority = priority;
+ if (sched_setscheduler(0, policy, ¶m) != 0) {
+ perror("sched_setscheduler");
+ exit(EXIT_FAILURE);
+ }
+}
+
+float get_process_runtime(int pid)
+{
+ char path[256];
+ FILE *file;
+ long utime, stime;
+ int fields;
+
+ snprintf(path, sizeof(path), "/proc/%d/stat", pid);
+ file = fopen(path, "r");
+ if (file == NULL) {
+ perror("Failed to open stat file");
+ return -1; // Indicate failure
+ }
+
+ // Skip the first 13 fields and read the 14th and 15th
+ fields = fscanf(file,
+ "%*d %*s %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu",
+ &utime, &stime);
+ fclose(file);
+
+ if (fields != 2) {
+ fprintf(stderr, "Failed to read stat file\n");
+ return -1; // Indicate failure
+ }
+
+ // Calculate the total time spent in the process
+ long total_time = utime + stime;
+ long ticks_per_second = sysconf(_SC_CLK_TCK);
+ float runtime_seconds = total_time * 1.0 / ticks_per_second;
+
+ return runtime_seconds;
+}
+
+int main(void)
+{
+ float runtime1, runtime2, runtime3;
+ int pid1, pid2, pid3;
+
+ if (!hyperthreading_enabled())
+ ksft_test_result_skip("This test requires hyperthreading to be enabled\n");
+
+ write_dl_server_params();
+
+ ksft_print_header();
+ ksft_set_plan(1);
+
+ // Create and set up a CFS task
+ pid1 = fork();
+ if (pid1 == 0) {
+ set_affinity(CORE_ID);
+ process_func();
+ exit(0);
+ } else if (pid1 < 0) {
+ perror("fork for p1");
+ ksft_exit_fail();
+ }
+
+ // Create a new unique cookie for the CFS task
+ if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, pid1, PIDTYPE_TGID, 0) < 0) {
+ perror("prctl for pid1");
+ ksft_exit_fail();
+ }
+
+ // Create a new unique cookie for the current process. Future
+ // forks will inherit this cookie.
+ if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0, PIDTYPE_TGID, 0) < 0) {
+ perror("prctl for current process");
+ ksft_exit_fail();
+ }
+
+ // Create an RT task which inherits the parent's cookie
+ pid2 = fork();
+ if (pid2 == 0) {
+ set_affinity(CORE_ID);
+ set_sched(SCHED_FIFO, 50);
+ process_func();
+ exit(0);
+ } else if (pid2 < 0) {
+ perror("fork for p2");
+ ksft_exit_fail();
+ }
+
+ // Create another RT task which inherits the parent's cookie
+ pid3 = fork();
+ if (pid3 == 0) {
+ set_affinity(CORE_ID);
+ set_sched(SCHED_FIFO, 50);
+ process_func();
+ exit(0);
+ } else if (pid3 < 0) {
+ perror("fork for p3");
+ ksft_exit_fail();
+ }
+
+ sleep(RUN_TIME * 3 / 4);
+ runtime1 = get_process_runtime(pid1);
+ if (runtime1 != -1)
+ ksft_print_msg("Runtime of PID %d is %f seconds\n", pid1, runtime1);
+ else
+ ksft_exit_fail_msg("Error getting runtime for PID %d\n", pid1);
+
+ runtime2 = get_process_runtime(pid2);
+ if (runtime2 != -1)
+ ksft_print_msg("Runtime of PID %d is %f seconds\n", pid2, runtime2);
+ else
+ ksft_exit_fail_msg("Error getting runtime for PID %d\n", pid2);
+
+ runtime3 = get_process_runtime(pid3);
+ if (runtime3 != -1)
+ ksft_print_msg("Runtime of PID %d is %f seconds\n", pid3, runtime3);
+ else
+ ksft_exit_fail_msg("Error getting runtime for PID %d\n", pid3);
+
+ // Make sure runtime1 is within 30% of runtime2
+ if (runtime1 < 0.7 * runtime2 || runtime1 > 1.3 * runtime2)
+ ksft_exit_fail_msg("Runtime of PID %d is not within 30%% of runtime of PID %d\n",
+ pid1, pid2);
+
+ // Make sure runtime1 is within 30% of runtime3
+ if (runtime1 < 0.7 * runtime3 || runtime1 > 1.3 * runtime3)
+ ksft_exit_fail_msg("Runtime of PID %d is not within 30%% of runtime of PID %d\n",
+ pid1, pid3);
+
+ waitpid(pid1, NULL, 0);
+ waitpid(pid2, NULL, 0);
+ waitpid(pid3, NULL, 0);
+
+ ksft_test_result_pass("PASS\n");
+ return 0;
+}
--
2.34.1
Correct the description for arg4 which appears to be outdated.
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
Documentation/admin-guide/hw-vuln/core-scheduling.rst | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/Documentation/admin-guide/hw-vuln/core-scheduling.rst b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
index cf1eeefdfc32..338d639a8127 100644
--- a/Documentation/admin-guide/hw-vuln/core-scheduling.rst
+++ b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
@@ -62,8 +62,8 @@ arg3:
arg4:
``pid_type`` for which the operation applies. It is one of
- ``PR_SCHED_CORE_SCOPE_``-prefixed macro constants. For example, if arg4
- is ``PR_SCHED_CORE_SCOPE_THREAD_GROUP``, then the operation of this command
+ ``PIDTIME_``-prefixed macro constants. For example, if arg4
+ is ``PIDTYPE_TGID``, then the operation of this command
will be performed for all tasks in the task group of ``pid``.
arg5:
--
2.34.1
This fixes a build error introduced by "sched/rt: Remove default
bandwidth control". The issue happens because a function is unused
when !CONFIG_SMP. It could be squashed into the original patch.
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
kernel/sched/rt.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 37bee56a70f7..d3065fe35c61 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -964,8 +964,10 @@ struct rt_rq *sched_rt_period_rt_rq(struct rt_bandwidth *rt_b, int cpu)
return &cpu_rq(cpu)->rt;
}
+#ifdef CONFIG_SMP
static void __enable_runtime(struct rq *rq) { }
static void __disable_runtime(struct rq *rq) { }
+#endif
#endif /* CONFIG_RT_GROUP_SCHED */
--
2.34.1
From: Suleiman Souhlal <[email protected]>
dl_time_before() seems to be incorrectly used, we need to check that the
0-laxity time is ahead of the rq_clock. Fix it.
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
kernel/sched/deadline.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 1d54231fbaa6..dbba95d364e2 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -914,7 +914,7 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
* is in the future, throttle the server and arm the zero laxity timer.
*/
if (dl_se->dl_defer &&
- dl_time_before(dl_se->deadline - dl_se->runtime, rq_clock(rq))) {
+ dl_time_before(rq_clock(rq), dl_se->deadline - dl_se->runtime)) {
if (!is_dl_boosted(dl_se)) {
dl_se->dl_defer_armed = 1;
dl_se->dl_throttled = 1;
--
2.34.1
This test begs to be a kselftest, is in the kselftest hierarchy and does
not even use a single kselftest API. Convert it.
It simplifies some of the code and the output also looks much nicer now:
Totals: pass:17 fail:0 xfail:0 xpass:0 skip:0 error:0
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
tools/testing/selftests/sched/Makefile | 6 +-
tools/testing/selftests/sched/cs_prctl_test.c | 74 ++++++++++---------
2 files changed, 43 insertions(+), 37 deletions(-)
diff --git a/tools/testing/selftests/sched/Makefile b/tools/testing/selftests/sched/Makefile
index f491d741cb45..90c53bc1337e 100644
--- a/tools/testing/selftests/sched/Makefile
+++ b/tools/testing/selftests/sched/Makefile
@@ -1,9 +1,11 @@
# SPDX-License-Identifier: GPL-2.0+
TEST_GEN_PROGS := cs_dlserver_test
-
-cs_dlserver_test: cs_dlserver_test.c common.c
+TEST_GEN_PROGS += cs_prctl_test
CFLAGS += $(KHDR_INCLUDES)
CFLAGS += -Wall
include ../lib.mk
+
+$(OUTPUT)/cs_dlserver_test: cs_dlserver_test.c common.c
+$(OUTPUT)/cs_prctl_test: cs_prctl_test.c common.c
diff --git a/tools/testing/selftests/sched/cs_prctl_test.c b/tools/testing/selftests/sched/cs_prctl_test.c
index 7ba057154343..bb7aee703cdf 100644
--- a/tools/testing/selftests/sched/cs_prctl_test.c
+++ b/tools/testing/selftests/sched/cs_prctl_test.c
@@ -28,10 +28,11 @@
#include <unistd.h>
#include <time.h>
#include <errno.h>
-#include <stdio.h>
#include <stdlib.h>
#include <string.h>
+#include "common.h"
+
#if __GLIBC_PREREQ(2, 30) == 0
#include <sys/syscall.h>
static pid_t gettid(void)
@@ -80,7 +81,7 @@ static int _prctl(int option, unsigned long arg2, unsigned long arg3, unsigned l
int res;
res = prctl(option, arg2, arg3, arg4, arg5);
- printf("%d = prctl(%d, %ld, %ld, %ld, %lx)\n", res, option, (long)arg2, (long)arg3,
+ ksft_print_msg("%d = prctl(%d, %ld, %ld, %ld, %lx)\n", res, option, (long)arg2, (long)arg3,
(long)arg4, arg5);
return res;
}
@@ -91,21 +92,20 @@ static int _prctl(int option, unsigned long arg2, unsigned long arg3, unsigned l
static void __handle_error(char *fn, int ln, char *msg)
{
int pidx;
- printf("(%s:%d) - ", fn, ln);
+ ksft_print_msg("(%s:%d) - ", fn, ln);
perror(msg);
if (need_cleanup) {
for (pidx = 0; pidx < num_processes; ++pidx)
kill(procs[pidx].cpid, 15);
need_cleanup = 0;
}
- exit(EXIT_FAILURE);
+ ksft_exit_fail();
}
static void handle_usage(int rc, char *msg)
{
- puts(USAGE);
- puts(msg);
- putchar('\n');
+ ksft_print_msg("%s\n", USAGE);
+ ksft_print_msg("%s\n\n", msg);
exit(rc);
}
@@ -117,7 +117,7 @@ static unsigned long get_cs_cookie(int pid)
ret = prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, pid, PIDTYPE_PID,
(unsigned long)&cookie);
if (ret) {
- printf("Not a core sched system\n");
+ ksft_print_msg("Not a core sched system\n");
return -1UL;
}
@@ -160,7 +160,7 @@ static int child_func_process(void *arg)
ret = write(ca->pfd[1], &ca->thr_tids, sizeof(int) * ca->num_threads);
if (ret == -1)
- printf("write failed on pfd[%d] - error (%s)\n",
+ ksft_print_msg("write failed on pfd[%d] - error (%s)\n",
ca->pfd[1], strerror(errno));
close(ca->pfd[1]);
@@ -192,7 +192,7 @@ void create_processes(int num_processes, int num_threads, struct child_args proc
for (i = 0; i < num_processes; ++i) {
ret = read(proc[i].pfd[0], &proc[i].thr_tids, sizeof(int) * proc[i].num_threads);
if (ret == -1)
- printf("read failed on proc[%d].pfd[0] error (%s)\n",
+ ksft_print_msg("read failed on proc[%d].pfd[0] error (%s)\n",
i, strerror(errno));
close(proc[i].pfd[0]);
}
@@ -202,30 +202,29 @@ void disp_processes(int num_processes, struct child_args proc[])
{
int i, j;
- printf("tid=%d, / tgid=%d / pgid=%d: %lx\n", gettid(), getpid(), getpgid(0),
+ ksft_print_msg("tid=%d, / tgid=%d / pgid=%d: %lx\n", gettid(), getpid(), getpgid(0),
get_cs_cookie(getpid()));
for (i = 0; i < num_processes; ++i) {
- printf(" tid=%d, / tgid=%d / pgid=%d: %lx\n", proc[i].cpid, proc[i].cpid,
+ ksft_print_msg(" tid=%d, / tgid=%d / pgid=%d: %lx\n", proc[i].cpid, proc[i].cpid,
getpgid(proc[i].cpid), get_cs_cookie(proc[i].cpid));
for (j = 0; j < proc[i].num_threads; ++j) {
- printf(" tid=%d, / tgid=%d / pgid=%d: %lx\n", proc[i].thr_tids[j],
+ ksft_print_msg(" tid=%d, / tgid=%d / pgid=%d: %lx\n", proc[i].thr_tids[j],
proc[i].cpid, getpgid(0), get_cs_cookie(proc[i].thr_tids[j]));
}
}
puts("\n");
}
-static int errors;
-
#define validate(v) _validate(__LINE__, v, #v)
void _validate(int line, int val, char *msg)
{
if (!val) {
- ++errors;
- printf("(%d) FAILED: %s\n", line, msg);
+ ksft_print_msg("(%d) FAILED: %s\n", line, msg);
+ ksft_inc_fail_cnt();
} else {
- printf("(%d) PASSED: %s\n", line, msg);
+ ksft_print_msg("(%d) PASSED: %s\n", line, msg);
+ ksft_inc_pass_cnt();
}
}
@@ -254,13 +253,17 @@ int main(int argc, char *argv[])
keypress = 1;
break;
case 'h':
- printf(USAGE);
+ ksft_print_msg(USAGE);
exit(EXIT_SUCCESS);
default:
handle_usage(20, "unknown option");
}
}
+ if (!hyperthreading_enabled()) {
+ ksft_exit_skip("This test requires hyperthreading to be enabled\n");
+ }
+
if (num_processes < 1 || num_processes > MAX_PROCESSES)
handle_usage(1, "Bad processes value");
@@ -272,17 +275,22 @@ int main(int argc, char *argv[])
srand(time(NULL));
- /* put into separate process group */
+ /* Put into separate process group */
if (setpgid(0, 0) != 0)
handle_error("process group");
- printf("\n## Create a thread/process/process group hiearchy\n");
+ ksft_print_header();
+
+ /* Increase the count if adding more validate() statements. */
+ ksft_set_plan(17);
+
+ ksft_print_msg("\n## Create a thread/process/process group hiearchy\n");
create_processes(num_processes, num_threads, procs);
need_cleanup = 1;
disp_processes(num_processes, procs);
validate(get_cs_cookie(0) == 0);
- printf("\n## Set a cookie on entire process group\n");
+ ksft_print_msg("\n## Set a cookie on entire process group\n");
if (_prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0, PIDTYPE_PGID, 0) < 0)
handle_error("core_sched create failed -- PGID");
disp_processes(num_processes, procs);
@@ -296,7 +304,7 @@ int main(int argc, char *argv[])
validate(get_cs_cookie(0) == get_cs_cookie(pid));
validate(get_cs_cookie(0) == get_cs_cookie(procs[pidx].thr_tids[0]));
- printf("\n## Set a new cookie on entire process/TGID [%d]\n", pid);
+ ksft_print_msg("\n## Set a new cookie on entire process/TGID [%d]\n", pid);
if (_prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, pid, PIDTYPE_TGID, 0) < 0)
handle_error("core_sched create failed -- TGID");
disp_processes(num_processes, procs);
@@ -305,7 +313,7 @@ int main(int argc, char *argv[])
validate(get_cs_cookie(pid) != 0);
validate(get_cs_cookie(pid) == get_cs_cookie(procs[pidx].thr_tids[0]));
- printf("\n## Copy the cookie of current/PGID[%d], to pid [%d] as PIDTYPE_PID\n",
+ ksft_print_msg("\n## Copy the cookie of current/PGID[%d], to pid [%d] as PIDTYPE_PID\n",
getpid(), pid);
if (_prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, pid, PIDTYPE_PID, 0) < 0)
handle_error("core_sched share to itself failed -- PID");
@@ -315,7 +323,7 @@ int main(int argc, char *argv[])
validate(get_cs_cookie(pid) != 0);
validate(get_cs_cookie(pid) != get_cs_cookie(procs[pidx].thr_tids[0]));
- printf("\n## Copy cookie from a thread [%d] to current/PGID [%d] as PIDTYPE_PID\n",
+ ksft_print_msg("\n## Copy cookie from a thread [%d] to current/PGID [%d] as PIDTYPE_PID\n",
procs[pidx].thr_tids[0], getpid());
if (_prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM, procs[pidx].thr_tids[0],
PIDTYPE_PID, 0) < 0)
@@ -325,7 +333,7 @@ int main(int argc, char *argv[])
validate(get_cs_cookie(0) == get_cs_cookie(procs[pidx].thr_tids[0]));
validate(get_cs_cookie(pid) != get_cs_cookie(procs[pidx].thr_tids[0]));
- printf("\n## Copy cookie from current [%d] to current as pidtype PGID\n", getpid());
+ ksft_print_msg("\n## Copy cookie from current [%d] to current as pidtype PGID\n", getpid());
if (_prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, 0, PIDTYPE_PGID, 0) < 0)
handle_error("core_sched share to self failed -- PGID");
disp_processes(num_processes, procs);
@@ -340,20 +348,16 @@ int main(int argc, char *argv[])
validate(_prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, 0, PIDTYPE_PGID, 1) < 0
&& errno == EINVAL);
- if (errors) {
- printf("TESTS FAILED. errors: %d\n", errors);
- res = 10;
- } else {
- printf("SUCCESS !!!\n");
- }
-
- if (keypress)
+ if (keypress) {
+ ksft_print_msg("Waiting for keypress to exit\n");
getchar();
- else
+ } else {
sleep(delay);
+ }
for (pidx = 0; pidx < num_processes; ++pidx)
kill(procs[pidx].cpid, 15);
+ ksft_finished();
return res;
}
--
2.34.1
For whatever reason, if start_dl_timer() returned 0 during replenish (it
did not start a new timer), then do not marked dl_defer_armed, because
we never really armed.
Further, we need to cancel any old timers,
This is similar to what dl_check_constrained_dl() does.
Add some guardrails for such situations.
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
kernel/sched/deadline.c | 20 ++++++++++++++++++--
1 file changed, 18 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index dbba95d364e2..e978e299381c 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -918,7 +918,16 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
if (!is_dl_boosted(dl_se)) {
dl_se->dl_defer_armed = 1;
dl_se->dl_throttled = 1;
- start_dl_timer(dl_se);
+ if (!start_dl_timer(dl_se)) {
+ /*
+ * If for whatever reason (delays), if a previous timer was
+ * queued but not serviced, cancel it.
+ */
+ hrtimer_try_to_cancel(&dl_se->dl_timer);
+ dl_se->dl_defer_armed = 0;
+ dl_se->dl_throttled = 0;
+ return;
+ }
}
}
}
@@ -1465,7 +1474,14 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
hrtimer_try_to_cancel(&dl_se->dl_timer);
replenish_dl_new_period(dl_se, dl_se->rq);
- start_dl_timer(dl_se);
+
+ /*
+ * Not being able to start the timer seems problematic. If it could not
+ * be started for whatever reason, we need to "unthrottle" the DL server
+ * and queue right away. Otherwise nothing might queue it. That's similar
+ * to what enqueue_dl_entity() does on start_dl_timer==0. For now, just warn.
+ */
+ WARN_ON_ONCE(!start_dl_timer(dl_se));
return;
}
--
2.34.1
The DL server may not have had its timer started if start_dl_timer()
returns 0 (say the zero-laxity time has already passed). In such cases,
mark the DL task which is about to be enqueued as not throttled and
cancel any previous timers, then do the enqueue.
This fixes the following crash:
[ 9.263331] kernel BUG at kernel/sched/deadline.c:1765!
[ 9.282382] Call Trace:
[ 9.282767] <TASK>
[ 9.283086] ? __die_body+0x62/0xb0
[ 9.283602] ? die+0x9b/0xc0
[ 9.284036] ? do_trap+0xa3/0x170
[ 9.284528] ? enqueue_dl_entity+0x45e/0x460
[ 9.285158] ? enqueue_dl_entity+0x45e/0x460
[ 9.285791] ? handle_invalid_op+0x65/0x80
[ 9.286392] ? enqueue_dl_entity+0x45e/0x460
[ 9.287021] ? exc_invalid_op+0x2f/0x40
[ 9.287585] ? asm_exc_invalid_op+0x16/0x20
[ 9.288200] ? find_later_rq+0x120/0x120
[ 9.288775] ? fair_server_init+0x40/0x40
[ 9.289364] ? enqueue_dl_entity+0x45e/0x460
[ 9.289989] ? find_later_rq+0x120/0x120
[ 9.290564] dl_task_timer+0x1d7/0x2f0
[ 9.291120] ? find_later_rq+0x120/0x120
[ 9.291695] __run_hrtimer+0x73/0x1b0
[ 9.292238] hrtimer_interrupt+0x216/0x2c0
[ 9.292841] __sysvec_apic_timer_interrupt+0x53/0x140
[ 9.293581] sysvec_apic_timer_interrupt+0x2d/0x80
[ 9.294285] asm_sysvec_apic_timer_interrupt+0x16/0x20
The crash can easily be reproduced by adding a 100ms delay as follows:
+int delay_inject_count;
+
static void
enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
{
@@ -1827,6 +1830,12 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
setup_new_dl_entity(dl_se);
}
+ // 100ms delay every 20 enqueues.
+ if (delay_inject_count++ > 20) {
+ mdelay(100);
+ delay_inject_count = 0;
+ }
+
/*
* If we are still throttled, eg. we got replenished but are a
* zero-laxity task and still got to wait, don't enqueue.
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
kernel/sched/deadline.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 5adfc15803c3..1d54231fbaa6 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1949,6 +1949,18 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
if (dl_se->dl_throttled && start_dl_timer(dl_se))
return;
+ /*
+ * We're about to enqueue, make sure we're not ->dl_throttled!
+ * In case the timer was not started, say because the 0-lax time
+ * has passed, mark as not throttled and mark unarmed.
+ * Also cancel earlier timers, since letting those run is pointless.
+ */
+ if (dl_se->dl_throttled) {
+ hrtimer_try_to_cancel(&dl_se->dl_timer);
+ dl_se->dl_defer_armed = 0;
+ dl_se->dl_throttled = 0;
+ }
+
__enqueue_dl_entity(dl_se);
}
--
2.34.1
There is no point in doing so if there are no CFS tasks to run. If there
were, we would be doing ENQUEUE_REPLENISH from the dl_task_timer,
instead of calling replenish_dl_entity(). Fix that.
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
kernel/sched/deadline.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index e978e299381c..179369d27f66 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -915,7 +915,7 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
*/
if (dl_se->dl_defer &&
dl_time_before(rq_clock(rq), dl_se->deadline - dl_se->runtime)) {
- if (!is_dl_boosted(dl_se)) {
+ if (!is_dl_boosted(dl_se) && dl_se->server_has_tasks(dl_se)) {
dl_se->dl_defer_armed = 1;
dl_se->dl_throttled = 1;
if (!start_dl_timer(dl_se)) {
--
2.34.1
We believe that this is the right thing to do. The unit test
(cs_dlserver_test) also agrees. If we let the CFS run without starting a
new period, while the server is regularly throttled, then the test fails
because CFS does not appear to get enough bandwidth.
Intuitively, this makes sense to do as well. If CFS used up all the CFS
bandwidth, while the DL server was in a throttled state, it got the
bandwidth it wanted and some. Now, we can start all over from scratch to
guarantee it a minimum bandwidth.
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
kernel/sched/deadline.c | 17 -----------------
1 file changed, 17 deletions(-)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 179369d27f66..a0ea668ac1bf 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1454,23 +1454,6 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
* starting a new period, pushing the activation to the zero-lax time.
*/
if (dl_se->dl_defer && dl_se->dl_throttled && dl_runtime_exceeded(dl_se)) {
- s64 runtime_diff = dl_se->runtime + dl_se->dl_runtime;
-
- /*
- * If this is a regular throttling case, let it run negative until
- * the dl_runtime - runtime > 0. The reason being is that the next
- * replenishment will result in a positive runtime one period ahead.
- *
- * Otherwise, the deadline will be pushed more than one period, not
- * providing runtime/period anymore.
- *
- * If the dl_runtime - runtime < 0, then the server was able to get
- * the runtime/period before the replenishment. So it is safe
- * to start a new deffered period.
- */
- if (!dl_se->dl_defer_armed && runtime_diff > 0)
- return;
-
hrtimer_try_to_cancel(&dl_se->dl_timer);
replenish_dl_new_period(dl_se, dl_se->rq);
--
2.34.1
On 3/12/24 21:24, Joel Fernandes (Google) wrote:
> This test begs to be a kselftest, is in the kselftest hierarchy and does
> not even use a single kselftest API. Convert it.
>
> It simplifies some of the code and the output also looks much nicer now:
>
> Totals: pass:17 fail:0 xfail:0 xpass:0 skip:0 error:0
>
> Signed-off-by: Joel Fernandes (Google) <[email protected]>
Reviewed-by: Chris Hyser <[email protected]>
> ---
> tools/testing/selftests/sched/Makefile | 6 +-
> tools/testing/selftests/sched/cs_prctl_test.c | 74 ++++++++++---------
> 2 files changed, 43 insertions(+), 37 deletions(-)
>
> diff --git a/tools/testing/selftests/sched/Makefile b/tools/testing/selftests/sched/Makefile
> index f491d741cb45..90c53bc1337e 100644
> --- a/tools/testing/selftests/sched/Makefile
> +++ b/tools/testing/selftests/sched/Makefile
> @@ -1,9 +1,11 @@
> # SPDX-License-Identifier: GPL-2.0+
> TEST_GEN_PROGS := cs_dlserver_test
> -
> -cs_dlserver_test: cs_dlserver_test.c common.c
> +TEST_GEN_PROGS += cs_prctl_test
>
> CFLAGS += $(KHDR_INCLUDES)
> CFLAGS += -Wall
>
> include ../lib.mk
> +
> +$(OUTPUT)/cs_dlserver_test: cs_dlserver_test.c common.c
> +$(OUTPUT)/cs_prctl_test: cs_prctl_test.c common.c
> diff --git a/tools/testing/selftests/sched/cs_prctl_test.c b/tools/testing/selftests/sched/cs_prctl_test.c
> index 7ba057154343..bb7aee703cdf 100644
> --- a/tools/testing/selftests/sched/cs_prctl_test.c
> +++ b/tools/testing/selftests/sched/cs_prctl_test.c
> @@ -28,10 +28,11 @@
> #include <unistd.h>
> #include <time.h>
> #include <errno.h>
> -#include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
>
> +#include "common.h"
> +
> #if __GLIBC_PREREQ(2, 30) == 0
> #include <sys/syscall.h>
> static pid_t gettid(void)
> @@ -80,7 +81,7 @@ static int _prctl(int option, unsigned long arg2, unsigned long arg3, unsigned l
> int res;
>
> res = prctl(option, arg2, arg3, arg4, arg5);
> - printf("%d = prctl(%d, %ld, %ld, %ld, %lx)\n", res, option, (long)arg2, (long)arg3,
> + ksft_print_msg("%d = prctl(%d, %ld, %ld, %ld, %lx)\n", res, option, (long)arg2, (long)arg3,
> (long)arg4, arg5);
> return res;
> }
> @@ -91,21 +92,20 @@ static int _prctl(int option, unsigned long arg2, unsigned long arg3, unsigned l
> static void __handle_error(char *fn, int ln, char *msg)
> {
> int pidx;
> - printf("(%s:%d) - ", fn, ln);
> + ksft_print_msg("(%s:%d) - ", fn, ln);
> perror(msg);
> if (need_cleanup) {
> for (pidx = 0; pidx < num_processes; ++pidx)
> kill(procs[pidx].cpid, 15);
> need_cleanup = 0;
> }
> - exit(EXIT_FAILURE);
> + ksft_exit_fail();
> }
>
> static void handle_usage(int rc, char *msg)
> {
> - puts(USAGE);
> - puts(msg);
> - putchar('\n');
> + ksft_print_msg("%s\n", USAGE);
> + ksft_print_msg("%s\n\n", msg);
> exit(rc);
> }
>
> @@ -117,7 +117,7 @@ static unsigned long get_cs_cookie(int pid)
> ret = prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, pid, PIDTYPE_PID,
> (unsigned long)&cookie);
> if (ret) {
> - printf("Not a core sched system\n");
> + ksft_print_msg("Not a core sched system\n");
> return -1UL;
> }
>
> @@ -160,7 +160,7 @@ static int child_func_process(void *arg)
>
> ret = write(ca->pfd[1], &ca->thr_tids, sizeof(int) * ca->num_threads);
> if (ret == -1)
> - printf("write failed on pfd[%d] - error (%s)\n",
> + ksft_print_msg("write failed on pfd[%d] - error (%s)\n",
> ca->pfd[1], strerror(errno));
>
> close(ca->pfd[1]);
> @@ -192,7 +192,7 @@ void create_processes(int num_processes, int num_threads, struct child_args proc
> for (i = 0; i < num_processes; ++i) {
> ret = read(proc[i].pfd[0], &proc[i].thr_tids, sizeof(int) * proc[i].num_threads);
> if (ret == -1)
> - printf("read failed on proc[%d].pfd[0] error (%s)\n",
> + ksft_print_msg("read failed on proc[%d].pfd[0] error (%s)\n",
> i, strerror(errno));
> close(proc[i].pfd[0]);
> }
> @@ -202,30 +202,29 @@ void disp_processes(int num_processes, struct child_args proc[])
> {
> int i, j;
>
> - printf("tid=%d, / tgid=%d / pgid=%d: %lx\n", gettid(), getpid(), getpgid(0),
> + ksft_print_msg("tid=%d, / tgid=%d / pgid=%d: %lx\n", gettid(), getpid(), getpgid(0),
> get_cs_cookie(getpid()));
>
> for (i = 0; i < num_processes; ++i) {
> - printf(" tid=%d, / tgid=%d / pgid=%d: %lx\n", proc[i].cpid, proc[i].cpid,
> + ksft_print_msg(" tid=%d, / tgid=%d / pgid=%d: %lx\n", proc[i].cpid, proc[i].cpid,
> getpgid(proc[i].cpid), get_cs_cookie(proc[i].cpid));
> for (j = 0; j < proc[i].num_threads; ++j) {
> - printf(" tid=%d, / tgid=%d / pgid=%d: %lx\n", proc[i].thr_tids[j],
> + ksft_print_msg(" tid=%d, / tgid=%d / pgid=%d: %lx\n", proc[i].thr_tids[j],
> proc[i].cpid, getpgid(0), get_cs_cookie(proc[i].thr_tids[j]));
> }
> }
> puts("\n");
> }
>
> -static int errors;
> -
> #define validate(v) _validate(__LINE__, v, #v)
> void _validate(int line, int val, char *msg)
> {
> if (!val) {
> - ++errors;
> - printf("(%d) FAILED: %s\n", line, msg);
> + ksft_print_msg("(%d) FAILED: %s\n", line, msg);
> + ksft_inc_fail_cnt();
> } else {
> - printf("(%d) PASSED: %s\n", line, msg);
> + ksft_print_msg("(%d) PASSED: %s\n", line, msg);
> + ksft_inc_pass_cnt();
> }
> }
>
> @@ -254,13 +253,17 @@ int main(int argc, char *argv[])
> keypress = 1;
> break;
> case 'h':
> - printf(USAGE);
> + ksft_print_msg(USAGE);
> exit(EXIT_SUCCESS);
> default:
> handle_usage(20, "unknown option");
> }
> }
>
> + if (!hyperthreading_enabled()) {
> + ksft_exit_skip("This test requires hyperthreading to be enabled\n");
> + }
> +
> if (num_processes < 1 || num_processes > MAX_PROCESSES)
> handle_usage(1, "Bad processes value");
>
> @@ -272,17 +275,22 @@ int main(int argc, char *argv[])
>
> srand(time(NULL));
>
> - /* put into separate process group */
> + /* Put into separate process group */
> if (setpgid(0, 0) != 0)
> handle_error("process group");
>
> - printf("\n## Create a thread/process/process group hiearchy\n");
> + ksft_print_header();
> +
> + /* Increase the count if adding more validate() statements. */
> + ksft_set_plan(17);
> +
> + ksft_print_msg("\n## Create a thread/process/process group hiearchy\n");
> create_processes(num_processes, num_threads, procs);
> need_cleanup = 1;
> disp_processes(num_processes, procs);
> validate(get_cs_cookie(0) == 0);
>
> - printf("\n## Set a cookie on entire process group\n");
> + ksft_print_msg("\n## Set a cookie on entire process group\n");
> if (_prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0, PIDTYPE_PGID, 0) < 0)
> handle_error("core_sched create failed -- PGID");
> disp_processes(num_processes, procs);
> @@ -296,7 +304,7 @@ int main(int argc, char *argv[])
> validate(get_cs_cookie(0) == get_cs_cookie(pid));
> validate(get_cs_cookie(0) == get_cs_cookie(procs[pidx].thr_tids[0]));
>
> - printf("\n## Set a new cookie on entire process/TGID [%d]\n", pid);
> + ksft_print_msg("\n## Set a new cookie on entire process/TGID [%d]\n", pid);
> if (_prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, pid, PIDTYPE_TGID, 0) < 0)
> handle_error("core_sched create failed -- TGID");
> disp_processes(num_processes, procs);
> @@ -305,7 +313,7 @@ int main(int argc, char *argv[])
> validate(get_cs_cookie(pid) != 0);
> validate(get_cs_cookie(pid) == get_cs_cookie(procs[pidx].thr_tids[0]));
>
> - printf("\n## Copy the cookie of current/PGID[%d], to pid [%d] as PIDTYPE_PID\n",
> + ksft_print_msg("\n## Copy the cookie of current/PGID[%d], to pid [%d] as PIDTYPE_PID\n",
> getpid(), pid);
> if (_prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, pid, PIDTYPE_PID, 0) < 0)
> handle_error("core_sched share to itself failed -- PID");
> @@ -315,7 +323,7 @@ int main(int argc, char *argv[])
> validate(get_cs_cookie(pid) != 0);
> validate(get_cs_cookie(pid) != get_cs_cookie(procs[pidx].thr_tids[0]));
>
> - printf("\n## Copy cookie from a thread [%d] to current/PGID [%d] as PIDTYPE_PID\n",
> + ksft_print_msg("\n## Copy cookie from a thread [%d] to current/PGID [%d] as PIDTYPE_PID\n",
> procs[pidx].thr_tids[0], getpid());
> if (_prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM, procs[pidx].thr_tids[0],
> PIDTYPE_PID, 0) < 0)
> @@ -325,7 +333,7 @@ int main(int argc, char *argv[])
> validate(get_cs_cookie(0) == get_cs_cookie(procs[pidx].thr_tids[0]));
> validate(get_cs_cookie(pid) != get_cs_cookie(procs[pidx].thr_tids[0]));
>
> - printf("\n## Copy cookie from current [%d] to current as pidtype PGID\n", getpid());
> + ksft_print_msg("\n## Copy cookie from current [%d] to current as pidtype PGID\n", getpid());
> if (_prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, 0, PIDTYPE_PGID, 0) < 0)
> handle_error("core_sched share to self failed -- PGID");
> disp_processes(num_processes, procs);
> @@ -340,20 +348,16 @@ int main(int argc, char *argv[])
> validate(_prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, 0, PIDTYPE_PGID, 1) < 0
> && errno == EINVAL);
>
> - if (errors) {
> - printf("TESTS FAILED. errors: %d\n", errors);
> - res = 10;
> - } else {
> - printf("SUCCESS !!!\n");
> - }
> -
> - if (keypress)
> + if (keypress) {
> + ksft_print_msg("Waiting for keypress to exit\n");
> getchar();
> - else
> + } else {
> sleep(delay);
> + }
>
> for (pidx = 0; pidx < num_processes; ++pidx)
> kill(procs[pidx].cpid, 15);
>
> + ksft_finished();
> return res;
> }
On 3/12/24 21:24, Joel Fernandes (Google) wrote:
> Correct the description for arg4 which appears to be outdated.
Technically, the constants referenced in the documentation are still
defined in prctl.h, and there are BUILD_BUG_ON() checks to ensure they
match their PIDTYPE_ equivalents. I would presume these should be left
in as it is technically a defined interface and user code could break.
> Signed-off-by: Joel Fernandes (Google) <[email protected]>
> ---
> Documentation/admin-guide/hw-vuln/core-scheduling.rst | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/admin-guide/hw-vuln/core-scheduling.rst b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
> index cf1eeefdfc32..338d639a8127 100644
> --- a/Documentation/admin-guide/hw-vuln/core-scheduling.rst
> +++ b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
> @@ -62,8 +62,8 @@ arg3:
>
> arg4:
> ``pid_type`` for which the operation applies. It is one of
> - ``PR_SCHED_CORE_SCOPE_``-prefixed macro constants. For example, if arg4
> - is ``PR_SCHED_CORE_SCOPE_THREAD_GROUP``, then the operation of this command
> + ``PIDTIME_``-prefixed macro constants. For example, if arg4
Presumably you mean PIDTYPE_ here? Otherwise looks good and in
retrospect this is probably what should have been done in the first place.
> + is ``PIDTYPE_TGID``, then the operation of this command
> will be performed for all tasks in the task group of ``pid``.
>
> arg5:
Reviewed-by: Chris Hyser <[email protected]>
Sorry. This is a resend that should have better formatting.
On 3/13/24 15:14, Chris Hyser wrote:
>
> On 3/12/24 21:24, Joel Fernandes (Google) wrote:
>> Correct the description for arg4 which appears to be outdated.
>
> Technically, the constants referenced in the documentation are still
> defined in prctl.h, and there are BUILD_BUG_ON() checks to ensure they
> match their PIDTYPE_ equivalents. I would presume these should be left
> in as it is technically a defined interface and user code could break.
>
>
>> Signed-off-by: Joel Fernandes (Google) <[email protected]>
>> ---
>> Documentation/admin-guide/hw-vuln/core-scheduling.rst | 4 ++--
>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/hw-vuln/core-scheduling.rst
>> b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
>> index cf1eeefdfc32..338d639a8127 100644
>> --- a/Documentation/admin-guide/hw-vuln/core-scheduling.rst
>> +++ b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
>> @@ -62,8 +62,8 @@ arg3:
>> arg4:
>> ``pid_type`` for which the operation applies. It is one of
>> - ``PR_SCHED_CORE_SCOPE_``-prefixed macro constants. For example,
>> if arg4
>> - is ``PR_SCHED_CORE_SCOPE_THREAD_GROUP``, then the operation of
>> this command
>> + ``PIDTIME_``-prefixed macro constants. For example, if arg4
Presumably you mean PIDTYPE_ here? Otherwise looks good and in
retrospect this is probably what should have been done in the first place.
>> + is ``PIDTYPE_TGID``, then the operation of this command
>> will be performed for all tasks in the task group of ``pid``.
>> arg5:
>
>
> Reviewed-by: Chris Hyser <[email protected]>
>
On 3/12/24 21:24, Joel Fernandes (Google) wrote:
> This avoids compiler errors seen with clang:
>
> In file included from kernel/sched/build_utility.c:72:
> kernel/sched/debug.c:532:47: error: cast to 'void *' from smaller integer
> type 'unsigned int' [-Werror,-Wint-to-void-pointer-cast]
> debugfs_create_file("runtime", 0644, d_cpu, (void *) cpu, &fair_server_runtime_fops);
>
> Signed-off-by: Joel Fernandes (Google) <[email protected]>
Reviewed-by: Chris Hyser <[email protected]>
> ---
> kernel/sched/debug.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index c6c0120ff105..2590041696bc 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -516,7 +516,7 @@ static struct dentry *debugfs_sched;
> static void debugfs_fair_server_init(void)
> {
> struct dentry *d_fair;
> - unsigned int cpu;
> + unsigned long cpu;
>
> d_fair = debugfs_create_dir("fair_server", debugfs_sched);
> if (!d_fair)
> @@ -526,7 +526,7 @@ static void debugfs_fair_server_init(void)
> struct dentry *d_cpu;
> char buf[32];
>
> - snprintf(buf, sizeof(buf), "cpu%d", cpu);
> + snprintf(buf, sizeof(buf), "cpu%lu", cpu);
> d_cpu = debugfs_create_dir(buf, d_fair);
>
> debugfs_create_file("runtime", 0644, d_cpu, (void *) cpu, &fair_server_runtime_fops);
On 3/12/24 21:24, Joel Fernandes (Google) wrote:
> This fixes a build error introduced by "sched/rt: Remove default
> bandwidth control". The issue happens because a function is unused
> when !CONFIG_SMP. It could be squashed into the original patch.
>
> Signed-off-by: Joel Fernandes (Google) <[email protected]>
Reviewed-by: Chris Hyser <[email protected]>
> ---
> kernel/sched/rt.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 37bee56a70f7..d3065fe35c61 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -964,8 +964,10 @@ struct rt_rq *sched_rt_period_rt_rq(struct rt_bandwidth *rt_b, int cpu)
> return &cpu_rq(cpu)->rt;
> }
>
> +#ifdef CONFIG_SMP
> static void __enable_runtime(struct rq *rq) { }
> static void __disable_runtime(struct rq *rq) { }
> +#endif
>
> #endif /* CONFIG_RT_GROUP_SCHED */
>
On 3/13/24 02:24, Joel Fernandes (Google) wrote:
> Paths using put_prev_task_balance() need to do a pick shortly after. Make sure
> they also clear the ->dl_server on prev as a part of that.
> > Signed-off-by: Joel Fernandes (Google) <[email protected]>
Makes sense. Added to the v6.
-- Daniel
On 3/13/24 02:24, Joel Fernandes (Google) wrote:
> From: Youssef Esmat <[email protected]>
>
> In case the previous pick was a DL server pick, ->dl_server might be
> set. Clear it in the fast path as well.
>
> Signed-off-by: Youssef Esmat <[email protected]>
Makes sense, I added the code to the v6.... embedded in the patch it fixes.
Thanks!
-- Daniel
On 3/13/24 02:24, Joel Fernandes (Google) wrote:
> From: Suleiman Souhlal <[email protected]>
>
> Otherwise, we might start it even for tasks in a sched class that should
> have it off.
>
> Signed-off-by: Suleiman Souhlal <[email protected]>
> ---
> kernel/sched/deadline.c | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 8fafe3f8b59c..5adfc15803c3 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -2325,11 +2325,12 @@ static struct task_struct *pick_next_task_dl(struct rq *rq)
> if (!p)
> return p;
>
> - if (!p->dl_server)
> + if (!p->dl_server) {
> set_next_task_dl(rq, p, true);
>
> - if (hrtick_enabled(rq))
> - start_hrtick_dl(rq, &p->dl);
> + if (hrtick_enabled(rq))
> + start_hrtick_dl(rq, &p->dl);
> + }
The rational for having hrtick for the dl server too is the following:
This hrtick is reponsible for adding a hr timer for throttling. The throttling
serves as a protection for the other tasks, to avoid them missing their deadlines
because the current task overun for too long. Like, a task with 200us of runtime
actually running for 1 ms because of the non-hr-tick.
For example, let's get the case we have at red hat, where we want to use the
dl server to provide a minimum bandwidth to avoid starvation, keeping the noise
on real-time tasks low.
On this case, we will set the runtime for the fair server with number as low
as 20 us... 40 us. With hrtick, when the fair server is enqueued, it will be
throttle in the us scale... Without the hrtick, the server can run for an entire
tick before being throttled.
here is an example of this scenario using osnoise with/withoutout arming the hrtick
for the server that was set for 20 us of runtime:
There is a kernel compilation in CPU 1. Then osnoise is dispatched as fifo,
on CPU 1.
# osnoise -c 1 -P f:1
removing hrtick:
############
duration: 0 00:02:00 | time is in us
CPU Period Runtime Noise % CPU Aval Max Noise Max Single HW NMI IRQ Softirq Thread
1 #119 119000000 6482 99.99455 1220 1009 0 0 244 59 50
############
See max single noise... it when longer than the 20 us i've set...
With hrtick:
############
duration: 0 00:02:30 | time is in us
CPU Period Runtime Noise % CPU Aval Max Noise Max Single HW NMI IRQ Softirq Thread
1 #149 149000472 3730 99.99749 33 33 0 0 641 3 155
############
the max single goes down to 33 us. It is not exactly 20 because there is
one timer to enqueue the server, and another timer to throttle it. Still,
the granularity is in the same order.
So, maybe, what we can do is to get it back to hrtick_enabled_dl() instead
of hrtick_enabled(), to have it only when HRTICK_DL is set.
am I missing a point?
-- Daniel
On 3/13/24 02:24, Joel Fernandes (Google) wrote:
> The DL server may not have had its timer started if start_dl_timer()
> returns 0 (say the zero-laxity time has already passed). In such cases,
> mark the DL task which is about to be enqueued as not throttled and
> cancel any previous timers, then do the enqueue.
>
> This fixes the following crash:
>
> [ 9.263331] kernel BUG at kernel/sched/deadline.c:1765!
> [ 9.282382] Call Trace:
> [ 9.282767] <TASK>
> [ 9.283086] ? __die_body+0x62/0xb0
> [ 9.283602] ? die+0x9b/0xc0
> [ 9.284036] ? do_trap+0xa3/0x170
> [ 9.284528] ? enqueue_dl_entity+0x45e/0x460
> [ 9.285158] ? enqueue_dl_entity+0x45e/0x460
> [ 9.285791] ? handle_invalid_op+0x65/0x80
> [ 9.286392] ? enqueue_dl_entity+0x45e/0x460
> [ 9.287021] ? exc_invalid_op+0x2f/0x40
> [ 9.287585] ? asm_exc_invalid_op+0x16/0x20
> [ 9.288200] ? find_later_rq+0x120/0x120
> [ 9.288775] ? fair_server_init+0x40/0x40
> [ 9.289364] ? enqueue_dl_entity+0x45e/0x460
> [ 9.289989] ? find_later_rq+0x120/0x120
> [ 9.290564] dl_task_timer+0x1d7/0x2f0
> [ 9.291120] ? find_later_rq+0x120/0x120
> [ 9.291695] __run_hrtimer+0x73/0x1b0
> [ 9.292238] hrtimer_interrupt+0x216/0x2c0
> [ 9.292841] __sysvec_apic_timer_interrupt+0x53/0x140
> [ 9.293581] sysvec_apic_timer_interrupt+0x2d/0x80
> [ 9.294285] asm_sysvec_apic_timer_interrupt+0x16/0x20
>
> The crash can easily be reproduced by adding a 100ms delay as follows:
>
> +int delay_inject_count;
> +
> static void
> enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
> {
> @@ -1827,6 +1830,12 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
> setup_new_dl_entity(dl_se);
> }
>
> + // 100ms delay every 20 enqueues.
> + if (delay_inject_count++ > 20) {
> + mdelay(100);
> + delay_inject_count = 0;
> + }
> +
> /*
> * If we are still throttled, eg. we got replenished but are a
> * zero-laxity task and still got to wait, don't enqueue.
Makes sense, I am adding this in the defer patch v6 as it is a fix for it...
-- Daniel
On 3/13/24 02:24, Joel Fernandes (Google) wrote:
> From: Suleiman Souhlal <[email protected]>
>
> dl_time_before() seems to be incorrectly used, we need to check that the
> 0-laxity time is ahead of the rq_clock. Fix it.
>
> Signed-off-by: Joel Fernandes (Google) <[email protected]>
Makes sense, I am adding it to the defer patch in v6 as it is a fix for it...
-- Daniel
On 3/13/24 02:24, Joel Fernandes (Google) wrote:
> There is no point in doing so if there are no CFS tasks to run. If there
> were, we would be doing ENQUEUE_REPLENISH from the dl_task_timer,
> instead of calling replenish_dl_entity(). Fix that.
>
> Signed-off-by: Joel Fernandes (Google) <[email protected]>
Makes sense, adding as part of the defer patch.
-- Daniel
On 3/13/24 02:24, Joel Fernandes (Google) wrote:
> For whatever reason, if start_dl_timer() returned 0 during replenish (it
> did not start a new timer), then do not marked dl_defer_armed, because
> we never really armed.
>
> Further, we need to cancel any old timers,
>
> This is similar to what dl_check_constrained_dl() does.
>
> Add some guardrails for such situations.
>
> Signed-off-by: Joel Fernandes (Google) <[email protected]>
Makes sense, added as part of the defer patch.
-- Daniel
On 3/13/24 02:24, Joel Fernandes (Google) wrote:
> We believe that this is the right thing to do. The unit test
> (cs_dlserver_test) also agrees. If we let the CFS run without starting a
> new period, while the server is regularly throttled, then the test fails
> because CFS does not appear to get enough bandwidth.
>
> Intuitively, this makes sense to do as well. If CFS used up all the CFS
> bandwidth, while the DL server was in a throttled state, it got the
> bandwidth it wanted and some. Now, we can start all over from scratch to
> guarantee it a minimum bandwidth.
So, this part of the code is not actually fundamental for the defer part, it was
added as an optimization... but it has a problem...
> Signed-off-by: Joel Fernandes (Google) <[email protected]>
> ---
> kernel/sched/deadline.c | 17 -----------------
> 1 file changed, 17 deletions(-)
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 179369d27f66..a0ea668ac1bf 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1454,23 +1454,6 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
> * starting a new period, pushing the activation to the zero-lax time.
> */
> if (dl_se->dl_defer && dl_se->dl_throttled && dl_runtime_exceeded(dl_se)) {
> - s64 runtime_diff = dl_se->runtime + dl_se->dl_runtime;
> -
> - /*
> - * If this is a regular throttling case, let it run negative until
> - * the dl_runtime - runtime > 0. The reason being is that the next
> - * replenishment will result in a positive runtime one period ahead.
> - *
> - * Otherwise, the deadline will be pushed more than one period, not
> - * providing runtime/period anymore.
> - *
> - * If the dl_runtime - runtime < 0, then the server was able to get
> - * the runtime/period before the replenishment. So it is safe
> - * to start a new deffered period.
> - */
> - if (!dl_se->dl_defer_armed && runtime_diff > 0)
> - return;
The idea was to reduce the frequency in which the timer is reset, aiming to avoid
regressions in the regular case in which the dl server never actually fires. It works
fine *if* the runtime is relatively low to the period... like 5%.... as it gets bigger,
it starts breaking things. In the case of > 50% of runtime, it breaks. That is
the case you guys seem to have.
At LPC, I actually expressed the concern to Vincent about resetting this timer. But
he mentioned that it was not a big of deal because it does not happen that often
to cause problems.
so, yeah, it is better to remove... one can always get back to this and think
on a logic that postpones the reset depending on the % of runtime of the dl server.
Removed on v6
-- Daniel
> hrtimer_try_to_cancel(&dl_se->dl_timer);
>
> replenish_dl_new_period(dl_se, dl_se->rq);
On 3/13/24 02:24, Joel Fernandes (Google) wrote:
> Correct the description for arg4 which appears to be outdated.
>
> Signed-off-by: Joel Fernandes (Google) <[email protected]>
Maybe it is better to keep these tests into a separated thread (?) Because they
are touching things that are not about the server, so it might create conflicts...
What I did in the past when adding a series that was part in the kernel (tracer)
and part of tools (rtla) was to split them into two series, then referencing
one on the other - Steven suggested... and it was a good thing... and
reduced the number of versions.
So I am not adding these self-tests on top of the dl series for now... because
there is also discussion about the patches... and I am not an expert in the
selftests.... I will add a reference to this thread in the v6 cover and keep
updating it as new versions show up... in this way, one thing does not block
the other.
Thoughts?
-- Daniel
> ---
> Documentation/admin-guide/hw-vuln/core-scheduling.rst | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/admin-guide/hw-vuln/core-scheduling.rst b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
> index cf1eeefdfc32..338d639a8127 100644
> --- a/Documentation/admin-guide/hw-vuln/core-scheduling.rst
> +++ b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
> @@ -62,8 +62,8 @@ arg3:
>
> arg4:
> ``pid_type`` for which the operation applies. It is one of
> - ``PR_SCHED_CORE_SCOPE_``-prefixed macro constants. For example, if arg4
> - is ``PR_SCHED_CORE_SCOPE_THREAD_GROUP``, then the operation of this command
> + ``PIDTIME_``-prefixed macro constants. For example, if arg4
> + is ``PIDTYPE_TGID``, then the operation of this command
> will be performed for all tasks in the task group of ``pid``.
>
> arg5:
On 3/13/24 02:24, Joel Fernandes (Google) wrote:
> * Use simple CFS pick_task for DL pick_task
>
> DL server's pick_task calls CFS's pick_next_task_fair(), this is wrong
> because core scheduling's pick_task only calls CFS's pick_task() for
> evaluation / checking of the CFS task (comparing across CPUs), not for
> actually affirmatively picking the next task. This causes RB tree corruption
> issues in CFS that were found by syzbot.
>
> * Make pick_task_fair clear DL server
>
> A DL task pick might set ->dl_server, but it is possible the task will
> never run (say the other HT has a stop task). If the CFS task is picked
> in the future directly (say without DL server), ->dl_server will be
> set. So clear it in pick_task_fair().
>
> This fixes the KASAN issue reported by syzbot in set_next_entity().
These two sched/core patches seem to make sense.... things are working with them.
But I am not an expert in the CORE_SCHED, so I am adding them on top of the dl
server series in the v6... it is easier for people to review them...
The only thing I did was to adjust the log to fit into the 75 char that
checkpatch warns...
Thoughts?
-- Daniel