2020-11-17 23:25:33

by Joel Fernandes

Subject: [PATCH -tip 23/32] sched: Add a per-thread core scheduling interface

Add a per-thread core scheduling interface which allows a thread to share a
core with another thread, or have a core exclusively for itself.

ChromeOS uses core-scheduling to securely enable hyperthreading. This cuts
down the keypress latency in Google docs from 150ms to 50ms while improving
the camera streaming frame rate by ~3%.
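For illustration, usage of this version of the interface boils down to the
following (other_pid is a placeholder; handle_error() is assumed to report
and exit):

	/* Share a core with another task; ptrace-style access checks apply. */
	if (prctl(PR_SCHED_CORE_SHARE, other_pid) < 0)
		handle_error("sched_core share failed");

	/* Sharing with our own pid gives us a unique cookie, i.e. a core to ourselves. */
	if (prctl(PR_SCHED_CORE_SHARE, getpid()) < 0)
		handle_error("sched_core self-share failed");

	/* Passing 0 resets the cookie (needs CAP_SYS_ADMIN in this version). */
	if (prctl(PR_SCHED_CORE_SHARE, 0) < 0)
		handle_error("sched_core reset failed");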

Tested-by: Julien Desfossez <[email protected]>
Reviewed-by: Aubrey Li <[email protected]>
Co-developed-by: Chris Hyser <[email protected]>
Signed-off-by: Chris Hyser <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
include/linux/sched.h | 1 +
include/uapi/linux/prctl.h | 3 ++
kernel/sched/core.c | 51 +++++++++++++++++++++++++++++---
kernel/sys.c | 3 ++
tools/include/uapi/linux/prctl.h | 3 ++
5 files changed, 57 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c6a3b0fa952b..79d76c78cc8e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2083,6 +2083,7 @@ void sched_core_unsafe_enter(void);
void sched_core_unsafe_exit(void);
bool sched_core_wait_till_safe(unsigned long ti_check);
bool sched_core_kernel_protected(void);
+int sched_core_share_pid(pid_t pid);
#else
#define sched_core_unsafe_enter(ignore) do { } while (0)
#define sched_core_unsafe_exit(ignore) do { } while (0)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index c334e6a02e5f..217b0482aea1 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -248,4 +248,7 @@ struct prctl_mm_map {
#define PR_SET_IO_FLUSHER 57
#define PR_GET_IO_FLUSHER 58

+/* Request the scheduler to share a core */
+#define PR_SCHED_CORE_SHARE 59
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7ccca355623a..a95898c75bdf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -310,6 +310,7 @@ static int __sched_core_stopper(void *data)
}

static DEFINE_MUTEX(sched_core_mutex);
+static DEFINE_MUTEX(sched_core_tasks_mutex);
static int sched_core_count;

static void __sched_core_enable(void)
@@ -4037,8 +4038,9 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
RB_CLEAR_NODE(&p->core_node);

/*
- * Tag child via per-task cookie only if parent is tagged via per-task
- * cookie. This is independent of, but can be additive to the CGroup tagging.
+ * If parent is tagged via per-task cookie, tag the child (either with
+ * the parent's cookie, or a new one). The final cookie is calculated
+ * by concatenating the per-task cookie with that of the CGroup's.
*/
if (current->core_task_cookie) {

@@ -9855,7 +9857,7 @@ static int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2
unsigned long cookie;
int ret = -ENOMEM;

- mutex_lock(&sched_core_mutex);
+ mutex_lock(&sched_core_tasks_mutex);

/*
* NOTE: sched_core_get() is done by sched_core_alloc_task_cookie() or
@@ -9954,10 +9956,51 @@ static int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2

ret = 0;
out_unlock:
- mutex_unlock(&sched_core_mutex);
+ mutex_unlock(&sched_core_tasks_mutex);
return ret;
}

+/* Called from prctl interface: PR_SCHED_CORE_SHARE */
+int sched_core_share_pid(pid_t pid)
+{
+ struct task_struct *task;
+ int err;
+
+ if (pid == 0) { /* Recent current task's cookie. */
+ /* Resetting a cookie requires privileges. */
+ if (current->core_task_cookie)
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+ task = NULL;
+ } else {
+ rcu_read_lock();
+ task = pid ? find_task_by_vpid(pid) : current;
+ if (!task) {
+ rcu_read_unlock();
+ return -ESRCH;
+ }
+
+ get_task_struct(task);
+
+ /*
+ * Check if this process has the right to modify the specified
+ * process. Use the regular "ptrace_may_access()" checks.
+ */
+ if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
+ rcu_read_unlock();
+ err = -EPERM;
+ goto out_put;
+ }
+ rcu_read_unlock();
+ }
+
+ err = sched_core_share_tasks(current, task);
+out_put:
+ if (task)
+ put_task_struct(task);
+ return err;
+}
+
/* CGroup interface */
static u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
{
diff --git a/kernel/sys.c b/kernel/sys.c
index a730c03ee607..61a3c98e36de 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2530,6 +2530,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,

error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
break;
+ case PR_SCHED_CORE_SHARE:
+ error = sched_core_share_pid(arg2);
+ break;
default:
error = -EINVAL;
break;
diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h
index 7f0827705c9a..4c45b7dcd92d 100644
--- a/tools/include/uapi/linux/prctl.h
+++ b/tools/include/uapi/linux/prctl.h
@@ -247,4 +247,7 @@ struct prctl_mm_map {
#define PR_SET_IO_FLUSHER 57
#define PR_GET_IO_FLUSHER 58

+/* Request the scheduler to share a core */
+#define PR_SCHED_CORE_SHARE 59
+
#endif /* _LINUX_PRCTL_H */
--
2.29.2.299.gdc1121823c-goog


2020-11-25 13:13:57

by Peter Zijlstra

Subject: Re: [PATCH -tip 23/32] sched: Add a per-thread core scheduling interface

On Tue, Nov 17, 2020 at 06:19:53PM -0500, Joel Fernandes (Google) wrote:
> +/* Called from prctl interface: PR_SCHED_CORE_SHARE */
> +int sched_core_share_pid(pid_t pid)
> +{
> + struct task_struct *task;
> + int err;
> +
> + if (pid == 0) { /* Recent current task's cookie. */
> + /* Resetting a cookie requires privileges. */
> + if (current->core_task_cookie)
> + if (!capable(CAP_SYS_ADMIN))
> + return -EPERM;

Coding-Style fail.

Also, why?!? I realize it is true for your case, because hardware fail.
But in general this just isn't true. This wants to be some configurable
policy.

2020-12-01 19:40:27

by Joel Fernandes

Subject: Re: [PATCH -tip 23/32] sched: Add a per-thread core scheduling interface

On Wed, Nov 25, 2020 at 02:08:08PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 06:19:53PM -0500, Joel Fernandes (Google) wrote:
> > +/* Called from prctl interface: PR_SCHED_CORE_SHARE */
> > +int sched_core_share_pid(pid_t pid)
> > +{
> > + struct task_struct *task;
> > + int err;
> > +
> > + if (pid == 0) { /* Recent current task's cookie. */
> > + /* Resetting a cookie requires privileges. */
> > + if (current->core_task_cookie)
> > + if (!capable(CAP_SYS_ADMIN))
> > + return -EPERM;
>
> Coding-Style fail.
>
> Also, why?!? I realize it is true for your case, because hardware fail.
> But in general this just isn't true. This wants to be some configurable
> policy.

True. I think you and I discussed eons ago though that it needs to be
privileged. For our case, we actually use seccomp so we don't let untrusted
tasks set a cookie anyway, let alone reset it. We do it before we enter the
seccomp sandbox. So we don't really need this security check here.
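For reference, a minimal sketch of that ordering (illustrative only: strict-mode
seccomp stands in for the real BPF filter here, and PR_SCHED_CORE_SHARE is
defined locally since it is not in installed headers yet):

	#include <sys/prctl.h>
	#include <sys/syscall.h>
	#include <linux/seccomp.h>
	#include <unistd.h>

	#ifndef PR_SCHED_CORE_SHARE
	#define PR_SCHED_CORE_SHARE 59
	#endif

	int main(void)
	{
		/* Tag ourselves (unique cookie via self-share) before sandboxing. */
		if (prctl(PR_SCHED_CORE_SHARE, getpid()) < 0)
			return 1;

		/* After this point the task can no longer change its own cookie. */
		if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) < 0)
			return 1;

		/* ... run the untrusted work (only read/write/exit in strict mode) ... */
		syscall(SYS_exit, 0);
	}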

Since you dislike this part of the patch, I am Ok with just dropping it as
below:

---8<-----------------------

diff --git a/kernel/sched/coretag.c b/kernel/sched/coretag.c
index 8fce3f4b7cae..9b587a1245f5 100644
--- a/kernel/sched/coretag.c
+++ b/kernel/sched/coretag.c
@@ -443,11 +443,7 @@ int sched_core_share_pid(pid_t pid)
struct task_struct *task;
int err;

- if (pid == 0) { /* Recent current task's cookie. */
- /* Resetting a cookie requires privileges. */
- if (current->core_task_cookie)
- if (!capable(CAP_SYS_ADMIN))
- return -EPERM;
+ if (!pid) { /* Reset current task's cookie. */
task = NULL;
} else {
rcu_read_lock();

2020-12-02 21:52:28

by Chris Hyser

Subject: Re: [PATCH -tip 23/32] sched: Add a per-thread core scheduling interface

On Tue, Nov 17, 2020 at 06:19:53PM -0500, Joel Fernandes (Google) wrote:
> Add a per-thread core scheduling interface which allows a thread to share a
> core with another thread, or have a core exclusively for itself.
>
> ChromeOS uses core-scheduling to securely enable hyperthreading. This cuts
> down the keypress latency in Google docs from 150ms to 50ms while improving
> the camera streaming frame rate by ~3%.
>

Inline is a patch for comment to extend this interface to make it more useful.
This patch would still need to provide doc and selftests updates as well.

-chrish

---8<-----------------------

From ec3d6506fee89022d93789e1ba44d49c1b1b04dd Mon Sep 17 00:00:00 2001
From: chris hyser <[email protected]>
Date: Tue, 10 Nov 2020 15:35:59 -0500
Subject: [PATCH] sched: Provide a more extensive prctl interface for core
scheduling.

The current prctl interface is a "from" only interface allowing a task to
join an existing core scheduling group by getting the "cookie" from a
specified task/pid.

Additionally, sharing from a task without an existing core scheduling group
(cookie == 0) creates a new cookie shared between the two.

"From" functionality works well for programs modified to use the prctl(),
but many applications will not be modified simply for core scheduling.
Using a wrapper program to share the core scheduling cookie from another
task followed by an "exec" can work, but there is no means to assign a
cookie for an unmodified running task.
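As an illustration, such a wrapper need be little more than the following
(src_pid and the command's argv are placeholders):

	/* Inherit src_pid's cookie, then exec the unmodified program. */
	if (prctl(PR_SCHED_CORE_SHARE, src_pid) < 0)
		handle_error("sched_core share failed");
	execvp(argv[1], &argv[1]);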

Simply inverting the interface to a "to"-only interface, i.e. the calling
task shares its cookie with the specified task/pid, also has limitations:
there is then no means for a task to join an existing core scheduling
group, for instance.

The solution is to allow both, or more specifically provide a flags
argument to allow various core scheduling commands, currently FROM, TO, and
CLEAR.

The combination of FROM and TO allows a helper program to share the core
scheduling cookie of one task/core scheduling group with additional tasks.

if (prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_SHARE_FROM, src_pid, 0, 0) < 0)
handle_error("src_pid sched_core failed");
if (prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_SHARE_TO, dest_pid, 0, 0) < 0)
handle_error("dest_pid sched_core failed");

Signed-off-by: chris hyser <[email protected]>

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c9efdf8..eed002e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2084,14 +2084,14 @@ void sched_core_unsafe_enter(void);
void sched_core_unsafe_exit(void);
bool sched_core_wait_till_safe(unsigned long ti_check);
bool sched_core_kernel_protected(void);
-int sched_core_share_pid(pid_t pid);
+int sched_core_share_pid(unsigned long flags, pid_t pid);
void sched_tsk_free(struct task_struct *tsk);
#else
#define sched_core_unsafe_enter(ignore) do { } while (0)
#define sched_core_unsafe_exit(ignore) do { } while (0)
#define sched_core_wait_till_safe(ignore) do { } while (0)
#define sched_core_kernel_protected(ignore) do { } while (0)
-#define sched_core_share_pid(pid) do { } while (0)
+#define sched_core_share_pid(flags, pid) do { } while (0)
#define sched_tsk_free(tsk) do { } while (0)
#endif

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 217b048..f8e4e96 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -250,5 +250,8 @@ struct prctl_mm_map {

/* Request the scheduler to share a core */
#define PR_SCHED_CORE_SHARE 59
+# define PR_SCHED_CORE_CLEAR 0 /* clear core_sched cookie of pid */
+# define PR_SCHED_CORE_SHARE_FROM 1 /* get core_sched cookie from pid */
+# define PR_SCHED_CORE_SHARE_TO 2 /* push core_sched cookie to pid */

#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sched/coretag.c b/kernel/sched/coretag.c
index 800c0f8..14feac1 100644
--- a/kernel/sched/coretag.c
+++ b/kernel/sched/coretag.c
@@ -9,6 +9,7 @@
*/

#include "sched.h"
+#include "linux/prctl.h"

/*
* Wrapper representing a complete cookie. The address of the cookie is used as
@@ -456,40 +457,45 @@ int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2)
}

/* Called from prctl interface: PR_SCHED_CORE_SHARE */
-int sched_core_share_pid(pid_t pid)
+int sched_core_share_pid(unsigned long flags, pid_t pid)
{
+ struct task_struct *dest;
+ struct task_struct *src;
struct task_struct *task;
int err;

- if (pid == 0) { /* Recent current task's cookie. */
- /* Resetting a cookie requires privileges. */
- if (current->core_task_cookie)
- if (!capable(CAP_SYS_ADMIN))
- return -EPERM;
- task = NULL;
- } else {
- rcu_read_lock();
- task = pid ? find_task_by_vpid(pid) : current;
- if (!task) {
- rcu_read_unlock();
- return -ESRCH;
- }
-
- get_task_struct(task);
-
- /*
- * Check if this process has the right to modify the specified
- * process. Use the regular "ptrace_may_access()" checks.
- */
- if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
- rcu_read_unlock();
- err = -EPERM;
- goto out;
- }
+ rcu_read_lock();
+ task = find_task_by_vpid(pid);
+ if (!task) {
rcu_read_unlock();
+ return -ESRCH;
}

- err = sched_core_share_tasks(current, task);
+ get_task_struct(task);
+
+ /*
+ * Check if this process has the right to modify the specified
+ * process. Use the regular "ptrace_may_access()" checks.
+ */
+ if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
+ rcu_read_unlock();
+ err = -EPERM;
+ goto out;
+ }
+ rcu_read_unlock();
+
+ if (flags == PR_SCHED_CORE_CLEAR) {
+ dest = task;
+ src = NULL;
+ } else if (flags == PR_SCHED_CORE_SHARE_TO) {
+ dest = task;
+ src = current;
+ } else if (flags == PR_SCHED_CORE_SHARE_FROM) {
+ dest = current;
+ src = task;
+ }
+
+ err = sched_core_share_tasks(dest, src);
out:
if (task)
put_task_struct(task);
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index cffdfab..50c31f3 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1030,6 +1030,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,

#ifdef CONFIG_SCHED_CORE
__PS("core_cookie", p->core_cookie);
+ __PS("core_task_cookie", p->core_task_cookie);
#endif

sched_show_numa(p, m);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b3b89bd..eafb399 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1202,7 +1202,7 @@ void sched_core_dequeue(struct rq *rq, struct task_struct *p);
void sched_core_get(void);
void sched_core_put(void);

-int sched_core_share_pid(pid_t pid);
+int sched_core_share_pid(unsigned long flags, pid_t pid);
int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2);

#ifdef CONFIG_CGROUP_SCHED
diff --git a/kernel/sys.c b/kernel/sys.c
index 61a3c98..da52a0d 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2530,9 +2530,13 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,

error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
break;
+#ifdef CONFIG_SCHED_CORE
case PR_SCHED_CORE_SHARE:
- error = sched_core_share_pid(arg2);
+ if (arg4 || arg5)
+ return -EINVAL;
+ error = sched_core_share_pid(arg2, arg3);
break;
+#endif
default:
error = -EINVAL;
break;

2020-12-02 23:22:21

by Chris Hyser

Subject: Re: [PATCH -tip 23/32] sched: Add a per-thread core scheduling interface

On 12/2/20 4:47 PM, Chris Hyser wrote:

> + get_task_struct(task);
> +
> + /*
> + * Check if this process has the right to modify the specified
> + * process. Use the regular "ptrace_may_access()" checks.
> + */
> + if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
> + rcu_read_unlock();
> + err = -EPERM;
> + goto out;
> + }
> + rcu_read_unlock();
> +
> + if (flags == PR_SCHED_CORE_CLEAR) {
> + dest = task;
> + src = NULL;
> + } else if (flags == PR_SCHED_CORE_SHARE_TO) {
> + dest = task;
> + src = current;
> + } else if (flags == PR_SCHED_CORE_SHARE_FROM) {
> + dest = current;
> + src = task;
> + }

I should have put in an else clause to catch bad input.
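Something along these lines, mirroring the error path already used in the
function (the out: label drops the task reference and returns err):

	} else {
		err = -EINVAL;
		goto out;
	}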

> +
> + err = sched_core_share_tasks(dest, src);
> out:
> if (task)
> put_task_struct(task);

2020-12-06 17:39:20

by Joel Fernandes

Subject: Re: [PATCH -tip 23/32] sched: Add a per-thread core scheduling interface

On Wed, Dec 02, 2020 at 04:47:17PM -0500, Chris Hyser wrote:
> On Tue, Nov 17, 2020 at 06:19:53PM -0500, Joel Fernandes (Google) wrote:
> > Add a per-thread core scheduling interface which allows a thread to share a
> > core with another thread, or have a core exclusively for itself.
> >
> > ChromeOS uses core-scheduling to securely enable hyperthreading. This cuts
> > down the keypress latency in Google docs from 150ms to 50ms while improving
> > the camera streaming frame rate by ~3%.
> >
>
> Inline is a patch for comment to extend this interface to make it more useful.
> This patch would still need to provide doc and selftests updates as well.
>
> -chrish
>
> ---8<-----------------------
[..]
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 217b048..f8e4e96 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -250,5 +250,8 @@ struct prctl_mm_map {
>
> /* Request the scheduler to share a core */
> #define PR_SCHED_CORE_SHARE 59
> +# define PR_SCHED_CORE_CLEAR 0 /* clear core_sched cookie of pid */
> +# define PR_SCHED_CORE_SHARE_FROM 1 /* get core_sched cookie from pid */
> +# define PR_SCHED_CORE_SHARE_TO 2 /* push core_sched cookie to pid */
>
> #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/sched/coretag.c b/kernel/sched/coretag.c
> index 800c0f8..14feac1 100644
> --- a/kernel/sched/coretag.c
> +++ b/kernel/sched/coretag.c
> @@ -9,6 +9,7 @@
> */
>
> #include "sched.h"
> +#include "linux/prctl.h"
>
> /*
> * Wrapper representing a complete cookie. The address of the cookie is used as
> @@ -456,40 +457,45 @@ int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2)
> }
>
> /* Called from prctl interface: PR_SCHED_CORE_SHARE */
> -int sched_core_share_pid(pid_t pid)
> +int sched_core_share_pid(unsigned long flags, pid_t pid)
> {
> + struct task_struct *dest;
> + struct task_struct *src;
> struct task_struct *task;
> int err;
>
> - if (pid == 0) { /* Recent current task's cookie. */
> - /* Resetting a cookie requires privileges. */
> - if (current->core_task_cookie)
> - if (!capable(CAP_SYS_ADMIN))
> - return -EPERM;
> - task = NULL;
> - } else {
> - rcu_read_lock();
> - task = pid ? find_task_by_vpid(pid) : current;
> - if (!task) {
> - rcu_read_unlock();
> - return -ESRCH;
> - }
> -
> - get_task_struct(task);
> -
> - /*
> - * Check if this process has the right to modify the specified
> - * process. Use the regular "ptrace_may_access()" checks.
> - */
> - if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
> - rcu_read_unlock();
> - err = -EPERM;
> - goto out;
> - }
> + rcu_read_lock();
> + task = find_task_by_vpid(pid);
> + if (!task) {
> rcu_read_unlock();
> + return -ESRCH;
> }
>
> - err = sched_core_share_tasks(current, task);
> + get_task_struct(task);
> +
> + /*
> + * Check if this process has the right to modify the specified
> + * process. Use the regular "ptrace_may_access()" checks.
> + */
> + if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
> + rcu_read_unlock();
> + err = -EPERM;
> + goto out;
> + }
> + rcu_read_unlock();
> +
> + if (flags == PR_SCHED_CORE_CLEAR) {
> + dest = task;
> + src = NULL;
> + } else if (flags == PR_SCHED_CORE_SHARE_TO) {
> + dest = task;
> + src = current;
> + } else if (flags == PR_SCHED_CORE_SHARE_FROM) {
> + dest = current;
> + src = task;
> + }

Looks ok to me except the missing else { } clause you found. Also, maybe
dest/src can be renamed to from/to to make meaning of variables more clear?

Also looking forward to the docs/test updates.

thanks!

- Joel


> +
> + err = sched_core_share_tasks(dest, src);
> out:
> if (task)
> put_task_struct(task);
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index cffdfab..50c31f3 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -1030,6 +1030,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
>
> #ifdef CONFIG_SCHED_CORE
> __PS("core_cookie", p->core_cookie);
> + __PS("core_task_cookie", p->core_task_cookie);
> #endif
>
> sched_show_numa(p, m);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index b3b89bd..eafb399 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1202,7 +1202,7 @@ void sched_core_dequeue(struct rq *rq, struct task_struct *p);
> void sched_core_get(void);
> void sched_core_put(void);
>
> -int sched_core_share_pid(pid_t pid);
> +int sched_core_share_pid(unsigned long flags, pid_t pid);
> int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2);
>
> #ifdef CONFIG_CGROUP_SCHED
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 61a3c98..da52a0d 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2530,9 +2530,13 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>
> error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
> break;
> +#ifdef CONFIG_SCHED_CORE
> case PR_SCHED_CORE_SHARE:
> - error = sched_core_share_pid(arg2);
> + if (arg4 || arg5)
> + return -EINVAL;
> + error = sched_core_share_pid(arg2, arg3);
> break;
> +#endif
> default:
> error = -EINVAL;
> break;

2020-12-07 21:57:44

by Chris Hyser

Subject: Re: [PATCH -tip 23/32] sched: Add a per-thread core scheduling interface

On 12/6/20 12:34 PM, Joel Fernandes wrote:
> Looks ok to me except the missing else { } clause you found. Also, maybe
> dest/src can be renamed to from/to to make meaning of variables more clear?
>

yes.

> Also looking forward to the docs/test updates.

on it.

Thanks.

-chrish

2020-12-09 20:23:49

by Chris Hyser

Subject: Re: [PATCH -tip 23/32] sched: Add a per-thread core scheduling interface

On Sun, Dec 06, 2020 at 12:34:18PM -0500, Joel Fernandes wrote:
>
...
> Looks ok to me except the missing else { } clause you found. Also, maybe
> dest/src can be renamed to from/to to make meaning of variables more clear?
>
> Also looking forward to the docs/test updates.
>
> thanks!
>
> - Joel

I fixed the else clause, made the name changes, added the documentation and
fixed the selftests so they pass. Patch inline.

-chrish

.../admin-guide/hw-vuln/core-scheduling.rst | 98 ++++++++++++++++------
include/linux/sched.h | 4 +-
include/uapi/linux/prctl.h | 3 +
kernel/sched/coretag.c | 59 ++++++++-----
kernel/sched/debug.c | 1 +
kernel/sched/sched.h | 2 +-
kernel/sys.c | 6 +-
tools/include/uapi/linux/prctl.h | 3 +
tools/testing/selftests/sched/test_coresched.c | 14 +++-
9 files changed, 136 insertions(+), 54 deletions(-)

-- >8 --
From beca9bae6750a66d8c30bbed1d6b8b26b2da05f4 Mon Sep 17 00:00:00 2001
From: chris hyser <[email protected]>
Date: Tue, 10 Nov 2020 15:35:59 -0500
Subject: [PATCH] sched: Provide a more extensive prctl interface for core
scheduling.

The current prctl interface is a "from" only interface allowing a task to
join an existing core scheduling group by getting the "cookie" from a
specified task/pid.

Additionally, sharing from a task without an existing core scheduling group
(cookie == 0) creates a new cookie shared between the two.

"From" functionality works well for programs modified to use the prctl(),
but many applications will not be modified simply for core scheduling.
Using a wrapper program to share the core scheduling cookie from another
task followed by an "exec" can work, but there is no means to assign a
cookie for an unmodified running task.

Simply inverting the interface to a "to"-only interface, i.e. the calling
task shares its cookie with the specified task/pid, also has limitations:
there is then no means for a task to join an existing core scheduling
group, for instance.

The solution is to allow both, or more specifically provide a flags
argument to allow various core scheduling commands, currently FROM, TO, and
CLEAR.

The combination of FROM and TO allows a helper program to share the core
scheduling cookie of one task/core scheduling group with additional tasks.

if (prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_SHARE_FROM, src_pid, 0, 0) < 0)
handle_error("src_pid sched_core failed");
if (prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_SHARE_TO, dest_pid, 0, 0) < 0)
handle_error("dest_pid sched_core failed");

Signed-off-by: chris hyser <[email protected]>

diff --git a/Documentation/admin-guide/hw-vuln/core-scheduling.rst b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
index c211a7b..0eefa9b 100644
--- a/Documentation/admin-guide/hw-vuln/core-scheduling.rst
+++ b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
@@ -100,33 +100,83 @@ The color is an 8-bit value allowing for up to 256 unique colors.

prctl interface
###############
-A ``prtcl(2)`` command ``PR_SCHED_CORE_SHARE`` is available to a process to request
-sharing a core with another process. For example, consider 2 processes ``P1``
-and ``P2`` with PIDs 100 and 200. If process ``P1`` calls
-``prctl(PR_SCHED_CORE_SHARE, 200)``, the kernel makes ``P1`` share a core with ``P2``.
-The kernel performs ptrace access mode checks before granting the request.
+A ``prctl(2)`` command ``PR_SCHED_CORE_SHARE`` provides an interface for the
+creation of and admission and removal of tasks from core scheduling groups.

-.. note:: This operation is not commutative. P1 calling
- ``prctl(PR_SCHED_CORE_SHARE, pidof(P2)`` is not the same as P2 calling the
- same for P1. The former case is P1 joining P2's group of processes
- (which P2 would have joined with ``prctl(2)`` prior to P1's ``prctl(2)``).
+::

-.. note:: The core-sharing granted with prctl(2) will be subject to
- core-sharing restrictions specified by the CGroup interface. For example
- if P1 and P2 are a part of 2 different tagged CGroups, then they will
- not share a core even if a prctl(2) call is made. This is analogous
- to how affinities are set using the cpuset interface.
+ #include <sys/prctl.h>

-It is important to note that, on a ``CLONE_THREAD`` ``clone(2)`` syscall, the child
-will be assigned the same tag as its parent and thus be allowed to share a core
-with them. This design choice is because, for the security usecase, a
-``CLONE_THREAD`` child can access its parent's address space anyway, so there's
-no point in not allowing them to share a core. If a different behavior is
-desired, the child thread can call ``prctl(2)`` as needed. This behavior is
-specific to the ``prctl(2)`` interface. For the CGroup interface, the child of a
-fork always shares a core with its parent. On the other hand, if a parent
-was previously tagged via ``prctl(2)`` and does a regular ``fork(2)`` syscall, the
-child will receive a unique tag.
+ int prctl(int option, unsigned long arg2, unsigned long arg3,
+ unsigned long arg4, unsigned long arg5);
+
+option:
+ ``PR_SCHED_CORE_SHARE``
+
+arg2:
+ - ``PR_SCHED_CORE_CLEAR 0 -- clear core_sched cookie of pid``
+ - ``PR_SCHED_CORE_SHARE_FROM 1 -- get core_sched cookie from pid``
+ - ``PR_SCHED_CORE_SHARE_TO 2 -- push core_sched cookie to pid``
+
+arg3:
+ ``tid`` of the task for which the operation applies
+
+arg4 and arg5:
+ MUST be equal to 0.
+
+Creation
+~~~~~~~~
+Creation is accomplished by sharing a ''cookie'' from a process not currently in
+a core scheduling group.
+
+::
+
+ if (prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_SHARE_FROM, src_tid, 0, 0) < 0)
+ handle_error("src_tid sched_core failed");
+
+Removal
+~~~~~~~
+Removing a task from a core scheduling group is done by:
+
+::
+
+ if (prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_CLEAR, clr_tid, 0, 0) < 0)
+ handle_error("clr_tid sched_core failed");
+
+Cookie Transferal
+~~~~~~~~~~~~~~~~~
+Transferring a cookie between the current and other tasks is possible using
+PR_SCHED_CORE_SHARE_FROM and PR_SCHED_CORE_SHARE_TO to inherit a cookie from a
+specified task or to share a cookie with a task. In combination this allows a
+simple helper program to pull a cookie from a task in an existing core
+scheduling group and share it with already running tasks.
+
+::
+
+ if (prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_SHARE_FROM, from_tid, 0, 0) < 0)
+ handle_error("from_tid sched_core failed");
+
+ if (prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_SHARE_TO, to_tid, 0, 0) < 0)
+ handle_error("to_tid sched_core failed");
+
+
+.. note:: The core-sharing granted with ``prctl(2)`` will be subject to
+ core-sharing restrictions specified by the CGroup interface. For example,
+ if tasks T1 and T2 are a part of 2 different tagged CGroups, then they will
+ not share a core even if ``prctl(2)`` is called. This is analogous to how
+ affinities are set using the cpuset interface.
+
+It is important to note that, on a ``clone(2)`` syscall with ``CLONE_THREAD`` set,
+the child will be assigned the same ''cookie'' as its parent and thus be in the
+same core scheduling group. In the security usecase, a ``CLONE_THREAD`` child
+can access its parent's address space anyway (``CLONE_THREAD`` requires
+``CLONE_SIGHAND`` which requires ``CLONE_VM``), so there's no point in not
+allowing them to share a core. If a different behavior is desired, the child
+thread can call ``prctl(2)`` as needed. This behavior is specific to the
+``prctl(2)`` interface. For the CGroup interface, the child of a fork always
+shares a core with its parent. On the other hand, if a parent was previously
+tagged via ``prctl(2)`` and does a regular ``fork(2)`` syscall, the child will
+receive a unique tag.

Design/Implementation
---------------------
diff --git a/include/linux/sched.h b/include/linux/sched.h
index c9efdf8..eed002e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2084,14 +2084,14 @@ void sched_core_unsafe_enter(void);
void sched_core_unsafe_exit(void);
bool sched_core_wait_till_safe(unsigned long ti_check);
bool sched_core_kernel_protected(void);
-int sched_core_share_pid(pid_t pid);
+int sched_core_share_pid(unsigned long flags, pid_t pid);
void sched_tsk_free(struct task_struct *tsk);
#else
#define sched_core_unsafe_enter(ignore) do { } while (0)
#define sched_core_unsafe_exit(ignore) do { } while (0)
#define sched_core_wait_till_safe(ignore) do { } while (0)
#define sched_core_kernel_protected(ignore) do { } while (0)
-#define sched_core_share_pid(pid) do { } while (0)
+#define sched_core_share_pid(flags, pid) do { } while (0)
#define sched_tsk_free(tsk) do { } while (0)
#endif

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 217b048..f8e4e96 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -250,5 +250,8 @@ struct prctl_mm_map {

/* Request the scheduler to share a core */
#define PR_SCHED_CORE_SHARE 59
+# define PR_SCHED_CORE_CLEAR 0 /* clear core_sched cookie of pid */
+# define PR_SCHED_CORE_SHARE_FROM 1 /* get core_sched cookie from pid */
+# define PR_SCHED_CORE_SHARE_TO 2 /* push core_sched cookie to pid */

#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sched/coretag.c b/kernel/sched/coretag.c
index 9b587a1..6452137 100644
--- a/kernel/sched/coretag.c
+++ b/kernel/sched/coretag.c
@@ -8,6 +8,7 @@
* Initial interfacing code by Peter Ziljstra.
*/

+#include <linux/prctl.h>
#include "sched.h"

/*
@@ -438,36 +439,48 @@ int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2)
}

/* Called from prctl interface: PR_SCHED_CORE_SHARE */
-int sched_core_share_pid(pid_t pid)
+int sched_core_share_pid(unsigned long flags, pid_t pid)
{
+ struct task_struct *to;
+ struct task_struct *from;
struct task_struct *task;
int err;

- if (!pid) { /* Reset current task's cookie. */
- task = NULL;
- } else {
- rcu_read_lock();
- task = pid ? find_task_by_vpid(pid) : current;
- if (!task) {
- rcu_read_unlock();
- return -ESRCH;
- }
-
- get_task_struct(task);
-
- /*
- * Check if this process has the right to modify the specified
- * process. Use the regular "ptrace_may_access()" checks.
- */
- if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
- rcu_read_unlock();
- err = -EPERM;
- goto out;
- }
+ rcu_read_lock();
+ task = find_task_by_vpid(pid);
+ if (!task) {
rcu_read_unlock();
+ return -ESRCH;
}

- err = sched_core_share_tasks(current, task);
+ get_task_struct(task);
+
+ /*
+ * Check if this process has the right to modify the specified
+ * process. Use the regular "ptrace_may_access()" checks.
+ */
+ if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
+ rcu_read_unlock();
+ err = -EPERM;
+ goto out;
+ }
+ rcu_read_unlock();
+
+ if (flags == PR_SCHED_CORE_CLEAR) {
+ to = task;
+ from = NULL;
+ } else if (flags == PR_SCHED_CORE_SHARE_TO) {
+ to = task;
+ from = current;
+ } else if (flags == PR_SCHED_CORE_SHARE_FROM) {
+ to = current;
+ from = task;
+ } else {
+ err = -EINVAL;
+ goto out;
+ }
+
+ err = sched_core_share_tasks(to, from);
out:
if (task)
put_task_struct(task);
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index cffdfab..50c31f3 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1030,6 +1030,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,

#ifdef CONFIG_SCHED_CORE
__PS("core_cookie", p->core_cookie);
+ __PS("core_task_cookie", p->core_task_cookie);
#endif

sched_show_numa(p, m);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 53f0f58..3e0b9df 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1229,7 +1229,7 @@ void sched_core_dequeue(struct rq *rq, struct task_struct *p);
void sched_core_get(void);
void sched_core_put(void);

-int sched_core_share_pid(pid_t pid);
+int sched_core_share_pid(unsigned long flags, pid_t pid);
int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2);

#ifdef CONFIG_CGROUP_SCHED
diff --git a/kernel/sys.c b/kernel/sys.c
index 61a3c98..da52a0d 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2530,9 +2530,13 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,

error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
break;
+#ifdef CONFIG_SCHED_CORE
case PR_SCHED_CORE_SHARE:
- error = sched_core_share_pid(arg2);
+ if (arg4 || arg5)
+ return -EINVAL;
+ error = sched_core_share_pid(arg2, arg3);
break;
+#endif
default:
error = -EINVAL;
break;
diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h
index 4c45b7d..cd37bbf 100644
--- a/tools/include/uapi/linux/prctl.h
+++ b/tools/include/uapi/linux/prctl.h
@@ -249,5 +249,8 @@ struct prctl_mm_map {

/* Request the scheduler to share a core */
#define PR_SCHED_CORE_SHARE 59
+# define PR_SCHED_CORE_CLEAR 0 /* clear core_sched cookie of pid */
+# define PR_SCHED_CORE_SHARE_FROM 1 /* get core_sched cookie from pid */
+# define PR_SCHED_CORE_SHARE_TO 2 /* push core_sched cookie to pid */

#endif /* _LINUX_PRCTL_H */
diff --git a/tools/testing/selftests/sched/test_coresched.c b/tools/testing/selftests/sched/test_coresched.c
index 70ed275..6ce612e 100644
--- a/tools/testing/selftests/sched/test_coresched.c
+++ b/tools/testing/selftests/sched/test_coresched.c
@@ -23,6 +23,9 @@

#ifndef PR_SCHED_CORE_SHARE
#define PR_SCHED_CORE_SHARE 59
+# define PR_SCHED_CORE_CLEAR 0
+# define PR_SCHED_CORE_SHARE_FROM 1
+# define PR_SCHED_CORE_SHARE_TO 2
#endif

#ifndef DEBUG_PRINT
@@ -391,9 +394,14 @@ struct task_state *add_task(char *p)

pid = mem->pid_share;
mem->pid_share = 0;
- if (pid == -1)
- pid = 0;
- prctl(PR_SCHED_CORE_SHARE, pid);
+
+ if (pid == -1) {
+ if (prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_CLEAR, getpid(), 0, 0))
+ perror("prctl() PR_SCHED_CORE_CLEAR failed");
+ } else {
+ if (prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_SHARE_FROM, pid, 0, 0))
+ perror("prctl() PR_SCHED_CORE_SHARE_FROM failed");
+ }
pthread_mutex_unlock(&mem->m);
pthread_cond_signal(&mem->cond_par);
}

2020-12-14 19:35:35

by Joel Fernandes

Subject: Re: [PATCH -tip 23/32] sched: Add a per-thread core scheduling interface

> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index cffdfab..50c31f3 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -1030,6 +1030,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
>
> #ifdef CONFIG_SCHED_CORE
> __PS("core_cookie", p->core_cookie);
> + __PS("core_task_cookie", p->core_task_cookie);
> #endif

Hmm, so the final cookie of the task is always p->core_cookie. This is what
the scheduler uses. All other fields are ingredients to derive the final
cookie value.
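For context, in the posted coretag.c those ingredient fields are folded into
one refcounted struct, and it is that struct's address which ends up in
p->core_cookie, roughly:

	struct sched_core_cookie {
		unsigned long task_cookie;	/* per-task part, from prctl()/fork() */
		unsigned long group_cookie;	/* per-CGroup part */
		struct rb_node node;
		refcount_t refcnt;
	};

	/*
	 * __sched_core_update_cookie() finds or creates the matching struct
	 * and sets p->core_cookie = (unsigned long)match;
	 */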

I will drop this hunk from your overall diff, but let me know if you
disagree!

thanks,

- Joel

2020-12-14 19:52:38

by Chris Hyser

Subject: Re: [PATCH -tip 23/32] sched: Add a per-thread core scheduling interface

On 12/14/20 2:31 PM, Joel Fernandes wrote:
>> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
>> index cffdfab..50c31f3 100644
>> --- a/kernel/sched/debug.c
>> +++ b/kernel/sched/debug.c
>> @@ -1030,6 +1030,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
>>
>> #ifdef CONFIG_SCHED_CORE
>> __PS("core_cookie", p->core_cookie);
>> + __PS("core_task_cookie", p->core_task_cookie);
>> #endif
>
> Hmm, so the final cookie of the task is always p->core_cookie. This is what
> the scheduler uses. All other fields are ingredients to derive the final
> cookie value.
>
> I will drop this hunk from your overall diff, but let me know if you
> disagree!


No problem. That was there primarily for debugging.

Thanks.

-chrish

2020-12-15 00:31:27

by Joel Fernandes

Subject: Re: [PATCH -tip 23/32] sched: Add a per-thread core scheduling interface

On Mon, Dec 14, 2020 at 02:44:09PM -0500, chris hyser wrote:
> On 12/14/20 2:31 PM, Joel Fernandes wrote:
> > > diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> > > index cffdfab..50c31f3 100644
> > > --- a/kernel/sched/debug.c
> > > +++ b/kernel/sched/debug.c
> > > @@ -1030,6 +1030,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
> > > #ifdef CONFIG_SCHED_CORE
> > > __PS("core_cookie", p->core_cookie);
> > > + __PS("core_task_cookie", p->core_task_cookie);
> > > #endif
> >
> > Hmm, so the final cookie of the task is always p->core_cookie. This is what
> > the scheduler uses. All other fields are ingredients to derive the final
> > cookie value.
> >
> > I will drop this hunk from your overall diff, but let me know if you
> > disagree!
>
>
> No problem. That was there primarily for debugging.

Ok. I squashed Josh's changes into this patch and several of my fixups. So
there'll be 3 patches:
1. CGroup + prctl (single patch as it is hell to split it)
2. Documentation
3. kselftests

Below is the diff of #1. I still have to squash in the stop_machine removal
and some more review changes. But other than that, please take a look and let
me know anything that's odd. I will test further as well.

Also, the next series will only be the interface, as I want to see if I can get
lucky enough to have Peter look at it before he leaves for PTO next week.
For the other features, I will post separate series as I prepare them: one
series for the interface, and another for kernel protection / migration changes.

---8<-----------------------

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a60868165590..73baca11d743 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -688,6 +688,8 @@ struct task_struct {
#ifdef CONFIG_SCHED_CORE
struct rb_node core_node;
unsigned long core_cookie;
+ unsigned long core_task_cookie;
+ unsigned long core_group_cookie;
unsigned int core_occupation;
#endif

@@ -2081,11 +2083,15 @@ void sched_core_unsafe_enter(void);
void sched_core_unsafe_exit(void);
bool sched_core_wait_till_safe(unsigned long ti_check);
bool sched_core_kernel_protected(void);
+int sched_core_share_pid(unsigned long flags, pid_t pid);
+void sched_tsk_free(struct task_struct *tsk);
#else
#define sched_core_unsafe_enter(ignore) do { } while (0)
#define sched_core_unsafe_exit(ignore) do { } while (0)
#define sched_core_wait_till_safe(ignore) do { } while (0)
#define sched_core_kernel_protected(ignore) do { } while (0)
+#define sched_core_share_pid(flags, pid) do { } while (0)
+#define sched_tsk_free(tsk) do { } while (0)
#endif

#endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index c334e6a02e5f..3752006842e1 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -248,4 +248,10 @@ struct prctl_mm_map {
#define PR_SET_IO_FLUSHER 57
#define PR_GET_IO_FLUSHER 58

+/* Request the scheduler to share a core */
+#define PR_SCHED_CORE_SHARE 59
+#define PR_SCHED_CORE_CLEAR 0 /* clear core_sched cookie of pid */
+#define PR_SCHED_CORE_SHARE_FROM 1 /* get core_sched cookie from pid */
+#define PR_SCHED_CORE_SHARE_TO 2 /* push core_sched cookie to pid */
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 7199d359690c..5468c93829c5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -736,6 +736,7 @@ void __put_task_struct(struct task_struct *tsk)
exit_creds(tsk);
delayacct_tsk_free(tsk);
put_signal_struct(tsk->signal);
+ sched_tsk_free(tsk);

if (!profile_handoff_task(tsk))
free_task(tsk);
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 5fc9c9b70862..c526c20adf9d 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -36,3 +36,4 @@ obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
obj-$(CONFIG_MEMBARRIER) += membarrier.o
obj-$(CONFIG_CPU_ISOLATION) += isolation.o
obj-$(CONFIG_PSI) += psi.o
+obj-$(CONFIG_SCHED_CORE) += coretag.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7f807a84cc30..80daca9c5930 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -157,7 +157,33 @@ static inline bool __sched_core_less(struct task_struct *a, struct task_struct *
return false;
}

-static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
+static bool sched_core_empty(struct rq *rq)
+{
+ return RB_EMPTY_ROOT(&rq->core_tree);
+}
+
+static struct task_struct *sched_core_first(struct rq *rq)
+{
+ struct task_struct *task;
+
+ task = container_of(rb_first(&rq->core_tree), struct task_struct, core_node);
+ return task;
+}
+
+static void sched_core_flush(int cpu)
+{
+ struct rq *rq = cpu_rq(cpu);
+ struct task_struct *task;
+
+ while (!sched_core_empty(rq)) {
+ task = sched_core_first(rq);
+ rb_erase(&task->core_node, &rq->core_tree);
+ RB_CLEAR_NODE(&task->core_node);
+ }
+ rq->core->core_task_seq++;
+}
+
+void sched_core_enqueue(struct rq *rq, struct task_struct *p)
{
struct rb_node *parent, **node;
struct task_struct *node_task;
@@ -184,14 +210,15 @@ static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
rb_insert_color(&p->core_node, &rq->core_tree);
}

-static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
+void sched_core_dequeue(struct rq *rq, struct task_struct *p)
{
rq->core->core_task_seq++;

- if (!p->core_cookie)
+ if (!sched_core_enqueued(p))
return;

rb_erase(&p->core_node, &rq->core_tree);
+ RB_CLEAR_NODE(&p->core_node);
}

/*
@@ -255,8 +282,24 @@ static int __sched_core_stopper(void *data)
bool enabled = !!(unsigned long)data;
int cpu;

- for_each_possible_cpu(cpu)
- cpu_rq(cpu)->core_enabled = enabled;
+ for_each_possible_cpu(cpu) {
+ struct rq *rq = cpu_rq(cpu);
+
+ WARN_ON_ONCE(enabled == rq->core_enabled);
+
+ if (!enabled || (enabled && cpumask_weight(cpu_smt_mask(cpu)) >= 2)) {
+ /*
+ * All active and migrating tasks will have already
+ * been removed from core queue when we clear the
+ * cgroup tags. However, dying tasks could still be
+ * left in core queue. Flush them here.
+ */
+ if (!enabled)
+ sched_core_flush(cpu);
+
+ rq->core_enabled = enabled;
+ }
+ }

return 0;
}
@@ -266,7 +309,11 @@ static int sched_core_count;

static void __sched_core_enable(void)
{
- // XXX verify there are no cookie tasks (yet)
+ int cpu;
+
+ /* verify there are no cookie tasks (yet) */
+ for_each_online_cpu(cpu)
+ BUG_ON(!sched_core_empty(cpu_rq(cpu)));

static_branch_enable(&__sched_core_enabled);
stop_machine(__sched_core_stopper, (void *)true, NULL);
@@ -274,8 +321,6 @@ static void __sched_core_enable(void)

static void __sched_core_disable(void)
{
- // XXX verify there are no cookie tasks (left)
-
stop_machine(__sched_core_stopper, (void *)false, NULL);
static_branch_disable(&__sched_core_enabled);
}
@@ -295,12 +340,6 @@ void sched_core_put(void)
__sched_core_disable();
mutex_unlock(&sched_core_mutex);
}
-
-#else /* !CONFIG_SCHED_CORE */
-
-static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
-static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
-
#endif /* CONFIG_SCHED_CORE */

/*
@@ -3779,6 +3818,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->capture_control = NULL;
#endif
init_numa_balancing(clone_flags, p);
+#ifdef CONFIG_SCHED_CORE
+ p->core_task_cookie = 0;
+#endif
#ifdef CONFIG_SMP
p->wake_entry.u_flags = CSD_TYPE_TTWU;
p->migration_pending = NULL;
@@ -3903,6 +3945,7 @@ static inline void init_schedstats(void) {}
int sched_fork(unsigned long clone_flags, struct task_struct *p)
{
unsigned long flags;
+ int __maybe_unused ret;

__sched_fork(clone_flags, p);
/*
@@ -3978,6 +4021,13 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
#ifdef CONFIG_SMP
plist_node_init(&p->pushable_tasks, MAX_PRIO);
RB_CLEAR_NODE(&p->pushable_dl_tasks);
+#endif
+#ifdef CONFIG_SCHED_CORE
+ RB_CLEAR_NODE(&p->core_node);
+
+ ret = sched_core_fork(p, clone_flags);
+ if (ret)
+ return ret;
#endif
return 0;
}
@@ -7979,6 +8029,9 @@ void init_idle(struct task_struct *idle, int cpu)
#ifdef CONFIG_SMP
sprintf(idle->comm, "%s/%d", INIT_TASK_COMM, cpu);
#endif
+#ifdef CONFIG_SCHED_CORE
+ RB_CLEAR_NODE(&idle->core_node);
+#endif
}

#ifdef CONFIG_SMP
@@ -8983,6 +9036,9 @@ void sched_offline_group(struct task_group *tg)
spin_unlock_irqrestore(&task_group_lock, flags);
}

+void cpu_core_get_group_cookie(struct task_group *tg,
+ unsigned long *group_cookie_ptr);
+
static void sched_change_group(struct task_struct *tsk, int type)
{
struct task_group *tg;
@@ -8995,6 +9051,11 @@ static void sched_change_group(struct task_struct *tsk, int type)
tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
struct task_group, css);
tg = autogroup_task_group(tsk, tg);
+
+#ifdef CONFIG_SCHED_CORE
+ sched_core_change_group(tsk, tg);
+#endif
+
tsk->sched_task_group = tg;

#ifdef CONFIG_FAIR_GROUP_SCHED
@@ -9047,11 +9108,6 @@ void sched_move_task(struct task_struct *tsk)
task_rq_unlock(rq, tsk, &rf);
}

-static inline struct task_group *css_tg(struct cgroup_subsys_state *css)
-{
- return css ? container_of(css, struct task_group, css) : NULL;
-}
-
static struct cgroup_subsys_state *
cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
{
@@ -9087,6 +9143,18 @@ static int cpu_cgroup_css_online(struct cgroup_subsys_state *css)
return 0;
}

+static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
+{
+#ifdef CONFIG_SCHED_CORE
+ struct task_group *tg = css_tg(css);
+
+ if (tg->core_tagged) {
+ sched_core_put();
+ tg->core_tagged = 0;
+ }
+#endif
+}
+
static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
{
struct task_group *tg = css_tg(css);
@@ -9688,6 +9756,22 @@ static struct cftype cpu_legacy_files[] = {
.write_u64 = cpu_rt_period_write_uint,
},
#endif
+#ifdef CONFIG_SCHED_CORE
+ {
+ .name = "core_tag",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_core_tag_read_u64,
+ .write_u64 = cpu_core_tag_write_u64,
+ },
+#ifdef CONFIG_SCHED_DEBUG
+ /* Read the group cookie. */
+ {
+ .name = "core_group_cookie",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_core_group_cookie_read_u64,
+ },
+#endif
+#endif
#ifdef CONFIG_UCLAMP_TASK_GROUP
{
.name = "uclamp.min",
@@ -9861,6 +9945,22 @@ static struct cftype cpu_files[] = {
.write_s64 = cpu_weight_nice_write_s64,
},
#endif
+#ifdef CONFIG_SCHED_CORE
+ {
+ .name = "core_tag",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_core_tag_read_u64,
+ .write_u64 = cpu_core_tag_write_u64,
+ },
+#ifdef CONFIG_SCHED_DEBUG
+ /* Read the group cookie. */
+ {
+ .name = "core_group_cookie",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_core_group_cookie_read_u64,
+ },
+#endif
+#endif
#ifdef CONFIG_CFS_BANDWIDTH
{
.name = "max",
@@ -9889,6 +9989,7 @@ static struct cftype cpu_files[] = {
struct cgroup_subsys cpu_cgrp_subsys = {
.css_alloc = cpu_cgroup_css_alloc,
.css_online = cpu_cgroup_css_online,
+ .css_offline = cpu_cgroup_css_offline,
.css_released = cpu_cgroup_css_released,
.css_free = cpu_cgroup_css_free,
.css_extra_stat_show = cpu_extra_stat_show,
diff --git a/kernel/sched/coretag.c b/kernel/sched/coretag.c
new file mode 100644
index 000000000000..4eeb956382ee
--- /dev/null
+++ b/kernel/sched/coretag.c
@@ -0,0 +1,734 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * kernel/sched/core-tag.c
+ *
+ * Core-scheduling tagging interface support.
+ *
+ * Copyright(C) 2020, Joel Fernandes.
+ * Initial interfacing code by Peter Ziljstra.
+ */
+
+#include <linux/prctl.h>
+#include "sched.h"
+
+/*
+ * Wrapper representing a complete cookie. The address of the cookie is used as
+ * a unique identifier. Each cookie has a unique permutation of the internal
+ * cookie fields.
+ */
+struct sched_core_cookie {
+ unsigned long task_cookie;
+ unsigned long group_cookie;
+
+ struct rb_node node;
+ refcount_t refcnt;
+};
+
+/*
+ * A simple wrapper around refcount. An allocated sched_core_task_cookie's
+ * address is used to compute the cookie of the task.
+ */
+struct sched_core_task_cookie {
+ refcount_t refcnt;
+ struct work_struct work; /* to free in WQ context. */;
+};
+
+static DEFINE_MUTEX(sched_core_tasks_mutex);
+
+/* All active sched_core_cookies */
+static struct rb_root sched_core_cookies = RB_ROOT;
+static DEFINE_RAW_SPINLOCK(sched_core_cookies_lock);
+
+/*
+ * Returns the following:
+ * a < b => -1
+ * a == b => 0
+ * a > b => 1
+ */
+static int sched_core_cookie_cmp(const struct sched_core_cookie *a,
+ const struct sched_core_cookie *b)
+{
+#define COOKIE_CMP_RETURN(field) do { \
+ if (a->field < b->field) \
+ return -1; \
+ else if (a->field > b->field) \
+ return 1; \
+} while (0) \
+
+ COOKIE_CMP_RETURN(task_cookie);
+ COOKIE_CMP_RETURN(group_cookie);
+
+ /* all cookie fields match */
+ return 0;
+
+#undef COOKIE_CMP_RETURN
+}
+
+static inline void __sched_core_erase_cookie(struct sched_core_cookie *cookie)
+{
+ lockdep_assert_held(&sched_core_cookies_lock);
+
+ /* Already removed */
+ if (RB_EMPTY_NODE(&cookie->node))
+ return;
+
+ rb_erase(&cookie->node, &sched_core_cookies);
+ RB_CLEAR_NODE(&cookie->node);
+}
+
+/* Called when a task no longer points to the cookie in question */
+static void sched_core_put_cookie(struct sched_core_cookie *cookie)
+{
+ unsigned long flags;
+
+ if (!cookie)
+ return;
+
+ if (refcount_dec_and_test(&cookie->refcnt)) {
+ raw_spin_lock_irqsave(&sched_core_cookies_lock, flags);
+ __sched_core_erase_cookie(cookie);
+ raw_spin_unlock_irqrestore(&sched_core_cookies_lock, flags);
+ kfree(cookie);
+ }
+}
+
+/*
+ * A task's core cookie is a compound structure composed of various cookie
+ * fields (task_cookie, group_cookie). The overall core_cookie is
+ * a pointer to a struct containing those values. This function either finds
+ * an existing core_cookie or creates a new one, and then updates the task's
+ * core_cookie to point to it. Additionally, it handles the necessary reference
+ * counting.
+ *
+ * REQUIRES: task_rq(p) lock or called from cpu_stopper.
+ * Doing so ensures that we do not cause races/corruption by modifying/reading
+ * task cookie fields.
+ */
+static void __sched_core_update_cookie(struct task_struct *p)
+{
+ struct rb_node *parent, **node;
+ struct sched_core_cookie *node_core_cookie, *match;
+ static const struct sched_core_cookie zero_cookie;
+ struct sched_core_cookie temp = {
+ .task_cookie = p->core_task_cookie,
+ .group_cookie = p->core_group_cookie,
+ };
+ const bool is_zero_cookie =
+ (sched_core_cookie_cmp(&temp, &zero_cookie) == 0);
+ struct sched_core_cookie *const curr_cookie =
+ (struct sched_core_cookie *)p->core_cookie;
+ unsigned long flags;
+
+ /*
+ * Already have a cookie matching the requested settings? Nothing to
+ * do.
+ */
+ if ((curr_cookie && sched_core_cookie_cmp(curr_cookie, &temp) == 0) ||
+ (!curr_cookie && is_zero_cookie))
+ return;
+
+ raw_spin_lock_irqsave(&sched_core_cookies_lock, flags);
+
+ if (is_zero_cookie) {
+ match = NULL;
+ goto finish;
+ }
+
+retry:
+ match = NULL;
+
+ node = &sched_core_cookies.rb_node;
+ parent = *node;
+ while (*node) {
+ int cmp;
+
+ node_core_cookie =
+ container_of(*node, struct sched_core_cookie, node);
+ parent = *node;
+
+ cmp = sched_core_cookie_cmp(&temp, node_core_cookie);
+ if (cmp < 0) {
+ node = &parent->rb_left;
+ } else if (cmp > 0) {
+ node = &parent->rb_right;
+ } else {
+ match = node_core_cookie;
+ break;
+ }
+ }
+
+ if (!match) {
+ /* No existing cookie; create and insert one */
+ match = kmalloc(sizeof(struct sched_core_cookie), GFP_ATOMIC);
+
+ /* Fall back to zero cookie */
+ if (WARN_ON_ONCE(!match))
+ goto finish;
+
+ match->task_cookie = temp.task_cookie;
+ match->group_cookie = temp.group_cookie;
+ refcount_set(&match->refcnt, 1);
+
+ rb_link_node(&match->node, parent, node);
+ rb_insert_color(&match->node, &sched_core_cookies);
+ } else {
+ /*
+ * Cookie exists, increment refcnt. If refcnt is currently 0,
+ * we're racing with a put() (refcnt decremented but cookie not
+ * yet removed from the tree). In this case, we can simply
+ * perform the removal ourselves and retry.
+ * sched_core_put_cookie() will still function correctly.
+ */
+ if (unlikely(!refcount_inc_not_zero(&match->refcnt))) {
+ __sched_core_erase_cookie(match);
+ goto retry;
+ }
+ }
+
+finish:
+ /*
+ * Set the core_cookie under the cookies lock. This guarantees that
+ * p->core_cookie cannot be freed while the cookies lock is held in
+ * sched_core_fork().
+ */
+ p->core_cookie = (unsigned long)match;
+
+ raw_spin_unlock_irqrestore(&sched_core_cookies_lock, flags);
+
+ sched_core_put_cookie(curr_cookie);
+}
+
+/*
+ * sched_core_update_cookie - Common helper to update a task's core cookie. This
+ * updates the selected cookie field and then updates the overall cookie.
+ * @p: The task whose cookie should be updated.
+ * @cookie: The new cookie.
+ * @cookie_type: The cookie field to which the cookie corresponds.
+ *
+ * REQUIRES: either task_rq(p)->lock held or called from a stop-machine handler.
+ * Doing so ensures that we do not cause races/corruption by modifying/reading
+ * task cookie fields.
+ */
+static void sched_core_update_cookie(struct task_struct *p, unsigned long cookie,
+ enum sched_core_cookie_type cookie_type)
+{
+ if (!p)
+ return;
+
+ switch (cookie_type) {
+ case sched_core_no_update:
+ break;
+ case sched_core_task_cookie_type:
+ p->core_task_cookie = cookie;
+ break;
+ case sched_core_group_cookie_type:
+ p->core_group_cookie = cookie;
+ break;
+ default:
+ WARN_ON_ONCE(1);
+ }
+
+ /* Set p->core_cookie, which is the overall cookie */
+ __sched_core_update_cookie(p);
+
+ if (sched_core_enqueued(p)) {
+ sched_core_dequeue(task_rq(p), p);
+ if (!p->core_cookie)
+ return;
+ }
+
+ if (sched_core_enabled(task_rq(p)) &&
+ p->core_cookie && task_on_rq_queued(p))
+ sched_core_enqueue(task_rq(p), p);
+}
+
+#ifdef CONFIG_CGROUP_SCHED
+void cpu_core_get_group_cookie(struct task_group *tg,
+ unsigned long *group_cookie_ptr);
+
+void sched_core_change_group(struct task_struct *p, struct task_group *new_tg)
+{
+ unsigned long new_group_cookie;
+
+ cpu_core_get_group_cookie(new_tg, &new_group_cookie);
+
+ if (p->core_group_cookie == new_group_cookie)
+ return;
+
+ p->core_group_cookie = new_group_cookie;
+
+ __sched_core_update_cookie(p);
+}
+#endif
+
+/* Per-task interface: Used by fork(2) and prctl(2). */
+static void sched_core_put_cookie_work(struct work_struct *ws);
+
+/* Caller has to call sched_core_get() if non-zero value is returned. */
+static unsigned long sched_core_alloc_task_cookie(void)
+{
+ struct sched_core_task_cookie *ck =
+ kmalloc(sizeof(struct sched_core_task_cookie), GFP_KERNEL);
+
+ if (!ck)
+ return 0;
+ refcount_set(&ck->refcnt, 1);
+ INIT_WORK(&ck->work, sched_core_put_cookie_work);
+
+ return (unsigned long)ck;
+}
+
+static void sched_core_get_task_cookie(unsigned long cookie)
+{
+ struct sched_core_task_cookie *ptr =
+ (struct sched_core_task_cookie *)cookie;
+
+ refcount_inc(&ptr->refcnt);
+}
+
+static void sched_core_put_task_cookie(unsigned long cookie)
+{
+ struct sched_core_task_cookie *ptr =
+ (struct sched_core_task_cookie *)cookie;
+
+ if (refcount_dec_and_test(&ptr->refcnt))
+ kfree(ptr);
+}
+
+static void sched_core_put_cookie_work(struct work_struct *ws)
+{
+ struct sched_core_task_cookie *ck =
+ container_of(ws, struct sched_core_task_cookie, work);
+
+ sched_core_put_task_cookie((unsigned long)ck);
+ sched_core_put();
+}
+
+struct sched_core_task_write_tag {
+ struct task_struct *tasks[2];
+ unsigned long cookies[2];
+};
+
+/*
+ * Ensure that the task has been requeued. The stopper ensures that the task cannot
+ * be migrated to a different CPU while its core scheduler queue state is being updated.
+ * It also makes sure to requeue a task if it was running actively on another CPU.
+ */
+static int sched_core_task_join_stopper(void *data)
+{
+ struct sched_core_task_write_tag *tag = (struct sched_core_task_write_tag *)data;
+ int i;
+
+ for (i = 0; i < 2; i++)
+ sched_core_update_cookie(tag->tasks[i], tag->cookies[i],
+ sched_core_task_cookie_type);
+
+ return 0;
+}
+
+int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2)
+{
+ struct sched_core_task_write_tag wr = {}; /* for stop machine. */
+ bool sched_core_put_after_stopper = false;
+ unsigned long cookie;
+ int ret = -ENOMEM;
+
+ mutex_lock(&sched_core_tasks_mutex);
+
+ if (!t2) {
+ if (t1->core_task_cookie) {
+ sched_core_put_task_cookie(t1->core_task_cookie);
+ sched_core_put_after_stopper = true;
+ wr.tasks[0] = t1; /* Keep wr.cookies[0] reset for t1. */
+ }
+ } else if (t1 == t2) {
+ /* Assign a unique per-task cookie solely for t1. */
+
+ cookie = sched_core_alloc_task_cookie();
+ if (!cookie)
+ goto out_unlock;
+ sched_core_get();
+
+ if (t1->core_task_cookie) {
+ sched_core_put_task_cookie(t1->core_task_cookie);
+ sched_core_put_after_stopper = true;
+ }
+ wr.tasks[0] = t1;
+ wr.cookies[0] = cookie;
+ } else if (!t1->core_task_cookie && !t2->core_task_cookie) {
+ /*
+ * t1 joining t2
+ * CASE 1:
+ * before 0 0
+ * after new cookie new cookie
+ *
+ * CASE 2:
+ * before X (non-zero) 0
+ * after 0 0
+ *
+ * CASE 3:
+ * before 0 X (non-zero)
+ * after X X
+ *
+ * CASE 4:
+ * before Y (non-zero) X (non-zero)
+ * after X X
+ */
+
+ /* CASE 1. */
+ cookie = sched_core_alloc_task_cookie();
+ if (!cookie)
+ goto out_unlock;
+ sched_core_get(); /* For the alloc. */
+
+ /* Add another reference for the other task. */
+ sched_core_get_task_cookie(cookie);
+ sched_core_get(); /* For the other task. */
+
+ wr.tasks[0] = t1;
+ wr.tasks[1] = t2;
+ wr.cookies[0] = wr.cookies[1] = cookie;
+
+ } else if (t1->core_task_cookie && !t2->core_task_cookie) {
+ /* CASE 2. */
+ sched_core_put_task_cookie(t1->core_task_cookie);
+ sched_core_put_after_stopper = true;
+
+ wr.tasks[0] = t1; /* Reset cookie for t1. */
+
+ } else if (!t1->core_task_cookie && t2->core_task_cookie) {
+ /* CASE 3. */
+ sched_core_get_task_cookie(t2->core_task_cookie);
+ sched_core_get();
+
+ wr.tasks[0] = t1;
+ wr.cookies[0] = t2->core_task_cookie;
+
+ } else {
+ /* CASE 4. */
+ sched_core_get_task_cookie(t2->core_task_cookie);
+ sched_core_get();
+
+ sched_core_put_task_cookie(t1->core_task_cookie);
+ sched_core_put_after_stopper = true;
+
+ wr.tasks[0] = t1;
+ wr.cookies[0] = t2->core_task_cookie;
+ }
+
+ stop_machine(sched_core_task_join_stopper, (void *)&wr, NULL);
+
+ if (sched_core_put_after_stopper)
+ sched_core_put();
+
+ ret = 0;
+out_unlock:
+ mutex_unlock(&sched_core_tasks_mutex);
+ return ret;
+}
+
+/* Called from prctl interface: PR_SCHED_CORE_SHARE */
+int sched_core_share_pid(unsigned long flags, pid_t pid)
+{
+ struct task_struct *to;
+ struct task_struct *from;
+ struct task_struct *task;
+ int err;
+
+ rcu_read_lock();
+ task = find_task_by_vpid(pid);
+ if (!task) {
+ rcu_read_unlock();
+ return -ESRCH;
+ }
+
+ get_task_struct(task);
+
+ /*
+ * Check if this process has the right to modify the specified
+ * process. Use the regular "ptrace_may_access()" checks.
+ */
+ if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
+ rcu_read_unlock();
+ err = -EPERM;
+ goto out;
+ }
+ rcu_read_unlock();
+
+ if (flags == PR_SCHED_CORE_CLEAR) {
+ to = task;
+ from = NULL;
+ } else if (flags == PR_SCHED_CORE_SHARE_TO) {
+ to = task;
+ from = current;
+ } else if (flags == PR_SCHED_CORE_SHARE_FROM) {
+ to = current;
+ from = task;
+ } else {
+ err = -EINVAL;
+ goto out;
+ }
+
+ err = sched_core_share_tasks(to, from);
+out:
+ if (task)
+ put_task_struct(task);
+ return err;
+}
+
+/* CGroup core-scheduling interface support. */
+#ifdef CONFIG_CGROUP_SCHED
+/*
+ * Helper to get the group cookie in a hierarchy. Any ancestor can have a
+ * cookie.
+ *
+ * Sets *group_cookie_ptr to the hierarchical group cookie.
+ */
+void cpu_core_get_group_cookie(struct task_group *tg,
+ unsigned long *group_cookie_ptr)
+{
+ unsigned long group_cookie = 0UL;
+
+ if (!tg)
+ goto out;
+
+	for (; tg; tg = tg->parent) {
+ if (tg->core_tagged) {
+ group_cookie = (unsigned long)tg;
+ break;
+ }
+ }
+
+out:
+ *group_cookie_ptr = group_cookie;
+}
+
+/* Determine if any task group in @tg's subtree is tagged. */
+static bool cpu_core_check_descendants(struct task_group *tg, bool check_tag)
+{
+	struct task_group *child;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(child, &tg->children, siblings) {
+		if (child->core_tagged && check_tag) {
+			rcu_read_unlock();
+			return true;
+		}
+
+		/* Recurse so that every child's subtree is examined. */
+		if (cpu_core_check_descendants(child, check_tag)) {
+			rcu_read_unlock();
+			return true;
+		}
+	}
+
+	rcu_read_unlock();
+	return false;
+}
+
+u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ struct task_group *tg = css_tg(css);
+
+ return !!tg->core_tagged;
+}
+
+#ifdef CONFIG_SCHED_DEBUG
+u64 cpu_core_group_cookie_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ unsigned long group_cookie;
+
+ cpu_core_get_group_cookie(css_tg(css), &group_cookie);
+
+ return group_cookie;
+}
+#endif
+
+struct write_core_tag {
+ struct cgroup_subsys_state *css;
+ unsigned long cookie;
+ enum sched_core_cookie_type cookie_type;
+};
+
+static int __sched_write_tag(void *data)
+{
+ struct write_core_tag *tag = (struct write_core_tag *)data;
+ struct task_struct *p;
+ struct cgroup_subsys_state *css;
+
+ rcu_read_lock();
+ css_for_each_descendant_pre(css, tag->css) {
+ struct css_task_iter it;
+
+ css_task_iter_start(css, 0, &it);
+		/*
+		 * Note: css_task_iter_next() skips dying tasks, so dying
+		 * tasks may still be left in the core queue after the loop
+		 * below runs when the cgroup tag is being cleared. They are
+		 * flushed later when core scheduling is disabled.
+		 */
+ while ((p = css_task_iter_next(&it)))
+ sched_core_update_cookie(p, tag->cookie,
+ tag->cookie_type);
+
+ css_task_iter_end(&it);
+ }
+ rcu_read_unlock();
+
+ return 0;
+}
+
+int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
+ u64 val)
+{
+ struct task_group *tg = css_tg(css);
+ struct write_core_tag wtag;
+ unsigned long group_cookie;
+
+ if (val > 1)
+ return -ERANGE;
+
+ if (!static_branch_likely(&sched_smt_present))
+ return -EINVAL;
+
+ if (!tg->core_tagged && val) {
+ /* Tag is being set. Check ancestors and descendants. */
+ cpu_core_get_group_cookie(tg, &group_cookie);
+ if (group_cookie ||
+ cpu_core_check_descendants(tg, true /* tag */))
+ return -EBUSY;
+ } else if (tg->core_tagged && !val) {
+ /* Tag is being reset. Check descendants. */
+ if (cpu_core_check_descendants(tg, true /* tag */))
+ return -EBUSY;
+ } else {
+ return 0;
+ }
+
+ if (!!val)
+ sched_core_get();
+
+ wtag.css = css;
+ wtag.cookie = (unsigned long)tg;
+ wtag.cookie_type = sched_core_group_cookie_type;
+
+ tg->core_tagged = val;
+
+ stop_machine(__sched_write_tag, (void *)&wtag, NULL);
+ if (!val)
+ sched_core_put();
+
+ return 0;
+}
+#endif
+
+/*
+ * Tagging support when fork(2) is called:
+ * If it is a CLONE_THREAD fork, share parent's tag. Otherwise assign a unique per-task tag.
+ */
+static int sched_update_core_tag_stopper(void *data)
+{
+ struct task_struct *p = (struct task_struct *)data;
+
+ /* Recalculate core cookie */
+ sched_core_update_cookie(p, 0, sched_core_no_update);
+
+ return 0;
+}
+
+/* Called from sched_fork() */
+int sched_core_fork(struct task_struct *p, unsigned long clone_flags)
+{
+ struct sched_core_cookie *parent_cookie =
+ (struct sched_core_cookie *)current->core_cookie;
+
+ /*
+ * core_cookie is ref counted; avoid an uncounted reference.
+ * If p should have a cookie, it will be set below.
+ */
+ p->core_cookie = 0UL;
+
+ /*
+ * If parent is tagged via per-task cookie, tag the child (either with
+ * the parent's cookie, or a new one).
+ *
+ * We can return directly in this case, because sched_core_share_tasks()
+ * will set the core_cookie (so there is no need to try to inherit from
+ * the parent). The cookie will have the proper sub-fields (ie. group
+ * cookie, etc.), because these come from p's task_struct, which is
+ * dup'd from the parent.
+ */
+ if (current->core_task_cookie) {
+ int ret;
+
+ /* If it is not CLONE_THREAD fork, assign a unique per-task tag. */
+ if (!(clone_flags & CLONE_THREAD)) {
+ ret = sched_core_share_tasks(p, p);
+ } else {
+ /* Otherwise share the parent's per-task tag. */
+ ret = sched_core_share_tasks(p, current);
+ }
+
+ if (ret)
+ return ret;
+
+ /*
+ * We expect sched_core_share_tasks() to always update p's
+ * core_cookie.
+ */
+ WARN_ON_ONCE(!p->core_cookie);
+
+ return 0;
+ }
+
+ /*
+ * If parent is tagged, inherit the cookie and ensure that the reference
+ * count is updated.
+ *
+ * Technically, we could instead zero-out the task's group_cookie and
+ * allow sched_core_change_group() to handle this post-fork, but
+ * inheriting here has a performance advantage, since we don't
+ * need to traverse the core_cookies RB tree and can instead grab the
+ * parent's cookie directly.
+ */
+ if (parent_cookie) {
+ bool need_stopper = false;
+ unsigned long flags;
+
+ /*
+ * cookies lock prevents task->core_cookie from changing or
+ * being freed
+ */
+ raw_spin_lock_irqsave(&sched_core_cookies_lock, flags);
+
+ if (likely(refcount_inc_not_zero(&parent_cookie->refcnt))) {
+ p->core_cookie = (unsigned long)parent_cookie;
+ } else {
+ /*
+ * Raced with put(). We'll use stop_machine to get
+ * a core_cookie
+ */
+ need_stopper = true;
+ }
+
+ raw_spin_unlock_irqrestore(&sched_core_cookies_lock, flags);
+
+ if (need_stopper)
+ stop_machine(sched_update_core_tag_stopper,
+ (void *)p, NULL);
+ }
+
+ return 0;
+}
+
+void sched_tsk_free(struct task_struct *tsk)
+{
+ struct sched_core_task_cookie *ck;
+
+ sched_core_put_cookie((struct sched_core_cookie *)tsk->core_cookie);
+
+ if (!tsk->core_task_cookie)
+ return;
+
+ ck = (struct sched_core_task_cookie *)tsk->core_task_cookie;
+ queue_work(system_wq, &ck->work);
+}
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 60a922d3f46f..8c452b8010ad 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1024,6 +1024,10 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
__PS("clock-delta", t1-t0);
}

+#ifdef CONFIG_SCHED_CORE
+ __PS("core_cookie", p->core_cookie);
+#endif
+
sched_show_numa(p, m);
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b03f261b95b3..94e07c271528 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -377,6 +377,10 @@ struct cfs_bandwidth {
struct task_group {
struct cgroup_subsys_state css;

+#ifdef CONFIG_SCHED_CORE
+ int core_tagged;
+#endif
+
#ifdef CONFIG_FAIR_GROUP_SCHED
/* schedulable entities of this group on each CPU */
struct sched_entity **se;
@@ -425,6 +429,11 @@ struct task_group {

};

+static inline struct task_group *css_tg(struct cgroup_subsys_state *css)
+{
+ return css ? container_of(css, struct task_group, css) : NULL;
+}
+
#ifdef CONFIG_FAIR_GROUP_SCHED
#define ROOT_TASK_GROUP_LOAD NICE_0_LOAD

@@ -1127,6 +1136,12 @@ static inline bool is_migration_disabled(struct task_struct *p)
DECLARE_STATIC_KEY_FALSE(__sched_core_enabled);
static inline struct cpumask *sched_group_span(struct sched_group *sg);

+enum sched_core_cookie_type {
+ sched_core_no_update = 0,
+ sched_core_task_cookie_type,
+ sched_core_group_cookie_type,
+};
+
static inline bool sched_core_enabled(struct rq *rq)
{
return static_branch_unlikely(&__sched_core_enabled) && rq->core_enabled;
@@ -1197,12 +1212,53 @@ static inline bool sched_group_cookie_match(struct rq *rq,
return false;
}

-extern void queue_core_balance(struct rq *rq);
+void sched_core_change_group(struct task_struct *p, struct task_group *new_tg);
+int sched_core_fork(struct task_struct *p, unsigned long clone_flags);
+
+static inline bool sched_core_enqueued(struct task_struct *task)
+{
+ return !RB_EMPTY_NODE(&task->core_node);
+}
+
+void queue_core_balance(struct rq *rq);
+
+void sched_core_enqueue(struct rq *rq, struct task_struct *p);
+void sched_core_dequeue(struct rq *rq, struct task_struct *p);
+void sched_core_get(void);
+void sched_core_put(void);
+
+int sched_core_share_pid(unsigned long flags, pid_t pid);
+int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2);
+
+#ifdef CONFIG_CGROUP_SCHED
+u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft);
+
+#ifdef CONFIG_SCHED_DEBUG
+u64 cpu_core_group_cookie_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft);
+#endif
+
+int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
+ u64 val);
+#endif
+
+#ifndef TIF_UNSAFE_RET
+#define TIF_UNSAFE_RET (0)
+#endif

bool cfs_prio_less(struct task_struct *a, struct task_struct *b, bool fi);

#else /* !CONFIG_SCHED_CORE */

+static inline bool sched_core_enqueued(struct task_struct *task) { return false; }
+static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
+static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
+static inline int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2)
+{
+ return -ENOTSUPP;
+}
+
static inline bool sched_core_enabled(struct rq *rq)
{
return false;
@@ -2899,7 +2955,4 @@ void swake_up_all_locked(struct swait_queue_head *q);
void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);

#ifdef CONFIG_SCHED_CORE
-#ifndef TIF_UNSAFE_RET
-#define TIF_UNSAFE_RET (0)
-#endif
#endif
diff --git a/kernel/sys.c b/kernel/sys.c
index a730c03ee607..da52a0da24ef 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2530,6 +2530,13 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,

error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
break;
+#ifdef CONFIG_SCHED_CORE
+ case PR_SCHED_CORE_SHARE:
+ if (arg4 || arg5)
+ return -EINVAL;
+ error = sched_core_share_pid(arg2, arg3);
+ break;
+#endif
default:
error = -EINVAL;
break;
diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h
index 7f0827705c9a..8fbf9d164ec4 100644
--- a/tools/include/uapi/linux/prctl.h
+++ b/tools/include/uapi/linux/prctl.h
@@ -247,4 +247,10 @@ struct prctl_mm_map {
#define PR_SET_IO_FLUSHER 57
#define PR_GET_IO_FLUSHER 58

+/* Request the scheduler to share a core */
+#define PR_SCHED_CORE_SHARE 59
+#define PR_SCHED_CORE_CLEAR 0 /* clear core_sched cookie of pid */
+#define PR_SCHED_CORE_SHARE_FROM 1 /* get core_sched cookie from pid */
+#define PR_SCHED_CORE_SHARE_TO 2 /* push core_sched cookie to pid */
+
#endif /* _LINUX_PRCTL_H */
--
2.29.2.684.gfbc64c5ab5-goog
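
For illustration, a minimal userspace sketch of driving the prctl(2) interface added above (not part of the
patch; the flag values, the argument order and the requirement that arg4/arg5 be zero follow the
PR_SCHED_CORE_SHARE case in kernel/sys.c from this diff):

#include <stdio.h>
#include <stdlib.h>
#include <sys/prctl.h>
#include <sys/types.h>

#ifndef PR_SCHED_CORE_SHARE
#define PR_SCHED_CORE_SHARE		59
#define PR_SCHED_CORE_CLEAR		0	/* clear core_sched cookie of pid */
#define PR_SCHED_CORE_SHARE_FROM	1	/* get core_sched cookie from pid */
#define PR_SCHED_CORE_SHARE_TO		2	/* push core_sched cookie to pid */
#endif

int main(int argc, char **argv)
{
	pid_t pid;

	if (argc < 2)
		return 1;
	pid = (pid_t)atoi(argv[1]);

	/* Push the caller's core-sched cookie to <pid>; both then share a core. */
	if (prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_SHARE_TO, pid, 0, 0))
		perror("PR_SCHED_CORE_SHARE_TO");

	/* Clear <pid>'s core-sched cookie again. */
	if (prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_CLEAR, pid, 0, 0))
		perror("PR_SCHED_CORE_CLEAR");

	return 0;
}

On kernels built without CONFIG_SCHED_CORE the prctl falls through to the default case and returns -EINVAL.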

2020-12-15 15:03:25

by Chris Hyser

[permalink] [raw]
Subject: Re: [PATCH -tip 23/32] sched: Add a per-thread core scheduling interface



On 12/14/20 6:25 PM, Joel Fernandes wrote:
> On Mon, Dec 14, 2020 at 02:44:09PM -0500, chris hyser wrote:
>> On 12/14/20 2:31 PM, Joel Fernandes wrote:
>>>> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
>>>> index cffdfab..50c31f3 100644
>>>> --- a/kernel/sched/debug.c
>>>> +++ b/kernel/sched/debug.c
>>>> @@ -1030,6 +1030,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
>>>> #ifdef CONFIG_SCHED_CORE
>>>> __PS("core_cookie", p->core_cookie);
>>>> + __PS("core_task_cookie", p->core_task_cookie);
>>>> #endif
>>>
>>> Hmm, so the final cookie of the task is always p->core_cookie. This is what
>>> the scheduler uses. All other fields are ingredients to derive the final
>>> cookie value.
>>>
>>> I will drop this hunk from your overall diff, but let me know if you
>>> disagree!
>>
>>
>> No problem. That was there primarily for debugging.
>
> Ok. I squashed Josh's changes into this patch and several of my fixups. So
> there'll be 3 patches:
> 1. CGroup + prctl (single patch as it is hell to split it)
> 2. Documentation
> 3. kselftests
>
> Below is the diff of #1. I still have to squash in the stop_machine removal
> and some more review changes. But other than that, please take a look and let
> me know anything that's odd. I will test further as well.

Will do. Looking at it now and just started a build.

-chrish

2020-12-15 16:29:06

by Chris Hyser

[permalink] [raw]
Subject: Re: [PATCH -tip 23/32] sched: Add a per-thread core scheduling interface

On 12/14/20 6:25 PM, Joel Fernandes wrote:
> On Mon, Dec 14, 2020 at 02:44:09PM -0500, chris hyser wrote:
>> On 12/14/20 2:31 PM, Joel Fernandes wrote:
>>>> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
>>>> index cffdfab..50c31f3 100644
>>>> --- a/kernel/sched/debug.c
>>>> +++ b/kernel/sched/debug.c
>>>> @@ -1030,6 +1030,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
>>>> #ifdef CONFIG_SCHED_CORE
>>>> __PS("core_cookie", p->core_cookie);
>>>> + __PS("core_task_cookie", p->core_task_cookie);
>>>> #endif
>>>
>>> Hmm, so the final cookie of the task is always p->core_cookie. This is what
>>> the scheduler uses. All other fields are ingredients to derive the final
>>> cookie value.
>>>
>>> I will drop this hunk from your overall diff, but let me know if you
>>> disagree!
>>
>>
>> No problem. That was there primarily for debugging.
>
> Ok. I squashed Josh's changes into this patch and several of my fixups. So
> there'll be 3 patches:
> 1. CGroup + prctl (single patch as it is hell to split it)
> 2. Documentation
> 3. kselftests
>
> Below is the diff of #1. I still have to squash in the stop_machine removal
> and some more review changes. But other than that, please take a look and let
> me know anything that's odd. I will test further as well.
>
> Also next series will only be interface as I want to see if I can get lucky
> enough to have Peter look at it before he leaves for PTO next week.
> For the other features, I will post different series as I prepare them. One
> series for interface, and another for kernel protection / migration changes.
>
> ---8<-----------------------
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index a60868165590..73baca11d743 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -688,6 +688,8 @@ struct task_struct {
> #ifdef CONFIG_SCHED_CORE
> struct rb_node core_node;
> unsigned long core_cookie;
> + unsigned long core_task_cookie;
> + unsigned long core_group_cookie;
> unsigned int core_occupation;
> #endif
>
> @@ -2081,11 +2083,15 @@ void sched_core_unsafe_enter(void);
> void sched_core_unsafe_exit(void);
> bool sched_core_wait_till_safe(unsigned long ti_check);
> bool sched_core_kernel_protected(void);
> +int sched_core_share_pid(unsigned long flags, pid_t pid);
> +void sched_tsk_free(struct task_struct *tsk);
> #else
> #define sched_core_unsafe_enter(ignore) do { } while (0)
> #define sched_core_unsafe_exit(ignore) do { } while (0)
> #define sched_core_wait_till_safe(ignore) do { } while (0)
> #define sched_core_kernel_protected(ignore) do { } while (0)
> +#define sched_core_share_pid(flags, pid) do { } while (0)
> +#define sched_tsk_free(tsk) do { } while (0)
> #endif
>
> #endif
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index c334e6a02e5f..3752006842e1 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -248,4 +248,10 @@ struct prctl_mm_map {
> #define PR_SET_IO_FLUSHER 57
> #define PR_GET_IO_FLUSHER 58
>
> +/* Request the scheduler to share a core */
> +#define PR_SCHED_CORE_SHARE 59
> +#define PR_SCHED_CORE_CLEAR 0 /* clear core_sched cookie of pid */
> +#define PR_SCHED_CORE_SHARE_FROM 1 /* get core_sched cookie from pid */
> +#define PR_SCHED_CORE_SHARE_TO 2 /* push core_sched cookie to pid */
> +
> #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 7199d359690c..5468c93829c5 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -736,6 +736,7 @@ void __put_task_struct(struct task_struct *tsk)
> exit_creds(tsk);
> delayacct_tsk_free(tsk);
> put_signal_struct(tsk->signal);
> + sched_tsk_free(tsk);
>
> if (!profile_handoff_task(tsk))
> free_task(tsk);
> diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
> index 5fc9c9b70862..c526c20adf9d 100644
> --- a/kernel/sched/Makefile
> +++ b/kernel/sched/Makefile
> @@ -36,3 +36,4 @@ obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
> obj-$(CONFIG_MEMBARRIER) += membarrier.o
> obj-$(CONFIG_CPU_ISOLATION) += isolation.o
> obj-$(CONFIG_PSI) += psi.o
> +obj-$(CONFIG_SCHED_CORE) += coretag.o
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 7f807a84cc30..80daca9c5930 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -157,7 +157,33 @@ static inline bool __sched_core_less(struct task_struct *a, struct task_struct *
> return false;
> }
>
> -static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
> +static bool sched_core_empty(struct rq *rq)
> +{
> + return RB_EMPTY_ROOT(&rq->core_tree);
> +}
> +
> +static struct task_struct *sched_core_first(struct rq *rq)
> +{
> + struct task_struct *task;
> +
> + task = container_of(rb_first(&rq->core_tree), struct task_struct, core_node);
> + return task;
> +}
> +
> +static void sched_core_flush(int cpu)
> +{
> + struct rq *rq = cpu_rq(cpu);
> + struct task_struct *task;
> +
> + while (!sched_core_empty(rq)) {
> + task = sched_core_first(rq);
> + rb_erase(&task->core_node, &rq->core_tree);
> + RB_CLEAR_NODE(&task->core_node);
> + }
> + rq->core->core_task_seq++;
> +}
> +
> +void sched_core_enqueue(struct rq *rq, struct task_struct *p)
> {
> struct rb_node *parent, **node;
> struct task_struct *node_task;
> @@ -184,14 +210,15 @@ static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
> rb_insert_color(&p->core_node, &rq->core_tree);
> }
>
> -static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
> +void sched_core_dequeue(struct rq *rq, struct task_struct *p)
> {
> rq->core->core_task_seq++;
>
> - if (!p->core_cookie)
> + if (!sched_core_enqueued(p))
> return;
>
> rb_erase(&p->core_node, &rq->core_tree);
> + RB_CLEAR_NODE(&p->core_node);
> }
>
> /*
> @@ -255,8 +282,24 @@ static int __sched_core_stopper(void *data)
> bool enabled = !!(unsigned long)data;
> int cpu;
>
> - for_each_possible_cpu(cpu)
> - cpu_rq(cpu)->core_enabled = enabled;
> + for_each_possible_cpu(cpu) {
> + struct rq *rq = cpu_rq(cpu);
> +
> + WARN_ON_ONCE(enabled == rq->core_enabled);
> +
> + if (!enabled || (enabled && cpumask_weight(cpu_smt_mask(cpu)) >= 2)) {
> + /*
> + * All active and migrating tasks will have already
> + * been removed from core queue when we clear the
> + * cgroup tags. However, dying tasks could still be
> + * left in core queue. Flush them here.
> + */
> + if (!enabled)
> + sched_core_flush(cpu);
> +
> + rq->core_enabled = enabled;
> + }
> + }
>
> return 0;
> }
> @@ -266,7 +309,11 @@ static int sched_core_count;
>
> static void __sched_core_enable(void)
> {
> - // XXX verify there are no cookie tasks (yet)
> + int cpu;
> +
> + /* verify there are no cookie tasks (yet) */
> + for_each_online_cpu(cpu)
> + BUG_ON(!sched_core_empty(cpu_rq(cpu)));
>
> static_branch_enable(&__sched_core_enabled);
> stop_machine(__sched_core_stopper, (void *)true, NULL);
> @@ -274,8 +321,6 @@ static void __sched_core_enable(void)
>
> static void __sched_core_disable(void)
> {
> - // XXX verify there are no cookie tasks (left)
> -
> stop_machine(__sched_core_stopper, (void *)false, NULL);
> static_branch_disable(&__sched_core_enabled);
> }
> @@ -295,12 +340,6 @@ void sched_core_put(void)
> __sched_core_disable();
> mutex_unlock(&sched_core_mutex);
> }
> -
> -#else /* !CONFIG_SCHED_CORE */
> -
> -static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
> -static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
> -
> #endif /* CONFIG_SCHED_CORE */
>
> /*
> @@ -3779,6 +3818,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
> p->capture_control = NULL;
> #endif
> init_numa_balancing(clone_flags, p);
> +#ifdef CONFIG_SCHED_CORE
> + p->core_task_cookie = 0;
> +#endif
> #ifdef CONFIG_SMP
> p->wake_entry.u_flags = CSD_TYPE_TTWU;
> p->migration_pending = NULL;
> @@ -3903,6 +3945,7 @@ static inline void init_schedstats(void) {}
> int sched_fork(unsigned long clone_flags, struct task_struct *p)
> {
> unsigned long flags;
> + int __maybe_unused ret;
>
> __sched_fork(clone_flags, p);
> /*
> @@ -3978,6 +4021,13 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
> #ifdef CONFIG_SMP
> plist_node_init(&p->pushable_tasks, MAX_PRIO);
> RB_CLEAR_NODE(&p->pushable_dl_tasks);
> +#endif
> +#ifdef CONFIG_SCHED_CORE
> + RB_CLEAR_NODE(&p->core_node);
> +
> + ret = sched_core_fork(p, clone_flags);
> + if (ret)
> + return ret;
> #endif
> return 0;
> }
> @@ -7979,6 +8029,9 @@ void init_idle(struct task_struct *idle, int cpu)
> #ifdef CONFIG_SMP
> sprintf(idle->comm, "%s/%d", INIT_TASK_COMM, cpu);
> #endif
> +#ifdef CONFIG_SCHED_CORE
> + RB_CLEAR_NODE(&idle->core_node);
> +#endif
> }
>
> #ifdef CONFIG_SMP
> @@ -8983,6 +9036,9 @@ void sched_offline_group(struct task_group *tg)
> spin_unlock_irqrestore(&task_group_lock, flags);
> }
>
> +void cpu_core_get_group_cookie(struct task_group *tg,
> + unsigned long *group_cookie_ptr);
> +
> static void sched_change_group(struct task_struct *tsk, int type)
> {
> struct task_group *tg;
> @@ -8995,6 +9051,11 @@ static void sched_change_group(struct task_struct *tsk, int type)
> tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
> struct task_group, css);
> tg = autogroup_task_group(tsk, tg);
> +
> +#ifdef CONFIG_SCHED_CORE
> + sched_core_change_group(tsk, tg);
> +#endif
> +
> tsk->sched_task_group = tg;
>
> #ifdef CONFIG_FAIR_GROUP_SCHED
> @@ -9047,11 +9108,6 @@ void sched_move_task(struct task_struct *tsk)
> task_rq_unlock(rq, tsk, &rf);
> }
>
> -static inline struct task_group *css_tg(struct cgroup_subsys_state *css)
> -{
> - return css ? container_of(css, struct task_group, css) : NULL;
> -}
> -
> static struct cgroup_subsys_state *
> cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
> {
> @@ -9087,6 +9143,18 @@ static int cpu_cgroup_css_online(struct cgroup_subsys_state *css)
> return 0;
> }
>
> +static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
> +{
> +#ifdef CONFIG_SCHED_CORE
> + struct task_group *tg = css_tg(css);
> +
> + if (tg->core_tagged) {
> + sched_core_put();
> + tg->core_tagged = 0;
> + }
> +#endif
> +}
> +
> static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
> {
> struct task_group *tg = css_tg(css);
> @@ -9688,6 +9756,22 @@ static struct cftype cpu_legacy_files[] = {
> .write_u64 = cpu_rt_period_write_uint,
> },
> #endif
> +#ifdef CONFIG_SCHED_CORE
> + {
> + .name = "core_tag",
> + .flags = CFTYPE_NOT_ON_ROOT,
> + .read_u64 = cpu_core_tag_read_u64,
> + .write_u64 = cpu_core_tag_write_u64,
> + },
> +#ifdef CONFIG_SCHED_DEBUG
> + /* Read the group cookie. */
> + {
> + .name = "core_group_cookie",
> + .flags = CFTYPE_NOT_ON_ROOT,
> + .read_u64 = cpu_core_group_cookie_read_u64,
> + },
> +#endif
> +#endif
> #ifdef CONFIG_UCLAMP_TASK_GROUP
> {
> .name = "uclamp.min",
> @@ -9861,6 +9945,22 @@ static struct cftype cpu_files[] = {
> .write_s64 = cpu_weight_nice_write_s64,
> },
> #endif
> +#ifdef CONFIG_SCHED_CORE
> + {
> + .name = "core_tag",
> + .flags = CFTYPE_NOT_ON_ROOT,
> + .read_u64 = cpu_core_tag_read_u64,
> + .write_u64 = cpu_core_tag_write_u64,
> + },
> +#ifdef CONFIG_SCHED_DEBUG
> + /* Read the group cookie. */
> + {
> + .name = "core_group_cookie",
> + .flags = CFTYPE_NOT_ON_ROOT,
> + .read_u64 = cpu_core_group_cookie_read_u64,
> + },
> +#endif
> +#endif
> #ifdef CONFIG_CFS_BANDWIDTH
> {
> .name = "max",
> @@ -9889,6 +9989,7 @@ static struct cftype cpu_files[] = {
> struct cgroup_subsys cpu_cgrp_subsys = {
> .css_alloc = cpu_cgroup_css_alloc,
> .css_online = cpu_cgroup_css_online,
> + .css_offline = cpu_cgroup_css_offline,
> .css_released = cpu_cgroup_css_released,
> .css_free = cpu_cgroup_css_free,
> .css_extra_stat_show = cpu_extra_stat_show,
> diff --git a/kernel/sched/coretag.c b/kernel/sched/coretag.c
> new file mode 100644
> index 000000000000..4eeb956382ee
> --- /dev/null
> +++ b/kernel/sched/coretag.c
> @@ -0,0 +1,734 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * kernel/sched/core-tag.c
> + *
> + * Core-scheduling tagging interface support.
> + *
> + * Copyright(C) 2020, Joel Fernandes.
> + * Initial interfacing code by Peter Ziljstra.
> + */
> +
> +#include <linux/prctl.h>
> +#include "sched.h"
> +
> +/*
> + * Wrapper representing a complete cookie. The address of the cookie is used as
> + * a unique identifier. Each cookie has a unique permutation of the internal
> + * cookie fields.
> + */
> +struct sched_core_cookie {
> + unsigned long task_cookie;
> + unsigned long group_cookie;
> +
> + struct rb_node node;
> + refcount_t refcnt;
> +};
> +
> +/*
> + * A simple wrapper around refcount. An allocated sched_core_task_cookie's
> + * address is used to compute the cookie of the task.
> + */
> +struct sched_core_task_cookie {
> + refcount_t refcnt;
> + struct work_struct work; /* to free in WQ context. */;
> +};
> +
> +static DEFINE_MUTEX(sched_core_tasks_mutex);
> +
> +/* All active sched_core_cookies */
> +static struct rb_root sched_core_cookies = RB_ROOT;
> +static DEFINE_RAW_SPINLOCK(sched_core_cookies_lock);
> +
> +/*
> + * Returns the following:
> + * a < b => -1
> + * a == b => 0
> + * a > b => 1
> + */
> +static int sched_core_cookie_cmp(const struct sched_core_cookie *a,
> + const struct sched_core_cookie *b)
> +{
> +#define COOKIE_CMP_RETURN(field) do { \
> + if (a->field < b->field) \
> + return -1; \
> + else if (a->field > b->field) \
> + return 1; \
> +} while (0) \
> +
> + COOKIE_CMP_RETURN(task_cookie);
> + COOKIE_CMP_RETURN(group_cookie);
> +
> + /* all cookie fields match */
> + return 0;
> +
> +#undef COOKIE_CMP_RETURN
> +}
> +
> +static inline void __sched_core_erase_cookie(struct sched_core_cookie *cookie)
> +{
> + lockdep_assert_held(&sched_core_cookies_lock);
> +
> + /* Already removed */
> + if (RB_EMPTY_NODE(&cookie->node))
> + return;
> +
> + rb_erase(&cookie->node, &sched_core_cookies);
> + RB_CLEAR_NODE(&cookie->node);
> +}
> +
> +/* Called when a task no longer points to the cookie in question */
> +static void sched_core_put_cookie(struct sched_core_cookie *cookie)
> +{
> + unsigned long flags;
> +
> + if (!cookie)
> + return;
> +
> + if (refcount_dec_and_test(&cookie->refcnt)) {
> + raw_spin_lock_irqsave(&sched_core_cookies_lock, flags);
> + __sched_core_erase_cookie(cookie);
> + raw_spin_unlock_irqrestore(&sched_core_cookies_lock, flags);
> + kfree(cookie);
> + }
> +}
> +
> +/*
> + * A task's core cookie is a compound structure composed of various cookie
> + * fields (task_cookie, group_cookie). The overall core_cookie is
> + * a pointer to a struct containing those values. This function either finds
> + * an existing core_cookie or creates a new one, and then updates the task's
> + * core_cookie to point to it. Additionally, it handles the necessary reference
> + * counting.
> + *
> + * REQUIRES: task_rq(p) lock or called from cpu_stopper.
> + * Doing so ensures that we do not cause races/corruption by modifying/reading
> + * task cookie fields.
> + */
> +static void __sched_core_update_cookie(struct task_struct *p)
> +{
> + struct rb_node *parent, **node;
> + struct sched_core_cookie *node_core_cookie, *match;
> + static const struct sched_core_cookie zero_cookie;
> + struct sched_core_cookie temp = {
> + .task_cookie = p->core_task_cookie,
> + .group_cookie = p->core_group_cookie,
> + };
> + const bool is_zero_cookie =
> + (sched_core_cookie_cmp(&temp, &zero_cookie) == 0);
> + struct sched_core_cookie *const curr_cookie =
> + (struct sched_core_cookie *)p->core_cookie;
> + unsigned long flags;
> +
> + /*
> + * Already have a cookie matching the requested settings? Nothing to
> + * do.
> + */
> + if ((curr_cookie && sched_core_cookie_cmp(curr_cookie, &temp) == 0) ||
> + (!curr_cookie && is_zero_cookie))
> + return;
> +
> + raw_spin_lock_irqsave(&sched_core_cookies_lock, flags);
> +
> + if (is_zero_cookie) {
> + match = NULL;
> + goto finish;
> + }
> +
> +retry:
> + match = NULL;
> +
> + node = &sched_core_cookies.rb_node;
> + parent = *node;
> + while (*node) {
> + int cmp;
> +
> + node_core_cookie =
> + container_of(*node, struct sched_core_cookie, node);
> + parent = *node;
> +
> + cmp = sched_core_cookie_cmp(&temp, node_core_cookie);
> + if (cmp < 0) {
> + node = &parent->rb_left;
> + } else if (cmp > 0) {
> + node = &parent->rb_right;
> + } else {
> + match = node_core_cookie;
> + break;
> + }
> + }
> +
> + if (!match) {
> + /* No existing cookie; create and insert one */
> + match = kmalloc(sizeof(struct sched_core_cookie), GFP_ATOMIC);
> +
> + /* Fall back to zero cookie */
> + if (WARN_ON_ONCE(!match))
> + goto finish;
> +
> + match->task_cookie = temp.task_cookie;
> + match->group_cookie = temp.group_cookie;
> + refcount_set(&match->refcnt, 1);
> +
> + rb_link_node(&match->node, parent, node);
> + rb_insert_color(&match->node, &sched_core_cookies);
> + } else {
> + /*
> + * Cookie exists, increment refcnt. If refcnt is currently 0,
> + * we're racing with a put() (refcnt decremented but cookie not
> + * yet removed from the tree). In this case, we can simply
> + * perform the removal ourselves and retry.
> + * sched_core_put_cookie() will still function correctly.
> + */
> + if (unlikely(!refcount_inc_not_zero(&match->refcnt))) {
> + __sched_core_erase_cookie(match);
> + goto retry;
> + }
> + }
> +
> +finish:
> + /*
> + * Set the core_cookie under the cookies lock. This guarantees that
> + * p->core_cookie cannot be freed while the cookies lock is held in
> + * sched_core_fork().
> + */
> + p->core_cookie = (unsigned long)match;
> +
> + raw_spin_unlock_irqrestore(&sched_core_cookies_lock, flags);
> +
> + sched_core_put_cookie(curr_cookie);
> +}
> +
> +/*
> + * sched_core_update_cookie - Common helper to update a task's core cookie. This
> + * updates the selected cookie field and then updates the overall cookie.
> + * @p: The task whose cookie should be updated.
> + * @cookie: The new cookie.
> + * @cookie_type: The cookie field to which the cookie corresponds.
> + *
> + * REQUIRES: either task_rq(p)->lock held or called from a stop-machine handler.
> + * Doing so ensures that we do not cause races/corruption by modifying/reading
> + * task cookie fields.
> + */
> +static void sched_core_update_cookie(struct task_struct *p, unsigned long cookie,
> + enum sched_core_cookie_type cookie_type)
> +{
> + if (!p)
> + return;
> +
> + switch (cookie_type) {
> + case sched_core_no_update:
> + break;
> + case sched_core_task_cookie_type:
> + p->core_task_cookie = cookie;
> + break;
> + case sched_core_group_cookie_type:
> + p->core_group_cookie = cookie;
> + break;
> + default:
> + WARN_ON_ONCE(1);
> + }
> +
> + /* Set p->core_cookie, which is the overall cookie */
> + __sched_core_update_cookie(p);
> +
> + if (sched_core_enqueued(p)) {
> + sched_core_dequeue(task_rq(p), p);
> + if (!p->core_cookie)
> + return;
> + }
> +
> + if (sched_core_enabled(task_rq(p)) &&
> + p->core_cookie && task_on_rq_queued(p))
> + sched_core_enqueue(task_rq(p), p);
> +}
> +
> +#ifdef CONFIG_CGROUP_SCHED
> +void cpu_core_get_group_cookie(struct task_group *tg,
> + unsigned long *group_cookie_ptr);
> +
> +void sched_core_change_group(struct task_struct *p, struct task_group *new_tg)
> +{
> + unsigned long new_group_cookie;
> +
> + cpu_core_get_group_cookie(new_tg, &new_group_cookie);
> +
> + if (p->core_group_cookie == new_group_cookie)
> + return;
> +
> + p->core_group_cookie = new_group_cookie;
> +
> + __sched_core_update_cookie(p);
> +}
> +#endif
> +
> +/* Per-task interface: Used by fork(2) and prctl(2). */
> +static void sched_core_put_cookie_work(struct work_struct *ws);
> +
> +/* Caller has to call sched_core_get() if non-zero value is returned. */
> +static unsigned long sched_core_alloc_task_cookie(void)
> +{
> + struct sched_core_task_cookie *ck =
> + kmalloc(sizeof(struct sched_core_task_cookie), GFP_KERNEL);
> +
> + if (!ck)
> + return 0;
> + refcount_set(&ck->refcnt, 1);
> + INIT_WORK(&ck->work, sched_core_put_cookie_work);
> +
> + return (unsigned long)ck;
> +}
> +
> +static void sched_core_get_task_cookie(unsigned long cookie)
> +{
> + struct sched_core_task_cookie *ptr =
> + (struct sched_core_task_cookie *)cookie;
> +
> + refcount_inc(&ptr->refcnt);
> +}
> +
> +static void sched_core_put_task_cookie(unsigned long cookie)
> +{
> + struct sched_core_task_cookie *ptr =
> + (struct sched_core_task_cookie *)cookie;
> +
> + if (refcount_dec_and_test(&ptr->refcnt))
> + kfree(ptr);
> +}
> +
> +static void sched_core_put_cookie_work(struct work_struct *ws)
> +{
> + struct sched_core_task_cookie *ck =
> + container_of(ws, struct sched_core_task_cookie, work);
> +
> + sched_core_put_task_cookie((unsigned long)ck);
> + sched_core_put();
> +}
> +
> +struct sched_core_task_write_tag {
> + struct task_struct *tasks[2];
> + unsigned long cookies[2];
> +};
> +
> +/*
> + * Ensure that the task has been requeued. The stopper ensures that the task cannot
> + * be migrated to a different CPU while its core scheduler queue state is being updated.
> + * It also makes sure to requeue a task if it was running actively on another CPU.
> + */
> +static int sched_core_task_join_stopper(void *data)
> +{
> + struct sched_core_task_write_tag *tag = (struct sched_core_task_write_tag *)data;
> + int i;
> +
> + for (i = 0; i < 2; i++)
> + sched_core_update_cookie(tag->tasks[i], tag->cookies[i],
> + sched_core_task_cookie_type);
> +
> + return 0;
> +}
> +
> +int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2)
> +{
> + struct sched_core_task_write_tag wr = {}; /* for stop machine. */
> + bool sched_core_put_after_stopper = false;
> + unsigned long cookie;
> + int ret = -ENOMEM;
> +
> + mutex_lock(&sched_core_tasks_mutex);
> +
> + if (!t2) {
> + if (t1->core_task_cookie) {
> + sched_core_put_task_cookie(t1->core_task_cookie);
> + sched_core_put_after_stopper = true;
> + wr.tasks[0] = t1; /* Keep wr.cookies[0] reset for t1. */
> + }
> + } else if (t1 == t2) {
> + /* Assign a unique per-task cookie solely for t1. */
> +
> + cookie = sched_core_alloc_task_cookie();
> + if (!cookie)
> + goto out_unlock;
> + sched_core_get();
> +
> + if (t1->core_task_cookie) {
> + sched_core_put_task_cookie(t1->core_task_cookie);
> + sched_core_put_after_stopper = true;
> + }
> + wr.tasks[0] = t1;
> + wr.cookies[0] = cookie;
> + } else if (!t1->core_task_cookie && !t2->core_task_cookie) {
> + /*
> + * t1 joining t2
> + * CASE 1:
> + * before 0 0
> + * after new cookie new cookie
> + *
> + * CASE 2:
> + * before X (non-zero) 0
> + * after 0 0
> + *
> + * CASE 3:
> + * before 0 X (non-zero)
> + * after X X
> + *
> + * CASE 4:
> + * before Y (non-zero) X (non-zero)
> + * after X X
> + */
> +
> + /* CASE 1. */
> + cookie = sched_core_alloc_task_cookie();
> + if (!cookie)
> + goto out_unlock;
> + sched_core_get(); /* For the alloc. */
> +
> + /* Add another reference for the other task. */
> + sched_core_get_task_cookie(cookie);
> + sched_core_get(); /* For the other task. */
> +
> + wr.tasks[0] = t1;
> + wr.tasks[1] = t2;
> + wr.cookies[0] = wr.cookies[1] = cookie;
> +
> + } else if (t1->core_task_cookie && !t2->core_task_cookie) {
> + /* CASE 2. */
> + sched_core_put_task_cookie(t1->core_task_cookie);
> + sched_core_put_after_stopper = true;
> +
> + wr.tasks[0] = t1; /* Reset cookie for t1. */
> +
> + } else if (!t1->core_task_cookie && t2->core_task_cookie) {
> + /* CASE 3. */
> + sched_core_get_task_cookie(t2->core_task_cookie);
> + sched_core_get();
> +
> + wr.tasks[0] = t1;
> + wr.cookies[0] = t2->core_task_cookie;
> +
> + } else {
> + /* CASE 4. */
> + sched_core_get_task_cookie(t2->core_task_cookie);
> + sched_core_get();
> +
> + sched_core_put_task_cookie(t1->core_task_cookie);
> + sched_core_put_after_stopper = true;
> +
> + wr.tasks[0] = t1;
> + wr.cookies[0] = t2->core_task_cookie;
> + }
> +
> + stop_machine(sched_core_task_join_stopper, (void *)&wr, NULL);
> +
> + if (sched_core_put_after_stopper)
> + sched_core_put();
> +
> + ret = 0;
> +out_unlock:
> + mutex_unlock(&sched_core_tasks_mutex);
> + return ret;
> +}
> +
> +/* Called from prctl interface: PR_SCHED_CORE_SHARE */
> +int sched_core_share_pid(unsigned long flags, pid_t pid)
> +{
> + struct task_struct *to;
> + struct task_struct *from;
> + struct task_struct *task;
> + int err;
> +
> + rcu_read_lock();
> + task = find_task_by_vpid(pid);
> + if (!task) {
> + rcu_read_unlock();
> + return -ESRCH;
> + }
> +
> + get_task_struct(task);
> +
> + /*
> + * Check if this process has the right to modify the specified
> + * process. Use the regular "ptrace_may_access()" checks.
> + */
> + if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
> + rcu_read_unlock();
> + err = -EPERM;
> + goto out;
> + }
> + rcu_read_unlock();
> +
> + if (flags == PR_SCHED_CORE_CLEAR) {
> + to = task;
> + from = NULL;
> + } else if (flags == PR_SCHED_CORE_SHARE_TO) {
> + to = task;
> + from = current;
> + } else if (flags == PR_SCHED_CORE_SHARE_FROM) {
> + to = current;
> + from = task;
> + } else {
> + err = -EINVAL;
> + goto out;
> + }
> +
> + err = sched_core_share_tasks(to, from);
> +out:
> + if (task)
> + put_task_struct(task);
> + return err;
> +}
> +
> +/* CGroup core-scheduling interface support. */
> +#ifdef CONFIG_CGROUP_SCHED
> +/*
> + * Helper to get the group cookie in a hierarchy. Any ancestor can have a
> + * cookie.
> + *
> + * Sets *group_cookie_ptr to the hierarchical group cookie.
> + */
> +void cpu_core_get_group_cookie(struct task_group *tg,
> + unsigned long *group_cookie_ptr)
> +{
> + unsigned long group_cookie = 0UL;
> +
> + if (!tg)
> + goto out;
> +
> + for (; tg; tg = tg->parent) {
> +
> + if (tg->core_tagged) {
> + group_cookie = (unsigned long)tg;
> + break;
> + }
> + }
> +
> +out:
> + *group_cookie_ptr = group_cookie;
> +}
> +
> +/* Determine if any group in @tg's children are tagged. */
> +static bool cpu_core_check_descendants(struct task_group *tg, bool check_tag)
> +{
> + struct task_group *child;
> +
> + rcu_read_lock();
> + list_for_each_entry_rcu(child, &tg->children, siblings) {
> + if ((child->core_tagged && check_tag)) {
> + rcu_read_unlock();
> + return true;
> + }
> +
> + rcu_read_unlock();
> + return cpu_core_check_descendants(child, check_tag);
> + }
> +
> + rcu_read_unlock();
> + return false;
> +}
> +
> +u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css,
> + struct cftype *cft)
> +{
> + struct task_group *tg = css_tg(css);
> +
> + return !!tg->core_tagged;
> +}
> +
> +#ifdef CONFIG_SCHED_DEBUG
> +u64 cpu_core_group_cookie_read_u64(struct cgroup_subsys_state *css,
> + struct cftype *cft)
> +{
> + unsigned long group_cookie;
> +
> + cpu_core_get_group_cookie(css_tg(css), &group_cookie);
> +
> + return group_cookie;
> +}
> +#endif
> +
> +struct write_core_tag {
> + struct cgroup_subsys_state *css;
> + unsigned long cookie;
> + enum sched_core_cookie_type cookie_type;
> +};
> +
> +static int __sched_write_tag(void *data)
> +{
> + struct write_core_tag *tag = (struct write_core_tag *)data;
> + struct task_struct *p;
> + struct cgroup_subsys_state *css;
> +
> + rcu_read_lock();
> + css_for_each_descendant_pre(css, tag->css) {
> + struct css_task_iter it;
> +
> + css_task_iter_start(css, 0, &it);
> + /*
> + * Note: css_task_iter_next will skip dying tasks.
> + * There could still be dying tasks left in the core queue
> + * when we set cgroup tag to 0 when the loop is done below.
> + */
> + while ((p = css_task_iter_next(&it)))
> + sched_core_update_cookie(p, tag->cookie,
> + tag->cookie_type);
> +
> + css_task_iter_end(&it);
> + }
> + rcu_read_unlock();
> +
> + return 0;
> +}
> +
> +int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
> + u64 val)
> +{
> + struct task_group *tg = css_tg(css);
> + struct write_core_tag wtag;
> + unsigned long group_cookie;
> +
> + if (val > 1)
> + return -ERANGE;
> +
> + if (!static_branch_likely(&sched_smt_present))
> + return -EINVAL;
> +
> + if (!tg->core_tagged && val) {
> + /* Tag is being set. Check ancestors and descendants. */
> + cpu_core_get_group_cookie(tg, &group_cookie);
> + if (group_cookie ||
> + cpu_core_check_descendants(tg, true /* tag */))
> + return -EBUSY;
> + } else if (tg->core_tagged && !val) {
> + /* Tag is being reset. Check descendants. */
> + if (cpu_core_check_descendants(tg, true /* tag */))
> + return -EBUSY;
> + } else {
> + return 0;
> + }
> +
> + if (!!val)
> + sched_core_get();
> +
> + wtag.css = css;
> + wtag.cookie = (unsigned long)tg;
> + wtag.cookie_type = sched_core_group_cookie_type;
> +
> + tg->core_tagged = val;
> +
> + stop_machine(__sched_write_tag, (void *)&wtag, NULL);
> + if (!val)
> + sched_core_put();
> +
> + return 0;
> +}
> +#endif
> +
> +/*
> + * Tagging support when fork(2) is called:
> + * If it is a CLONE_THREAD fork, share parent's tag. Otherwise assign a unique per-task tag.
> + */
> +static int sched_update_core_tag_stopper(void *data)
> +{
> + struct task_struct *p = (struct task_struct *)data;
> +
> + /* Recalculate core cookie */
> + sched_core_update_cookie(p, 0, sched_core_no_update);
> +
> + return 0;
> +}
> +
> +/* Called from sched_fork() */
> +int sched_core_fork(struct task_struct *p, unsigned long clone_flags)
> +{
> + struct sched_core_cookie *parent_cookie =
> + (struct sched_core_cookie *)current->core_cookie;
> +
> + /*
> + * core_cookie is ref counted; avoid an uncounted reference.
> + * If p should have a cookie, it will be set below.
> + */
> + p->core_cookie = 0UL;
> +
> + /*
> + * If parent is tagged via per-task cookie, tag the child (either with
> + * the parent's cookie, or a new one).
> + *
> + * We can return directly in this case, because sched_core_share_tasks()
> + * will set the core_cookie (so there is no need to try to inherit from
> + * the parent). The cookie will have the proper sub-fields (ie. group
> + * cookie, etc.), because these come from p's task_struct, which is
> + * dup'd from the parent.
> + */
> + if (current->core_task_cookie) {
> + int ret;
> +
> + /* If it is not CLONE_THREAD fork, assign a unique per-task tag. */
> + if (!(clone_flags & CLONE_THREAD)) {
> + ret = sched_core_share_tasks(p, p);
> + } else {
> + /* Otherwise share the parent's per-task tag. */
> + ret = sched_core_share_tasks(p, current);
> + }
> +
> + if (ret)
> + return ret;
> +
> + /*
> + * We expect sched_core_share_tasks() to always update p's
> + * core_cookie.
> + */
> + WARN_ON_ONCE(!p->core_cookie);
> +
> + return 0;
> + }
> +
> + /*
> + * If parent is tagged, inherit the cookie and ensure that the reference
> + * count is updated.
> + *
> + * Technically, we could instead zero-out the task's group_cookie and
> + * allow sched_core_change_group() to handle this post-fork, but
> + * inheriting here has a performance advantage, since we don't
> + * need to traverse the core_cookies RB tree and can instead grab the
> + * parent's cookie directly.
> + */
> + if (parent_cookie) {
> + bool need_stopper = false;
> + unsigned long flags;
> +
> + /*
> + * cookies lock prevents task->core_cookie from changing or
> + * being freed
> + */
> + raw_spin_lock_irqsave(&sched_core_cookies_lock, flags);
> +
> + if (likely(refcount_inc_not_zero(&parent_cookie->refcnt))) {
> + p->core_cookie = (unsigned long)parent_cookie;
> + } else {
> + /*
> + * Raced with put(). We'll use stop_machine to get
> + * a core_cookie
> + */
> + need_stopper = true;
> + }
> +
> + raw_spin_unlock_irqrestore(&sched_core_cookies_lock, flags);
> +
> + if (need_stopper)
> + stop_machine(sched_update_core_tag_stopper,
> + (void *)p, NULL);
> + }
> +
> + return 0;
> +}
> +
> +void sched_tsk_free(struct task_struct *tsk)
> +{
> + struct sched_core_task_cookie *ck;
> +
> + sched_core_put_cookie((struct sched_core_cookie *)tsk->core_cookie);
> +
> + if (!tsk->core_task_cookie)
> + return;
> +
> + ck = (struct sched_core_task_cookie *)tsk->core_task_cookie;
> + queue_work(system_wq, &ck->work);
> +}
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 60a922d3f46f..8c452b8010ad 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -1024,6 +1024,10 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
> __PS("clock-delta", t1-t0);
> }
>
> +#ifdef CONFIG_SCHED_CORE
> + __PS("core_cookie", p->core_cookie);
> +#endif
> +
> sched_show_numa(p, m);
> }
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index b03f261b95b3..94e07c271528 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -377,6 +377,10 @@ struct cfs_bandwidth {
> struct task_group {
> struct cgroup_subsys_state css;
>
> +#ifdef CONFIG_SCHED_CORE
> + int core_tagged;
> +#endif
> +
> #ifdef CONFIG_FAIR_GROUP_SCHED
> /* schedulable entities of this group on each CPU */
> struct sched_entity **se;
> @@ -425,6 +429,11 @@ struct task_group {
>
> };
>
> +static inline struct task_group *css_tg(struct cgroup_subsys_state *css)
> +{
> + return css ? container_of(css, struct task_group, css) : NULL;
> +}
> +
> #ifdef CONFIG_FAIR_GROUP_SCHED
> #define ROOT_TASK_GROUP_LOAD NICE_0_LOAD
>
> @@ -1127,6 +1136,12 @@ static inline bool is_migration_disabled(struct task_struct *p)
> DECLARE_STATIC_KEY_FALSE(__sched_core_enabled);
> static inline struct cpumask *sched_group_span(struct sched_group *sg);
>
> +enum sched_core_cookie_type {
> + sched_core_no_update = 0,
> + sched_core_task_cookie_type,
> + sched_core_group_cookie_type,
> +};
> +
> static inline bool sched_core_enabled(struct rq *rq)
> {
> return static_branch_unlikely(&__sched_core_enabled) && rq->core_enabled;
> @@ -1197,12 +1212,53 @@ static inline bool sched_group_cookie_match(struct rq *rq,
> return false;
> }
>
> -extern void queue_core_balance(struct rq *rq);
> +void sched_core_change_group(struct task_struct *p, struct task_group *new_tg);
> +int sched_core_fork(struct task_struct *p, unsigned long clone_flags);
> +
> +static inline bool sched_core_enqueued(struct task_struct *task)
> +{
> + return !RB_EMPTY_NODE(&task->core_node);
> +}
> +
> +void queue_core_balance(struct rq *rq);
> +
> +void sched_core_enqueue(struct rq *rq, struct task_struct *p);
> +void sched_core_dequeue(struct rq *rq, struct task_struct *p);
> +void sched_core_get(void);
> +void sched_core_put(void);
> +
> +int sched_core_share_pid(unsigned long flags, pid_t pid);
> +int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2);
> +
> +#ifdef CONFIG_CGROUP_SCHED
> +u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css,
> + struct cftype *cft);
> +
> +#ifdef CONFIG_SCHED_DEBUG
> +u64 cpu_core_group_cookie_read_u64(struct cgroup_subsys_state *css,
> + struct cftype *cft);
> +#endif
> +
> +int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
> + u64 val);
> +#endif
> +
> +#ifndef TIF_UNSAFE_RET
> +#define TIF_UNSAFE_RET (0)
> +#endif
>
> bool cfs_prio_less(struct task_struct *a, struct task_struct *b, bool fi);
>
> #else /* !CONFIG_SCHED_CORE */
>
> +static inline bool sched_core_enqueued(struct task_struct *task) { return false; }
> +static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
> +static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
> +static inline int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2)
> +{
> + return -ENOTSUPP;
> +}
> +
> static inline bool sched_core_enabled(struct rq *rq)
> {
> return false;
> @@ -2899,7 +2955,4 @@ void swake_up_all_locked(struct swait_queue_head *q);
> void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);
>
> #ifdef CONFIG_SCHED_CORE
> -#ifndef TIF_UNSAFE_RET
> -#define TIF_UNSAFE_RET (0)
> -#endif
> #endif
> diff --git a/kernel/sys.c b/kernel/sys.c
> index a730c03ee607..da52a0da24ef 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2530,6 +2530,13 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>
> error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
> break;
> +#ifdef CONFIG_SCHED_CORE
> + case PR_SCHED_CORE_SHARE:
> + if (arg4 || arg5)
> + return -EINVAL;
> + error = sched_core_share_pid(arg2, arg3);
> + break;
> +#endif
> default:
> error = -EINVAL;
> break;
> diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h
> index 7f0827705c9a..8fbf9d164ec4 100644
> --- a/tools/include/uapi/linux/prctl.h
> +++ b/tools/include/uapi/linux/prctl.h
> @@ -247,4 +247,10 @@ struct prctl_mm_map {
> #define PR_SET_IO_FLUSHER 57
> #define PR_GET_IO_FLUSHER 58
>
> +/* Request the scheduler to share a core */
> +#define PR_SCHED_CORE_SHARE 59
> +#define PR_SCHED_CORE_CLEAR 0 /* clear core_sched cookie of pid */
> +#define PR_SCHED_CORE_SHARE_FROM 1 /* get core_sched cookie from pid */
> +#define PR_SCHED_CORE_SHARE_TO 2 /* push core_sched cookie to pid */

My original patch had this:

# define PR_SCHED_CORE_CLEAR 0 /* clear core_sched cookie of pid */
# define PR_SCHED_CORE_SHARE_FROM 1 /* get core_sched cookie from pid */
# define PR_SCHED_CORE_SHARE_TO 2 /* push core_sched cookie to pid */

The space between the '#' and 'define' is apparently a trick to hide these from a script that runs during the 'perf'
build and pulls options out of this file to generate a 'beauty' array. Without the space, perf fails to build.
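
For example, roughly how that would sit in tools/include/uapi/linux/prctl.h (an illustrative sketch, not the
exact hunk; PR_SCHED_CORE_SHARE itself can presumably stay a plain #define since it is a real top-level prctl
option):

/* Request the scheduler to share a core */
#define PR_SCHED_CORE_SHARE		59
/*
 * The leading space below hides the sub-commands from the perf build
 * script that scrapes PR_* options out of this header for its 'beauty'
 * table, so perf keeps building.
 */
# define PR_SCHED_CORE_CLEAR		0	/* clear core_sched cookie of pid */
# define PR_SCHED_CORE_SHARE_FROM	1	/* get core_sched cookie from pid */
# define PR_SCHED_CORE_SHARE_TO	2	/* push core_sched cookie to pid */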

-chrish

> +
> #endif /* _LINUX_PRCTL_H */
>

2020-12-15 18:20:08

by Dhaval Giani

[permalink] [raw]
Subject: Re: [PATCH -tip 23/32] sched: Add a per-thread core scheduling interface

On 12/14/20 3:25 PM, Joel Fernandes wrote:

>> No problem. That was there primarily for debugging.
> Ok. I squashed Josh's changes into this patch and several of my fixups. So
> there'll be 3 patches:
> 1. CGroup + prctl (single patch as it is hell to split it)

Please don't do that. I am not sure we have thought the cgroup interface through
(looking at all the discussions). IMHO, it would be better for us to get the simpler
interface (prctl) right first and then, once we have learned the lessons from using it,
apply them to the cgroup interface. I think we all agree we don't want to maintain a
messy interface forever.

Dhaval