2021-04-22 12:44:02

by Peter Zijlstra

Subject: [PATCH 18/19] sched: prctl() core-scheduling interface

From: Chris Hyser <[email protected]>

This patch provides support for setting and copying core scheduling
'task cookies' between threads (PID), processes (TGID), and process
groups (PGID).

The value of core scheduling isn't that tasks don't share a core,
'nosmt' can do that. The value lies in exploiting all the sharing
opportunities that exist to recover possible lost performance and that
requires a degree of flexibility in the API.

From a security perspective (and there are others), the thread,
process and process group distinction is an existing hierarchical
categorization of tasks that reflects many of the security concerns
about 'data sharing'. For example, protecting against cache-snooping
by a thread that can just read the memory directly isn't all that
useful.

With this in mind, subcommands to CREATE/SHARE (TO/FROM) provide a
mechanism to create and share cookies. CREATE/SHARE_TO specify a
target pid with enum pidtype used to specify the scope of the targeted
tasks. For example, PIDTYPE_TGID will share the cookie with the
process and all of its threads as typically desired in a security
scenario.

API:

prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, tgtpid, pidtype, &cookie)
prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, tgtpid, pidtype, NULL)
prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, tgtpid, pidtype, NULL)
prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM, srcpid, pidtype, NULL)

where 'tgtpid/srcpid == 0' implies the current process and pidtype is
kernel enum pid_type {PIDTYPE_PID, PIDTYPE_TGID, PIDTYPE_PGID, ...}.

For return values, EINVAL, ENOMEM are what they say. ESRCH means the
tgtpid/srcpid was not found. EPERM indicates lack of PTRACE permission
access to tgtpid/srcpid. ENODEV indicates your machine lacks SMT.

[peterz: complete rewrite]
Signed-off-by: Chris Hyser <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
include/linux/sched.h | 2
include/uapi/linux/prctl.h | 8 ++
kernel/sched/core_sched.c | 114 +++++++++++++++++++++++++++++++++++++++
kernel/sys.c | 5 +
tools/include/uapi/linux/prctl.h | 8 ++
5 files changed, 137 insertions(+)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2173,6 +2173,8 @@ const struct cpumask *sched_trace_rd_spa
#ifdef CONFIG_SCHED_CORE
extern void sched_core_free(struct task_struct *tsk);
extern void sched_core_fork(struct task_struct *p);
+extern int sched_core_share_pid(unsigned int cmd, pid_t pid, enum pid_type type,
+ unsigned long uaddr);
#else
static inline void sched_core_free(struct task_struct *tsk) { }
static inline void sched_core_fork(struct task_struct *p) { }
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -255,4 +255,12 @@ struct prctl_mm_map {
# define SYSCALL_DISPATCH_FILTER_ALLOW 0
# define SYSCALL_DISPATCH_FILTER_BLOCK 1

+/* Request the scheduler to share a core */
+#define PR_SCHED_CORE 60
+# define PR_SCHED_CORE_GET 0
+# define PR_SCHED_CORE_CREATE 1 /* create unique core_sched cookie */
+# define PR_SCHED_CORE_SHARE_TO 2 /* push core_sched cookie to pid */
+# define PR_SCHED_CORE_SHARE_FROM 3 /* pull core_sched cookie to pid */
+# define PR_SCHED_CORE_MAX 4
+
#endif /* _LINUX_PRCTL_H */
--- a/kernel/sched/core_sched.c
+++ b/kernel/sched/core_sched.c
@@ -1,5 +1,6 @@
// SPDX-License-Identifier: GPL-2.0-only

+#include <linux/prctl.h>
#include "sched.h"

/*
@@ -110,3 +111,116 @@ void sched_core_free(struct task_struct
{
sched_core_put_cookie(p->core_cookie);
}
+
+static void __sched_core_set(struct task_struct *p, unsigned long cookie)
+{
+ cookie = sched_core_get_cookie(cookie);
+ cookie = sched_core_update_cookie(p, cookie);
+ sched_core_put_cookie(cookie);
+}
+
+/* Called from prctl interface: PR_SCHED_CORE */
+int sched_core_share_pid(unsigned int cmd, pid_t pid, enum pid_type type,
+ unsigned long uaddr)
+{
+ unsigned long cookie = 0, id = 0;
+ struct task_struct *task, *p;
+ struct pid *grp;
+ int err = 0;
+
+ if (!static_branch_likely(&sched_smt_present))
+ return -ENODEV;
+
+ if (type > PIDTYPE_PGID || cmd >= PR_SCHED_CORE_MAX || pid < 0 ||
+ (cmd != PR_SCHED_CORE_GET && uaddr))
+ return -EINVAL;
+
+ rcu_read_lock();
+ if (pid == 0) {
+ task = current;
+ } else {
+ task = find_task_by_vpid(pid);
+ if (!task) {
+ rcu_read_unlock();
+ return -ESRCH;
+ }
+ }
+ get_task_struct(task);
+ rcu_read_unlock();
+
+ /*
+ * Check if this process has the right to modify the specified
+ * process. Use the regular "ptrace_may_access()" checks.
+ */
+ if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
+ err = -EPERM;
+ goto out;
+ }
+
+ switch (cmd) {
+ case PR_SCHED_CORE_GET:
+ if (type != PIDTYPE_PID || uaddr & 7) {
+ err = -EINVAL;
+ goto out;
+ }
+ cookie = sched_core_clone_cookie(task);
+ if (cookie) {
+ /* XXX improve ? */
+ ptr_to_hashval((void *)cookie, &id);
+ }
+ err = put_user(id, (u64 __user *)uaddr);
+ goto out;
+
+ case PR_SCHED_CORE_CREATE:
+ cookie = sched_core_alloc_cookie();
+ if (!cookie) {
+ err = -ENOMEM;
+ goto out;
+ }
+ break;
+
+ case PR_SCHED_CORE_SHARE_TO:
+ cookie = sched_core_clone_cookie(current);
+ break;
+
+ case PR_SCHED_CORE_SHARE_FROM:
+ if (type != PIDTYPE_PID) {
+ err = -EINVAL;
+ goto out;
+ }
+ cookie = sched_core_clone_cookie(task);
+ __sched_core_set(current, cookie);
+ goto out;
+
+ default:
+ err = -EINVAL;
+ goto out;
+ };
+
+ if (type == PIDTYPE_PID) {
+ __sched_core_set(task, cookie);
+ goto out;
+ }
+
+ read_lock(&tasklist_lock);
+ grp = task_pid_type(task, type);
+
+ do_each_pid_thread(grp, type, p) {
+ if (!ptrace_may_access(p, PTRACE_MODE_READ_REALCREDS)) {
+ err = -EPERM;
+ goto out_tasklist;
+ }
+ } while_each_pid_thread(grp, type, p);
+
+ do_each_pid_thread(grp, type, p) {
+ __sched_core_set(p, cookie);
+ } while_each_pid_thread(grp, type, p);
+out_tasklist:
+ read_unlock(&tasklist_lock);
+
+out:
+ sched_core_put_cookie(cookie);
+ put_task_struct(task);
+ return err;
+}
+
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2534,6 +2534,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
error = set_syscall_user_dispatch(arg2, arg3, arg4,
(char __user *) arg5);
break;
+#ifdef CONFIG_SCHED_CORE
+ case PR_SCHED_CORE:
+ error = sched_core_share_pid(arg2, arg3, arg4, arg5);
+ break;
+#endif
default:
error = -EINVAL;
break;
--- a/tools/include/uapi/linux/prctl.h
+++ b/tools/include/uapi/linux/prctl.h
@@ -255,4 +255,12 @@ struct prctl_mm_map {
# define SYSCALL_DISPATCH_FILTER_ALLOW 0
# define SYSCALL_DISPATCH_FILTER_BLOCK 1

+/* Request the scheduler to share a core */
+#define PR_SCHED_CORE 60
+# define PR_SCHED_CORE_GET 0
+# define PR_SCHED_CORE_CREATE 1 /* create unique core_sched cookie */
+# define PR_SCHED_CORE_SHARE_TO 2 /* push core_sched cookie to pid */
+# define PR_SCHED_CORE_SHARE_FROM 3 /* pull core_sched cookie to pid */
+# define PR_SCHED_CORE_MAX 4
+
#endif /* _LINUX_PRCTL_H */



Subject: [tip: sched/core] sched: prctl() core-scheduling interface

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 7ac592aa35a684ff1858fb9ec282886b9e3575ac
Gitweb: https://git.kernel.org/tip/7ac592aa35a684ff1858fb9ec282886b9e3575ac
Author: Chris Hyser <[email protected]>
AuthorDate: Wed, 24 Mar 2021 17:40:15 -04:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Wed, 12 May 2021 11:43:31 +02:00

sched: prctl() core-scheduling interface

This patch provides support for setting and copying core scheduling
'task cookies' between threads (PID), processes (TGID), and process
groups (PGID).

The value of core scheduling isn't that tasks don't share a core,
'nosmt' can do that. The value lies in exploiting all the sharing
opportunities that exist to recover possible lost performance and that
requires a degree of flexibility in the API.

From a security perspective (and there are others), the thread,
process and process group distinction is an existing hierarchical
categorization of tasks that reflects many of the security concerns
about 'data sharing'. For example, protecting against cache-snooping
by a thread that can just read the memory directly isn't all that
useful.

With this in mind, subcommands to CREATE/SHARE (TO/FROM) provide a
mechanism to create and share cookies. CREATE/SHARE_TO specify a
target pid with enum pidtype used to specify the scope of the targeted
tasks. For example, PIDTYPE_TGID will share the cookie with the
process and all of its threads as typically desired in a security
scenario.

API:

prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, tgtpid, pidtype, &cookie)
prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, tgtpid, pidtype, NULL)
prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, tgtpid, pidtype, NULL)
prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM, srcpid, pidtype, NULL)

where 'tgtpid/srcpid == 0' implies the current process and pidtype is
kernel enum pid_type {PIDTYPE_PID, PIDTYPE_TGID, PIDTYPE_PGID, ...}.

For return values, EINVAL, ENOMEM are what they say. ESRCH means the
tgtpid/srcpid was not found. EPERM indicates lack of PTRACE permission
access to tgtpid/srcpid. ENODEV indicates your machine lacks SMT.

[peterz: complete rewrite]
Signed-off-by: Chris Hyser <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Don Hiatt <[email protected]>
Tested-by: Hongyu Ning <[email protected]>
Tested-by: Vincent Guittot <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
include/linux/sched.h | 2 +-
include/uapi/linux/prctl.h | 8 ++-
kernel/sched/core_sched.c | 114 ++++++++++++++++++++++++++++++-
kernel/sys.c | 5 +-
tools/include/uapi/linux/prctl.h | 8 ++-
5 files changed, 137 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index fba47e5..c7e7d50 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2182,6 +2182,8 @@ const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
#ifdef CONFIG_SCHED_CORE
extern void sched_core_free(struct task_struct *tsk);
extern void sched_core_fork(struct task_struct *p);
+extern int sched_core_share_pid(unsigned int cmd, pid_t pid, enum pid_type type,
+ unsigned long uaddr);
#else
static inline void sched_core_free(struct task_struct *tsk) { }
static inline void sched_core_fork(struct task_struct *p) { }
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 18a9f59..967d9c5 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -259,4 +259,12 @@ struct prctl_mm_map {
#define PR_PAC_SET_ENABLED_KEYS 60
#define PR_PAC_GET_ENABLED_KEYS 61

+/* Request the scheduler to share a core */
+#define PR_SCHED_CORE 62
+# define PR_SCHED_CORE_GET 0
+# define PR_SCHED_CORE_CREATE 1 /* create unique core_sched cookie */
+# define PR_SCHED_CORE_SHARE_TO 2 /* push core_sched cookie to pid */
+# define PR_SCHED_CORE_SHARE_FROM 3 /* pull core_sched cookie to pid */
+# define PR_SCHED_CORE_MAX 4
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sched/core_sched.c b/kernel/sched/core_sched.c
index dcbbeae..9a80e9a 100644
--- a/kernel/sched/core_sched.c
+++ b/kernel/sched/core_sched.c
@@ -1,5 +1,6 @@
// SPDX-License-Identifier: GPL-2.0-only

+#include <linux/prctl.h>
#include "sched.h"

/*
@@ -113,3 +114,116 @@ void sched_core_free(struct task_struct *p)
{
sched_core_put_cookie(p->core_cookie);
}
+
+static void __sched_core_set(struct task_struct *p, unsigned long cookie)
+{
+ cookie = sched_core_get_cookie(cookie);
+ cookie = sched_core_update_cookie(p, cookie);
+ sched_core_put_cookie(cookie);
+}
+
+/* Called from prctl interface: PR_SCHED_CORE */
+int sched_core_share_pid(unsigned int cmd, pid_t pid, enum pid_type type,
+ unsigned long uaddr)
+{
+ unsigned long cookie = 0, id = 0;
+ struct task_struct *task, *p;
+ struct pid *grp;
+ int err = 0;
+
+ if (!static_branch_likely(&sched_smt_present))
+ return -ENODEV;
+
+ if (type > PIDTYPE_PGID || cmd >= PR_SCHED_CORE_MAX || pid < 0 ||
+ (cmd != PR_SCHED_CORE_GET && uaddr))
+ return -EINVAL;
+
+ rcu_read_lock();
+ if (pid == 0) {
+ task = current;
+ } else {
+ task = find_task_by_vpid(pid);
+ if (!task) {
+ rcu_read_unlock();
+ return -ESRCH;
+ }
+ }
+ get_task_struct(task);
+ rcu_read_unlock();
+
+ /*
+ * Check if this process has the right to modify the specified
+ * process. Use the regular "ptrace_may_access()" checks.
+ */
+ if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
+ err = -EPERM;
+ goto out;
+ }
+
+ switch (cmd) {
+ case PR_SCHED_CORE_GET:
+ if (type != PIDTYPE_PID || uaddr & 7) {
+ err = -EINVAL;
+ goto out;
+ }
+ cookie = sched_core_clone_cookie(task);
+ if (cookie) {
+ /* XXX improve ? */
+ ptr_to_hashval((void *)cookie, &id);
+ }
+ err = put_user(id, (u64 __user *)uaddr);
+ goto out;
+
+ case PR_SCHED_CORE_CREATE:
+ cookie = sched_core_alloc_cookie();
+ if (!cookie) {
+ err = -ENOMEM;
+ goto out;
+ }
+ break;
+
+ case PR_SCHED_CORE_SHARE_TO:
+ cookie = sched_core_clone_cookie(current);
+ break;
+
+ case PR_SCHED_CORE_SHARE_FROM:
+ if (type != PIDTYPE_PID) {
+ err = -EINVAL;
+ goto out;
+ }
+ cookie = sched_core_clone_cookie(task);
+ __sched_core_set(current, cookie);
+ goto out;
+
+ default:
+ err = -EINVAL;
+ goto out;
+ };
+
+ if (type == PIDTYPE_PID) {
+ __sched_core_set(task, cookie);
+ goto out;
+ }
+
+ read_lock(&tasklist_lock);
+ grp = task_pid_type(task, type);
+
+ do_each_pid_thread(grp, type, p) {
+ if (!ptrace_may_access(p, PTRACE_MODE_READ_REALCREDS)) {
+ err = -EPERM;
+ goto out_tasklist;
+ }
+ } while_each_pid_thread(grp, type, p);
+
+ do_each_pid_thread(grp, type, p) {
+ __sched_core_set(p, cookie);
+ } while_each_pid_thread(grp, type, p);
+out_tasklist:
+ read_unlock(&tasklist_lock);
+
+out:
+ sched_core_put_cookie(cookie);
+ put_task_struct(task);
+ return err;
+}
+
diff --git a/kernel/sys.c b/kernel/sys.c
index 3a583a2..9de46a4 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2550,6 +2550,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
error = set_syscall_user_dispatch(arg2, arg3, arg4,
(char __user *) arg5);
break;
+#ifdef CONFIG_SCHED_CORE
+ case PR_SCHED_CORE:
+ error = sched_core_share_pid(arg2, arg3, arg4, arg5);
+ break;
+#endif
default:
error = -EINVAL;
break;
diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h
index 18a9f59..967d9c5 100644
--- a/tools/include/uapi/linux/prctl.h
+++ b/tools/include/uapi/linux/prctl.h
@@ -259,4 +259,12 @@ struct prctl_mm_map {
#define PR_PAC_SET_ENABLED_KEYS 60
#define PR_PAC_GET_ENABLED_KEYS 61

+/* Request the scheduler to share a core */
+#define PR_SCHED_CORE 62
+# define PR_SCHED_CORE_GET 0
+# define PR_SCHED_CORE_CREATE 1 /* create unique core_sched cookie */
+# define PR_SCHED_CORE_SHARE_TO 2 /* push core_sched cookie to pid */
+# define PR_SCHED_CORE_SHARE_FROM 3 /* pull core_sched cookie to pid */
+# define PR_SCHED_CORE_MAX 4
+
#endif /* _LINUX_PRCTL_H */
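
For illustration, a minimal userspace sketch of the API described in the
changelog above. It assumes the merged value PR_SCHED_CORE == 62 when the
installed headers do not define it. The names cs_scope, cs_explain, cs_create
and cs_get are hypothetical helpers, not part of the patch; the scope values
simply mirror kernel enum pid_type, which this patch does not export to
userspace.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/types.h>

/* Values taken from the merged patch; older userspace headers may lack them. */
#ifndef PR_SCHED_CORE
# define PR_SCHED_CORE            62
# define PR_SCHED_CORE_GET        0
# define PR_SCHED_CORE_CREATE     1
# define PR_SCHED_CORE_SHARE_TO   2
# define PR_SCHED_CORE_SHARE_FROM 3
#endif

/* Hypothetical local names mirroring kernel enum pid_type (PID, TGID, PGID). */
enum cs_scope { CS_PID = 0, CS_TGID = 1, CS_PGID = 2 };

/* Map the errno values listed in the changelog to a short explanation. */
const char *cs_explain(int err)
{
	switch (err) {
	case 0:      return "ok";
	case ESRCH:  return "tgtpid/srcpid not found";
	case EPERM:  return "no ptrace access to tgtpid/srcpid";
	case ENODEV: return "machine lacks SMT";
	default:     return strerror(err); /* EINVAL, ENOMEM: what they say */
	}
}

/*
 * Create a fresh cookie for pid (0 means the caller) at the given scope.
 * prctl() is variadic, so every argument is widened to unsigned long.
 * Returns 0 on success or the errno value on failure.
 */
int cs_create(pid_t pid, enum cs_scope scope)
{
	if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE,
		  (unsigned long)pid, (unsigned long)scope, 0UL) == -1)
		return errno;
	return 0;
}

/*
 * Read the (hashed) cookie id of a single task. GET only accepts the
 * PID scope and requires an 8-byte aligned destination.
 */
int cs_get(pid_t pid, uint64_t *id)
{
	if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET,
		  (unsigned long)pid, (unsigned long)CS_PID,
		  (unsigned long)id) == -1)
		return errno;
	return 0;
}
```

A caller would typically do cs_create(0, CS_TGID) to give the whole process
one cookie, and print cs_explain() of the result when it is non-zero. A
cookie id of 0 from cs_get() means the task has no cookie.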

2021-06-14 23:43:41

by Josh Don

Subject: Re: [PATCH 18/19] sched: prctl() core-scheduling interface

On Thu, Apr 22, 2021 at 5:36 AM Peter Zijlstra <[email protected]> wrote:
>
> From: Chris Hyser <[email protected]>
>
> This patch provides support for setting and copying core scheduling
> 'task cookies' between threads (PID), processes (TGID), and process
> groups (PGID).

[snip]

Internally, we have lots of trusted processes that don't have a
security need for coresched cookies. However, these processes could
still decide to create cookies for themselves, which will degrade
machine capacity and performance for other jobs on the machine.

Any thoughts on whether it would be desirable to have the ability to
restrict use of SCHED_CORE_CREATE? Perhaps a new SCHED_CORE capability
would be appropriate?

- Josh

2021-06-15 11:34:04

by Joel Fernandes

Subject: Re: [PATCH 18/19] sched: prctl() core-scheduling interface

On Mon, Jun 14, 2021 at 7:36 PM Josh Don <[email protected]> wrote:
>
> On Thu, Apr 22, 2021 at 5:36 AM Peter Zijlstra <[email protected]> wrote:
> >
> > From: Chris Hyser <[email protected]>
> >
> > This patch provides support for setting and copying core scheduling
> > 'task cookies' between threads (PID), processes (TGID), and process
> > groups (PGID).
>
> [snip]
>
> Internally, we have lots of trusted processes that don't have a
> security need for coresched cookies. However, these processes could
> still decide to create cookies for themselves, which will degrade
> machine capacity and performance for other jobs on the machine.
>
> Any thoughts on whether it would be desirable to have the ability to
> restrict use of SCHED_CORE_CREATE? Perhaps a new SCHED_CORE capability
> would be appropriate?

Hi,
Maybe a capability would not work, because then other users who don't
care about the issue you mention would also be required to manage/assign
the capability?

How about you use seccomp to filter the prctl based on the PID, and
CREATE command?

-Joel
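
A rough sketch of what such a seccomp filter could look like. Assumptions:
a little-endian 64-bit machine (so the low 32 bits of each argument sit at
the start of its args[] slot), the merged value PR_SCHED_CORE == 62 where
the headers lack it, and no seccomp_data.arch check, which a real filter
needs. deny_sched_core_create() is a hypothetical name. This version matches
only on the CREATE command; also matching on the pid (third prctl argument)
would take one more load and compare.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

#ifndef PR_SCHED_CORE
# define PR_SCHED_CORE        62
# define PR_SCHED_CORE_CREATE 1
#endif

/*
 * Install a filter that makes prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, ...)
 * fail with EPERM and allows every other syscall.
 * Returns 0 on success or the errno value on failure.
 */
int deny_sched_core_create(void)
{
	struct sock_filter filter[] = {
		/* [0] load the syscall number */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		/* [1] not prctl: jump to ALLOW at [7] */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_prctl, 0, 5),
		/* [2] load low word of the first argument (option) */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, args[0])),
		/* [3] not PR_SCHED_CORE: jump to ALLOW */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, PR_SCHED_CORE, 0, 3),
		/* [4] load low word of the second argument (subcommand) */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, args[1])),
		/* [5] not CREATE: jump to ALLOW */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, PR_SCHED_CORE_CREATE, 0, 1),
		/* [6] deny with EPERM */
		BPF_STMT(BPF_RET | BPF_K,
			 SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
		/* [7] allow everything else */
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
		.filter = filter,
	};

	/* Needed so an unprivileged task may install a filter. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1UL, 0UL, 0UL, 0UL) == -1)
		return errno;
	if (prctl(PR_SET_SECCOMP, (unsigned long)SECCOMP_MODE_FILTER,
		  (unsigned long)&prog) == -1)
		return errno;
	return 0;
}
```

The filter is inherited across fork() and execve(), so a job launcher could
install it before starting a trusted workload. GET, SHARE_TO and SHARE_FROM
stay usable because only the CREATE subcommand is matched.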

2021-08-05 23:15:31

by Peter Zijlstra

Subject: Re: [PATCH 18/19] sched: prctl() core-scheduling interface

On Thu, Aug 05, 2021 at 06:53:19PM +0200, Eugene Syromiatnikov wrote:
> On Thu, Apr 22, 2021 at 02:05:17PM +0200, Peter Zijlstra wrote:
> > API:
> >
> > prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, tgtpid, pidtype, &cookie)
> > prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, tgtpid, pidtype, NULL)
> > prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, tgtpid, pidtype, NULL)
> > prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM, srcpid, pidtype, NULL)
> >
> > where 'tgtpid/srcpid == 0' implies the current process and pidtype is
> > kernel enum pid_type {PIDTYPE_PID, PIDTYPE_TGID, PIDTYPE_PGID, ...}.
>
> It means that enum pid_type has to be a part of UAPI now. Would you
> like to address it, or should I send a patch?

Please send a patch; I'm more sparse than usual atm.

2021-08-05 23:57:15

by Eugene Syromiatnikov

Subject: Re: [PATCH 18/19] sched: prctl() core-scheduling interface

On Thu, Apr 22, 2021 at 02:05:17PM +0200, Peter Zijlstra wrote:
> API:
>
> prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, tgtpid, pidtype, &cookie)
> prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, tgtpid, pidtype, NULL)
> prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, tgtpid, pidtype, NULL)
> prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM, srcpid, pidtype, NULL)
>
> where 'tgtpid/srcpid == 0' implies the current process and pidtype is
> kernel enum pid_type {PIDTYPE_PID, PIDTYPE_TGID, PIDTYPE_PGID, ...}.

It means that enum pid_type has to be a part of UAPI now. Would you
like to address it, or should I send a patch?

2021-08-17 15:19:12

by Eugene Syromiatnikov

Subject: Re: [PATCH 18/19] sched: prctl() core-scheduling interface

On Thu, Apr 22, 2021 at 02:05:17PM +0200, Peter Zijlstra wrote:
> From: Chris Hyser <[email protected]>
>
> This patch provides support for setting and copying core scheduling
> 'task cookies' between threads (PID), processes (TGID), and process
> groups (PGID).

Hello.

It seems that there is some issue within the scheduler code
that can be triggered via this interface:

# gcc -std=gnu99 -Wextra -Werror prctl-sched-core-oops-repro.c -o prctl-sched-core-oops-repro
# ../src/strace -fvq -eprctl,clone,setsid ./prctl-sched-core-oops-repro
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f271f875890) = 239820
[pid 239820] setsid() = 239820
[pid 239820] +++ exited with 0 +++
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=239820, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
Iteration 0 status: 0
prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 239816, 0x2 /* PIDTYPE_PGID */, NULL) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f271f875890) = 239821
[pid 239821] setsid() = ?
[pid 239821] +++ killed by SIGKILL +++
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=239821, si_uid=0, si_status=SIGKILL, si_utime=0, si_stime=0} ---
Iteration 1 status: 0x9
prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 239816, 0x2 /* PIDTYPE_PGID */, NULL) = 0
+++ exited with 0 +++

kmsg indicates that a NULL pointer dereference has occurred:

[76195.611570] BUG: kernel NULL pointer dereference, address: 0000000000000000
...
[76195.621771] RIP: 0010:do_raw_spin_trylock+0x5/0x40
...
[76195.640144] Call Trace:
[76195.640706] _raw_spin_lock_nested+0x37/0x80
[76195.641645] ? raw_spin_rq_lock_nested+0x4b/0x80
[76195.642693] raw_spin_rq_lock_nested+0x4b/0x80
[76195.643669] online_fair_sched_group+0x39/0x240
[76195.644663] sched_autogroup_create_attach+0x9d/0x170
[76195.645765] ksys_setsid+0xe6/0x110
[76195.646533] __do_sys_setsid+0xa/0x10
[76195.647358] do_syscall_64+0x3b/0x90
[76195.648219] entry_SYSCALL_64_after_hwframe+0x44/0xae

The full kmsg excerpt and the reproducer code are attached.

An additional "BUG: sleeping function called from invalid
context at include/linux/percpu-rwsem.h:49" message is also produced (see the
full log in the attached file "prctl-sched-core-oops-bug-dmesg.log")
when the full test case[1] is run, but so far I haven't been able
to produce a minimal reproducer for it.

[1] https://github.com/strace/strace/commit/a90a5a56d2b76ba3ebd417472a02f40d3d6599d8
Run with `./bootstrap && ./configure CFLAGS='-g -Og' --enable-gcc-Werror &&
make check TESTS=prctl-sched-core--pidns-translation.gen`


Attachments:
(No filename) (2.81 kB)
prctl-sched-core-oops-repro.c (474.00 B)
prctl-sched-core-oops-dmesg.log (4.19 kB)
prctl-sched-core-oops-bug-dmesg.log (6.03 kB)

2021-08-17 15:57:47

by Peter Zijlstra

Subject: Re: [PATCH 18/19] sched: prctl() core-scheduling interface

On Tue, Aug 17, 2021 at 05:15:42PM +0200, Eugene Syromiatnikov wrote:
> [76195.611570] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [76195.613059] #PF: supervisor read access in kernel mode
> [76195.614174] #PF: error_code(0x0000) - not-present page
> [76195.615329] PGD 800000005f27e067 P4D 800000005f27e067 PUD 3f7a3067 PMD 0
> [76195.616801] Oops: 0000 [#67] SMP PTI
> [76195.617586] CPU: 2 PID: 239821 Comm: prctl-sched-cor Tainted: G D W --------- --- 5.14.0-0.rc5.20210813gitf8e6dfc64f61.46.fc36.x86_64 #1
> [76195.620374] Hardware name: HP ProLiant BL480c G1, BIOS I14 10/04/2007
> [76195.621771] RIP: 0010:do_raw_spin_trylock+0x5/0x40
> [76195.622821] Code: c6 a4 12 5f 9f 48 89 ef e8 c8 fe ff ff eb a9 89 c6 48 89 ef e8 0c f5 ff ff 66 90 eb a9 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <8b> 07 85 c0 75 28 ba 01 00 00 00 f0 0f b1 17 75 1d 65 8b 05 fb 98
> [76195.626797] RSP: 0018:ffffa366014abe58 EFLAGS: 00010086
> [76195.627936] RAX: 0000000000000001 RBX: 0000000000000004 RCX: 0000000000000000
> [76195.629470] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> [76195.631048] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
> [76195.632585] R10: 0000000000000000 R11: ffff98292b21ad48 R12: 0000000000000018
> [76195.634078] R13: 0000000000000000 R14: ffff98292b7ef940 R15: ffff982813938e00
> [76195.635621] FS: 00007f271f8755c0(0000) GS:ffff98292b200000(0000) knlGS:0000000000000000
> [76195.637354] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [76195.638606] CR2: 0000000000000000 CR3: 000000000fdd0000 CR4: 00000000000006e0
> [76195.640144] Call Trace:
> [76195.640706] _raw_spin_lock_nested+0x37/0x80
> [76195.641645] ? raw_spin_rq_lock_nested+0x4b/0x80
> [76195.642693] raw_spin_rq_lock_nested+0x4b/0x80
> [76195.643669] online_fair_sched_group+0x39/0x240

Urgh... lemme guess, your HP BIOS is funny and reports more possible
CPUs than you actually have resulting in cpu_possible_mask !=
cpu_online_mask. Alternatively, you booted with nr_cpus= or something
daft like that.

That code does for_each_possible_cpu(i) { rq_lock_irq(cpu_rq(i)); },
which, because of core-sched, needs rq->core set-up, but because these
CPUs have never been online, that's not done and *BOOM*.

Or something like that.. I'll try and have a look tomorrow, I'm in dire
need of sleep.

2021-08-17 23:20:04

by Eugene Syromiatnikov

Subject: Re: [PATCH 18/19] sched: prctl() core-scheduling interface

On Tue, Aug 17, 2021 at 05:52:43PM +0200, Peter Zijlstra wrote:
> Urgh... lemme guess, your HP BIOS is funny and reports more possible
> CPUs than you actually have resulting in cpu_possible_mask !=
> cpu_online_mask. Alternatively, you booted with nr_cpus= or something
> daft like that.

Yep, it seems to be the case:

# cat /sys/devices/system/cpu/possible
0-7
# cat /sys/devices/system/cpu/online
0-3

>
> That code does for_each_possible_cpu(i) { rq_lock_irq(cpu_rq(i)); },
> which, because of core-sched, needs rq->core set-up, but because these
> CPUs have never been online, that's not done and *BOOM*.
>
> Or something like that.. I'll try and have a look tomorrow, I'm in dire
> need of sleep.
>
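
The possible/online mismatch shown above can also be checked
programmatically. A small sketch, assuming the usual sysfs list format
("0-7", "0,2-5") and the two paths quoted in this message; count_cpus()
and count_cpus_in() are hypothetical helper names.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Count the CPUs named in a sysfs CPU list such as "0-3" or "0,2-5".
 * A trailing newline is accepted. Returns -1 on a parse error.
 */
int count_cpus(const char *list)
{
	const char *p = list;
	int total = 0;

	while (*p && *p != '\n') {
		char *end;
		long lo = strtol(p, &end, 10);
		long hi = lo;

		if (end == p)
			return -1;	/* no digits where a CPU number was expected */
		if (*end == '-') {	/* range "lo-hi" */
			p = end + 1;
			hi = strtol(p, &end, 10);
			if (end == p)
				return -1;
		}
		if (lo < 0 || hi < lo)
			return -1;
		total += (int)(hi - lo + 1);
		p = end;
		if (*p == ',')
			p++;
	}
	return total;
}

/* Read the first line of a sysfs CPU list file and count its CPUs. */
int count_cpus_in(const char *path)
{
	char buf[256];
	FILE *f = fopen(path, "r");
	int n;

	if (!f)
		return -1;
	n = fgets(buf, sizeof(buf), f) ? count_cpus(buf) : -1;
	fclose(f);
	return n;
}
```

On the machine from this report, count_cpus_in() on
/sys/devices/system/cpu/possible would yield 8 and on
/sys/devices/system/cpu/online would yield 4, i.e. four possible CPUs have
never been online.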