2021-10-09 15:15:00

by Pratik R. Sampat

Subject: [RFC 0/5] kernel: Introduce CPU Namespace

An early prototype to demonstrate the CPU namespace interface and its
mechanism.

The kernel provides two ways to control CPU resources for tasks:
1. cgroup cpuset:
A control mechanism to restrict CPUs to a task or a
set of tasks attached to that group
2. syscall sched_setaffinity:
A system call that can pin tasks to a set of CPUs (a minimal pinning
sketch follows this list)
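
For illustration, pinning via sched_setaffinity() can be as small as the
sketch below (illustrative only; error handling omitted):

  #define _GNU_SOURCE
  #include <sched.h>

  int main(void)
  {
          cpu_set_t set;

          CPU_ZERO(&set);
          CPU_SET(0, &set);       /* pin the calling task to CPUs 0-1 */
          CPU_SET(1, &set);
          return sched_setaffinity(0, sizeof(set), &set);
  }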

The kernel also provides three ways to view the CPU resources available
to the system:
1. sys/procfs:
CPU system information is divulged through sysfs and procfs; these
expose the online, offline and present CPUs as well as per-CPU load
characteristics
2. syscall sched_getaffinity:
A system call interface to get the cpuset affinity of tasks (a minimal
query sketch follows this list)
3. cgroup cpuset:
While cgroup is more of a control mechanism than a display mechanism,
it can be read to retrieve the CPU restrictions applied to a group
of tasks
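
A task can query its own view of the CPUs with a sketch like the
following (illustrative only; error handling omitted):

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>

  int main(void)
  {
          cpu_set_t set;

          /* affinity mask of the calling task */
          if (sched_getaffinity(0, sizeof(set), &set) == 0)
                  printf("CPUs in affinity mask: %d\n", CPU_COUNT(&set));
          return 0;
  }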

Coherency of information
------------------------
The control and display interfaces are fairly disjoint from each
other. Restrictions can be set through control interfaces like cgroups,
while many applications, legacy or otherwise, get their view of the
system through sysfs/procfs and size resources such as the number of
threads/processes and memory allocations based on that information.

This can lead to unexpected runtime behavior as well as a high impact
on performance.

Existing solutions to the problem include userspace tools like LXCFS,
which can fake the sysfs information by mounting over the sysfs online
file so that it is coherent with the limits set through cgroup cpuset.
However, LXCFS is an external solution and needs to be explicitly set
up for applications that require it. Another concern is that tools
like LXCFS don't handle all the other display mechanisms, such as
procfs load stats.

Therefore, the need for a clean interface can be advocated for.

Security and fair use implications
----------------------------------
In a multi-tenant system, multiple containers may coexist, and exposing
information about the entire system, rather than just the resources
restricted to each container, has security and fair use implications
such as:
1. An actor with knowledge of the CPU node topology can schedule
workloads and select CPUs such that the bus is flooded, causing a
Denial of Service attack
2. Identifying the CPU system topology can help identify cores that are
close to buses and peripherals such as GPUs, to gain an undue latency
advantage over the rest of the workloads

A survey RFD that discusses other potential solutions and their
concerns is available here: https://lkml.org/lkml/2021/7/22/204

This prototype patchset introduces a new kernel namespace mechanism --
CPU namespace.

The CPU namespace isolates CPU information by virtualizing logical CPU
IDs and creating a scrambled virtual CPU map from them.
It latches onto the task_struct, and the CPU translations are designed
as a flat hierarchy: every virtual namespace CPU maps directly to a
physical CPU at the creation of the namespace. The advantage of a
flat hierarchy is that translations are O(1) and children do not need
to traverse up the tree to retrieve a translation (a minimal sketch of
such a lookup follows).
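
Conceptually, the translation reduces to two flat per-namespace arrays,
mirroring the p_v_trans_map/v_p_trans_map members added in patch 1 (the
names below are illustrative only, not the exact patch code):

  /* Sketch of the flat O(1) translation */
  struct cpu_ns_map {
          int p_v[NR_CPUS];       /* physical -> virtual CPU id */
          int v_p[NR_CPUS];       /* virtual  -> physical CPU id */
  };

  static inline int vcpu_to_pcpu(const struct cpu_ns_map *m, int vcpu)
  {
          return m->v_p[vcpu];    /* single array lookup, no tree walk */
  }

  static inline int pcpu_to_vcpu(const struct cpu_ns_map *m, int pcpu)
  {
          return m->p_v[pcpu];    /* single array lookup, no tree walk */
  }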

This namespace then allows both the control and display interfaces to
be CPU namespace context aware, such that a task within a namespace
only gets the view of, and therefore control over, the CPU resources
available to it via its virtual CPU map.

Experiment
----------
We designed an experiment to benchmark nginx configured with
"worker_processes: auto" (which ensures that the number of processes to
spawn is derived from the resources viewed on the system), driven by
the benchmark application wrk.

Nginx: Nginx is a web server that can also be used as a reverse proxy,
load balancer, mail proxy and HTTP cache
Wrk: wrk is a modern HTTP benchmarking tool capable of generating
significant load when run on a single multi-core CPU

Docker is used as the containerization platform of choice.

The numbers were gathered on an IBM POWER9 CPU @ 2.979GHz with 176 CPUs
and 127GB memory.
kernel: 5.14

Case1: vanilla kernel - cpuset 4 cpus, no optimization
Case2: CPU namespace kernel - cpuset 4 cpus


+-----------------------+----------+----------+-----------------+
| Metric | Case1 | Case2 | case2 vs case 1 |
+-----------------------+----------+----------+-----------------+
| PIDs | 177 | 5 | 172 PIDs |
| mem usage (init) (MB) | 272.8 | 11.12 | 95.92% |
| mem usage (peak) (MB) | 281.3 | 20.62 | 92.66% |
| Latency (avg ms) | 70.91 | 25.36 | 64.23% |
| Requests/sec | 47011.05 | 47080.98 | 0.14% |
| Transfer/sec (MB) | 38.11 | 38.16 | 0.13% |
+-----------------------+----------+----------+-----------------+

With the CPU namespace we see the correct number of PIDs spawned,
corresponding to the cpuset limits set. Memory utilization drops by
92-95%, latency reduces by 64%, and throughput in terms of requests
and transfer per second is unchanged.

Note: To utilize this new namespace in a container runtime like Docker,
the CPU namespace clone flag was modified to coincide with the PID
namespace flag, as PID namespaces are a building block of containers
and will always be invoked.

Current shortcomings in the prototype:
--------------------------------------
1. Containers also frequently use CFS period and quota to restrict CPU
runtime, also known as millicores in modern container runtimes.
The RFC interface currently does not account for this in
the scheme of things.
2. While /proc/stat is now namespace aware and userspace programs like
top will see the CPU utilization for their view of virtual CPUs,
if the system or any other application outside the namespace
bumps up the CPU utilization it will still show up in sys/user time.
This should ideally be shown as stolen time instead.
The current implementation plugs into the display of stats rather
than the accounting, which causes incorrect reporting of stolen time.
3. The current implementation assumes that no hotplug operations occur
within a container; hence the online and present CPUs within a CPU
namespace are always the same and query the same CPU namespace mask.
4. As this is a proof of concept, we currently do not differentiate
between cgroup cpus_allowed and effective_cpus and plug them into
the same virtual CPU map of the namespace.
5. As described in the fair use implications earlier, knowledge of the
CPU topology can potentially be misused by flooding the bus.
While scrambling the CPU set in the namespace helps by obfuscating
that information, the topology can still be roughly figured
out by using IPI latencies to determine sibling or far-away
cores.

More information about the design and a video demo of the prototype can
be found here: https://pratiksampat.github.io/cpu_namespace.html

Pratik R. Sampat (5):
ns: Introduce CPU Namespace
ns: Add scrambling functionality to CPU namespace
cpuset/cpuns: Make cgroup CPUset CPU namespace aware
cpu/cpuns: Make sysfs CPU namespace aware
proc/cpuns: Make procfs load stats CPU namespace aware

drivers/base/cpu.c | 35 ++++-
fs/proc/namespaces.c | 4 +
fs/proc/stat.c | 50 +++++--
include/linux/cpu_namespace.h | 159 ++++++++++++++++++++++
include/linux/nsproxy.h | 2 +
include/linux/proc_ns.h | 2 +
include/linux/user_namespace.h | 1 +
include/uapi/linux/sched.h | 1 +
init/Kconfig | 8 ++
kernel/Makefile | 1 +
kernel/cgroup/cpuset.c | 57 +++++++-
kernel/cpu_namespace.c | 233 +++++++++++++++++++++++++++++++++
kernel/fork.c | 2 +-
kernel/nsproxy.c | 30 ++++-
kernel/sched/core.c | 16 ++-
kernel/ucount.c | 1 +
16 files changed, 581 insertions(+), 21 deletions(-)
create mode 100644 include/linux/cpu_namespace.h
create mode 100644 kernel/cpu_namespace.c

--
2.31.1


2021-10-09 15:15:11

by Pratik R. Sampat

Subject: [RFC 1/5] ns: Introduce CPU Namespace

CPU namespace isolates CPU topology information

The CPU namespace isolates CPU information by virtualizing the CPU IDs
as viewed by Linux and maintaining a virtual map for each task.
The commit also adds the functionality of plugging this interface into
the control and display paths via the sched_setaffinity and
sched_getaffinity syscalls. These syscalls translate between the
namespace map and the physical CPUs to determine the CPU set for the
task to operate on.

As all the clone flags have been exhausted, the flag for a new CPU
namespace follows suit with the time namespace and continues the
pattern of intersecting with CSIGNAL.
This means that this namespace can be created only via the unshare()
and clone3() syscalls, e.g. as in the sketch below.
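
A minimal userspace sketch of opting in (illustrative only; assumes the
CLONE_NEWCPU value added to the uapi header in this patch; error
handling trimmed):

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>

  #ifndef CLONE_NEWCPU
  #define CLONE_NEWCPU 0x00000040 /* value from this patch's uapi change */
  #endif

  int main(void)
  {
          cpu_set_t set;

          /* opt into a new CPU namespace; unshare()/clone3() only */
          if (unshare(CLONE_NEWCPU) != 0) {
                  perror("unshare(CLONE_NEWCPU)");
                  return 1;
          }

          /* affinity is now reported in namespace-virtual CPU ids */
          if (sched_getaffinity(0, sizeof(set), &set) == 0)
                  printf("CPUs visible in the namespace: %d\n",
                         CPU_COUNT(&set));
          return 0;
  }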

Signed-off-by: Pratik R. Sampat <[email protected]>
---
fs/proc/namespaces.c | 4 +
include/linux/cpu_namespace.h | 159 +++++++++++++++++++++++++++
include/linux/nsproxy.h | 2 +
include/linux/proc_ns.h | 2 +
include/linux/user_namespace.h | 1 +
include/uapi/linux/sched.h | 1 +
init/Kconfig | 8 ++
kernel/Makefile | 1 +
kernel/cpu_namespace.c | 192 +++++++++++++++++++++++++++++++++
kernel/fork.c | 2 +-
kernel/nsproxy.c | 30 +++++-
kernel/sched/core.c | 16 ++-
kernel/ucount.c | 1 +
13 files changed, 414 insertions(+), 5 deletions(-)
create mode 100644 include/linux/cpu_namespace.h
create mode 100644 kernel/cpu_namespace.c

diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 8e159fc78c0a..d65170a8a648 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -9,6 +9,7 @@
#include <linux/ipc_namespace.h>
#include <linux/pid_namespace.h>
#include <linux/user_namespace.h>
+#include <linux/cpu_namespace.h>
#include "internal.h"


@@ -37,6 +38,9 @@ static const struct proc_ns_operations *ns_entries[] = {
&timens_operations,
&timens_for_children_operations,
#endif
+#ifdef CONFIG_CPU_NS
+ &cpuns_operations,
+#endif
};

static const char *proc_ns_get_link(struct dentry *dentry,
diff --git a/include/linux/cpu_namespace.h b/include/linux/cpu_namespace.h
new file mode 100644
index 000000000000..edad05919db7
--- /dev/null
+++ b/include/linux/cpu_namespace.h
@@ -0,0 +1,159 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _LINUX_CPU_NS_H
+#define _LINUX_CPU_NS_H
+
+#include <linux/sched.h>
+#include <linux/bug.h>
+#include <linux/nsproxy.h>
+#include <linux/ns_common.h>
+
+/*
+ * Virtual CPUs => View of the CPUs in the CPU NS context
+ * Physical CPUs => CPU as viewed by host, also known as logical CPUs
+ */
+struct cpu_namespace {
+ /* Virtual map of cpus in the cpuset */
+ cpumask_t v_cpuset_cpus;
+ /* map for CPU translation -- Physical --> Virtual */
+ int p_v_trans_map[NR_CPUS];
+ /* map for CPU translation -- Virtual --> Physical */
+ int v_p_trans_map[NR_CPUS];
+ struct cpu_namespace *parent;
+ struct ucounts *ucounts;
+ struct user_namespace *user_ns;
+ struct ns_common ns;
+} __randomize_layout;
+
+extern struct cpu_namespace init_cpu_ns;
+
+#ifdef CONFIG_CPU_NS
+
+static inline struct cpu_namespace *get_cpu_ns(struct cpu_namespace *ns)
+{
+ if (ns != &init_cpu_ns)
+ refcount_inc(&ns->ns.count);
+ return ns;
+}
+
+/*
+ * Get the virtual CPU for the requested physical CPU in the ns context
+ */
+static inline int get_vcpu_cpuns(struct cpu_namespace *c, int pcpu)
+{
+ if (pcpu >= num_possible_cpus())
+ return -1;
+
+ return c->p_v_trans_map[pcpu];
+}
+
+/*
+ * Get the physical CPU for requested virtual CPU in the ns context
+ */
+static inline int get_pcpu_cpuns(struct cpu_namespace *c, int vcpu)
+{
+ if (vcpu >= num_possible_cpus())
+ return -1;
+
+ return c->v_p_trans_map[vcpu];
+}
+
+/*
+ * Given the physical CPU map get the virtual CPUs corresponding to that ns
+ */
+static inline cpumask_t get_vcpus_cpuns(struct cpu_namespace *c,
+ const cpumask_var_t mask)
+{
+ int cpu;
+ cpumask_t temp;
+
+ cpumask_clear(&temp);
+
+ for_each_cpu(cpu, mask)
+ cpumask_set_cpu(get_vcpu_cpuns(c, cpu), &temp);
+
+ return temp;
+}
+
+/*
+ * Given a virtual CPU map get the physical CPUs corresponding to that ns
+ */
+static inline cpumask_t get_pcpus_cpuns(struct cpu_namespace *c,
+ const cpumask_var_t mask)
+{
+ int cpu;
+ cpumask_t temp;
+
+ cpumask_clear(&temp);
+
+ for_each_cpu(cpu, mask)
+ cpumask_set_cpu(get_pcpu_cpuns(c, cpu), &temp);
+
+ return temp;
+}
+
+extern struct cpu_namespace *copy_cpu_ns(unsigned long flags,
+ struct user_namespace *user_ns, struct cpu_namespace *ns);
+extern void put_cpu_ns(struct cpu_namespace *ns);
+
+#else /* !CONFIG_CPU_NS */
+#include <linux/err.h>
+
+static inline struct cpu_namespace *get_cpu_ns(struct cpu_namespace *ns)
+{
+ return ns;
+}
+
+static inline struct cpu_namespace *copy_cpu_ns(unsigned long flags,
+ struct user_namespace *user_ns, struct cpu_namespace *ns)
+{
+ if (flags & CLONE_NEWCPU)
+ return ERR_PTR(-EINVAL);
+ return ns;
+}
+
+static inline void put_cpu_ns(struct cpu_namespace *ns)
+{
+}
+
+static inline int get_vcpu_cpuns(struct cpu_namespace *c, int pcpu)
+{
+ return pcpu;
+}
+
+static inline int get_pcpu_cpuns(struct cpu_namespace *c, int vcpu)
+{
+ return vcpu;
+}
+
+static inline cpumask_t get_vcpus_cpuns(struct cpu_namespace *c,
+ const cpumask_var_t mask)
+{
+ cpumask_t temp;
+ int cpu;
+
+ cpumask_clear(&temp);
+
+ for_each_cpu(cpu, mask)
+ cpumask_set_cpu(get_vcpu_cpuns(c, cpu), &temp);
+
+ return temp;
+}
+
+static inline cpumask_t get_pcpus_cpuns(struct cpu_namespace *c,
+ const cpumask_var_t mask)
+{
+ cpumask_t temp;
+ int cpu;
+
+ cpumask_clear(&temp);
+
+ for_each_cpu(cpu, mask)
+ cpumask_set_cpu(get_pcpu_cpuns(c, cpu), &temp);
+
+ return temp;
+}
+
+#endif /* CONFIG_CPU_NS */
+
+#endif /* _LINUX_CPU_NS_H */
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index cdb171efc7cb..40e0357fe0bb 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -10,6 +10,7 @@ struct uts_namespace;
struct ipc_namespace;
struct pid_namespace;
struct cgroup_namespace;
+struct cpu_namespace;
struct fs_struct;

/*
@@ -38,6 +39,7 @@ struct nsproxy {
struct time_namespace *time_ns;
struct time_namespace *time_ns_for_children;
struct cgroup_namespace *cgroup_ns;
+ struct cpu_namespace *cpu_ns;
};
extern struct nsproxy init_nsproxy;

diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index 75807ecef880..dd1db6782336 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -34,6 +34,7 @@ extern const struct proc_ns_operations mntns_operations;
extern const struct proc_ns_operations cgroupns_operations;
extern const struct proc_ns_operations timens_operations;
extern const struct proc_ns_operations timens_for_children_operations;
+extern const struct proc_ns_operations cpuns_operations;

/*
* We always define these enumerators
@@ -46,6 +47,7 @@ enum {
PROC_PID_INIT_INO = 0xEFFFFFFCU,
PROC_CGROUP_INIT_INO = 0xEFFFFFFBU,
PROC_TIME_INIT_INO = 0xEFFFFFFAU,
+ PROC_CPU_INIT_INO = 0xEFFFFFFU,
};

#ifdef CONFIG_PROC_FS
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index eb70cabe6e7f..9f0b121f97ac 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -46,6 +46,7 @@ enum ucount_type {
UCOUNT_MNT_NAMESPACES,
UCOUNT_CGROUP_NAMESPACES,
UCOUNT_TIME_NAMESPACES,
+ UCOUNT_CPU_NAMESPACES,
#ifdef CONFIG_INOTIFY_USER
UCOUNT_INOTIFY_INSTANCES,
UCOUNT_INOTIFY_WATCHES,
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 3bac0a8ceab2..f8bb6de68874 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -41,6 +41,7 @@
* cloning flags intersect with CSIGNAL so can be used with unshare and clone3
* syscalls only:
*/
+#define CLONE_NEWCPU 0x00000040 /* New cpu namespace */
#define CLONE_NEWTIME 0x00000080 /* New time namespace */

#ifndef __ASSEMBLY__
diff --git a/init/Kconfig b/init/Kconfig
index 55f9f7738ebb..c3e3abd35bb4 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1214,6 +1214,14 @@ config NET_NS
Allow user space to create what appear to be multiple instances
of the network stack.

+config CPU_NS
+ bool "CPU Namespaces"
+ default y
+ help
+ Support CPU namespaces. This allows having a context-aware
+ scrambled view of the CPU topology coherent to limits set
+ in system control mechanism.
+
endif # NAMESPACES

config CHECKPOINT_RESTORE
diff --git a/kernel/Makefile b/kernel/Makefile
index 4df609be42d0..5a37e3e56f8f 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -82,6 +82,7 @@ obj-$(CONFIG_CGROUPS) += cgroup/
obj-$(CONFIG_UTS_NS) += utsname.o
obj-$(CONFIG_USER_NS) += user_namespace.o
obj-$(CONFIG_PID_NS) += pid_namespace.o
+obj-$(CONFIG_CPU_NS) += cpu_namespace.o
obj-$(CONFIG_IKCONFIG) += configs.o
obj-$(CONFIG_IKHEADERS) += kheaders.o
obj-$(CONFIG_SMP) += stop_machine.o
diff --git a/kernel/cpu_namespace.c b/kernel/cpu_namespace.c
new file mode 100644
index 000000000000..6c700522352a
--- /dev/null
+++ b/kernel/cpu_namespace.c
@@ -0,0 +1,192 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * CPU namespaces
+ * <TODO More description>
+ *
+ * Author: Pratik Rajesh Sampat <[email protected]>
+ */
+
+#include <linux/cpu_namespace.h>
+#include <linux/syscalls.h>
+#include <linux/proc_ns.h>
+#include <linux/export.h>
+#include <linux/acct.h>
+#include <linux/err.h>
+#include <linux/random.h>
+
+static void dec_cpu_namespaces(struct ucounts *ucounts)
+{
+ dec_ucount(ucounts, UCOUNT_CPU_NAMESPACES);
+}
+
+static void destroy_cpu_namespace(struct cpu_namespace *ns)
+{
+ ns_free_inum(&ns->ns);
+
+ dec_cpu_namespaces(ns->ucounts);
+ put_user_ns(ns->user_ns);
+}
+
+static struct ucounts *inc_cpu_namespaces(struct user_namespace *ns)
+{
+ return inc_ucount(ns, current_euid(), UCOUNT_CPU_NAMESPACES);
+}
+
+static struct cpu_namespace *create_cpu_namespace(struct user_namespace *user_ns,
+ struct cpu_namespace *parent_cpu_ns)
+{
+ struct cpu_namespace *ns;
+ struct ucounts *ucounts;
+ int err, i, cpu;
+ cpumask_t temp;
+
+ err = -EINVAL;
+ if (!in_userns(parent_cpu_ns->user_ns, user_ns))
+ goto out;
+
+ ucounts = inc_cpu_namespaces(user_ns);
+ if (!ucounts)
+ goto out;
+
+ err = -ENOMEM;
+ ns = kmalloc(sizeof(*ns), GFP_KERNEL);
+ if (ns == NULL)
+ goto out_dec;
+
+ err = ns_alloc_inum(&ns->ns);
+ if (err)
+ goto out_free_ns;
+ ns->ns.ops = &cpuns_operations;
+
+ refcount_set(&ns->ns.count, 1);
+ ns->parent = get_cpu_ns(parent_cpu_ns);
+ ns->user_ns = get_user_ns(user_ns);
+
+ for_each_present_cpu(cpu) {
+ ns->p_v_trans_map[cpu] = ns->parent->p_v_trans_map[cpu];
+ ns->v_p_trans_map[cpu] = ns->parent->v_p_trans_map[cpu];
+ }
+ cpumask_clear(&temp);
+ cpumask_clear(&ns->v_cpuset_cpus);
+
+ for_each_cpu(i, &parent_cpu_ns->v_cpuset_cpus) {
+ int parent_pcpu = get_pcpu_cpuns(parent_cpu_ns, i);
+
+ cpumask_set_cpu(get_vcpu_cpuns(ns, parent_pcpu),
+ &ns->v_cpuset_cpus);
+ }
+ for_each_cpu(i, &ns->v_cpuset_cpus)
+ cpumask_set_cpu(get_pcpu_cpuns(ns, i), &temp);
+
+ set_cpus_allowed_ptr(current, &temp);
+
+ return ns;
+
+out_free_ns:
+ kfree(ns);
+out_dec:
+ dec_cpu_namespaces(ucounts);
+out:
+ return ERR_PTR(err);
+}
+
+struct cpu_namespace *copy_cpu_ns(unsigned long flags,
+ struct user_namespace *user_ns, struct cpu_namespace *old_ns)
+{
+ if (!(flags & CLONE_NEWCPU))
+ return get_cpu_ns(old_ns);
+ return create_cpu_namespace(user_ns, old_ns);
+}
+
+void put_cpu_ns(struct cpu_namespace *ns)
+{
+ struct cpu_namespace *parent;
+
+ while (ns != &init_cpu_ns) {
+ parent = ns->parent;
+ if (!refcount_dec_and_test(&ns->ns.count))
+ break;
+ destroy_cpu_namespace(ns);
+ ns = parent;
+ }
+}
+EXPORT_SYMBOL_GPL(put_cpu_ns);
+
+static inline struct cpu_namespace *to_cpu_ns(struct ns_common *ns)
+{
+ return container_of(ns, struct cpu_namespace, ns);
+}
+
+static struct ns_common *cpuns_get(struct task_struct *task)
+{
+ struct cpu_namespace *ns = NULL;
+ struct nsproxy *nsproxy;
+
+ task_lock(task);
+ nsproxy = task->nsproxy;
+ if (nsproxy) {
+ ns = nsproxy->cpu_ns;
+ get_cpu_ns(ns);
+ }
+ task_unlock(task);
+
+ return ns ? &ns->ns : NULL;
+}
+
+static void cpuns_put(struct ns_common *ns)
+{
+ put_cpu_ns(to_cpu_ns(ns));
+}
+
+static int cpuns_install(struct nsset *nsset, struct ns_common *new)
+{
+ struct nsproxy *nsproxy = nsset->nsproxy;
+ struct cpu_namespace *ns = to_cpu_ns(new);
+
+ if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN) ||
+ !ns_capable(nsset->cred->user_ns, CAP_SYS_ADMIN))
+ return -EPERM;
+
+ get_cpu_ns(ns);
+ put_cpu_ns(nsproxy->cpu_ns);
+ nsproxy->cpu_ns = ns;
+ return 0;
+}
+
+static struct user_namespace *cpuns_owner(struct ns_common *ns)
+{
+ return to_cpu_ns(ns)->user_ns;
+}
+
+const struct proc_ns_operations cpuns_operations = {
+ .name = "cpu",
+ .type = CLONE_NEWCPU,
+ .get = cpuns_get,
+ .put = cpuns_put,
+ .install = cpuns_install,
+ .owner = cpuns_owner,
+};
+
+struct cpu_namespace init_cpu_ns = {
+ .ns.count = REFCOUNT_INIT(2),
+ .user_ns = &init_user_ns,
+ .ns.inum = PROC_CPU_INIT_INO,
+ .ns.ops = &cpuns_operations,
+};
+EXPORT_SYMBOL(init_cpu_ns);
+
+static __init int cpu_namespaces_init(void)
+{
+ int cpu;
+
+ cpumask_copy(&init_cpu_ns.v_cpuset_cpus, cpu_possible_mask);
+
+ /* Identity mapping for the cpu_namespace init */
+ for_each_present_cpu(cpu) {
+ init_cpu_ns.p_v_trans_map[cpu] = cpu;
+ init_cpu_ns.v_p_trans_map[cpu] = cpu;
+ }
+
+ return 0;
+}
+device_initcall(cpu_namespaces_init);
diff --git a/kernel/fork.c b/kernel/fork.c
index bc94b2cc5995..fac3317b1f57 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2877,7 +2877,7 @@ static int check_unshare_flags(unsigned long unshare_flags)
CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP|
- CLONE_NEWTIME))
+ CLONE_NEWTIME|CLONE_NEWCPU))
return -EINVAL;
/*
* Not implemented, but pretend it works if there is nothing
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index abc01fcad8c7..dab0f9799ce7 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -19,6 +19,7 @@
#include <net/net_namespace.h>
#include <linux/ipc_namespace.h>
#include <linux/time_namespace.h>
+#include <linux/cpu_namespace.h>
#include <linux/fs_struct.h>
#include <linux/proc_fs.h>
#include <linux/proc_ns.h>
@@ -47,6 +48,9 @@ struct nsproxy init_nsproxy = {
.time_ns = &init_time_ns,
.time_ns_for_children = &init_time_ns,
#endif
+#ifdef CONFIG_CPU_NS
+ .cpu_ns = &init_cpu_ns,
+#endif
};

static inline struct nsproxy *create_nsproxy(void)
@@ -121,8 +125,17 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
}
new_nsp->time_ns = get_time_ns(tsk->nsproxy->time_ns);

+ new_nsp->cpu_ns = copy_cpu_ns(flags, user_ns, tsk->nsproxy->cpu_ns);
+ if (IS_ERR(new_nsp->cpu_ns)) {
+ err = PTR_ERR(new_nsp->cpu_ns);
+ goto out_cpu;
+ }
+
return new_nsp;

+out_cpu:
+ if (new_nsp->cpu_ns)
+ put_cpu_ns(new_nsp->cpu_ns);
out_time:
put_net(new_nsp->net_ns);
out_net:
@@ -156,7 +169,8 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)

if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
CLONE_NEWPID | CLONE_NEWNET |
- CLONE_NEWCGROUP | CLONE_NEWTIME)))) {
+ CLONE_NEWCGROUP | CLONE_NEWTIME |
+ CLONE_NEWCPU)))) {
if (likely(old_ns->time_ns_for_children == old_ns->time_ns)) {
get_nsproxy(old_ns);
return 0;
@@ -216,7 +230,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,

if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP |
- CLONE_NEWTIME)))
+ CLONE_NEWTIME | CLONE_NEWCPU)))
return 0;

user_ns = new_cred ? new_cred->user_ns : current_user_ns();
@@ -289,6 +303,10 @@ static int check_setns_flags(unsigned long flags)
if (flags & CLONE_NEWTIME)
return -EINVAL;
#endif
+#ifndef CONFIG_CPU_NS
+ if (flags & CLONE_NEWCPU)
+ return -EINVAL;
+#endif

return 0;
}
@@ -471,6 +489,14 @@ static int validate_nsset(struct nsset *nsset, struct pid *pid)
}
#endif

+#ifdef CONFIG_CPU_NS
+ if (flags & CLONE_NEWCPU) {
+ ret = validate_ns(nsset, &nsp->cpu_ns->ns);
+ if (ret)
+ goto out;
+ }
+#endif
+
out:
if (pid_ns)
put_pid_ns(pid_ns);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d9ff40f4661..0413175e6d73 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -27,6 +27,8 @@
#include "pelt.h"
#include "smp.h"

+#include <linux/cpu_namespace.h>
+
/*
* Export tracepoints that act as a bare tracehook (ie: have no trace event
* associated with them) to allow external modules to probe them.
@@ -7559,6 +7561,7 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
{
cpumask_var_t cpus_allowed, new_mask;
struct task_struct *p;
+ cpumask_t temp;
int retval;

rcu_read_lock();
@@ -7601,7 +7604,8 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)


cpuset_cpus_allowed(p, cpus_allowed);
- cpumask_and(new_mask, in_mask, cpus_allowed);
+ temp = get_pcpus_cpuns(current->nsproxy->cpu_ns, in_mask);
+ cpumask_and(new_mask, &temp, cpus_allowed);

/*
* Since bandwidth control happens on root_domain basis,
@@ -7682,8 +7686,9 @@ SYSCALL_DEFINE3(sched_setaffinity, pid_t, pid, unsigned int, len,
long sched_getaffinity(pid_t pid, struct cpumask *mask)
{
struct task_struct *p;
+ cpumask_var_t temp;
unsigned long flags;
- int retval;
+ int retval, cpu;

rcu_read_lock();

@@ -7698,6 +7703,13 @@ long sched_getaffinity(pid_t pid, struct cpumask *mask)

raw_spin_lock_irqsave(&p->pi_lock, flags);
cpumask_and(mask, &p->cpus_mask, cpu_active_mask);
+ cpumask_clear(temp);
+ for_each_cpu(cpu, mask) {
+ cpumask_set_cpu(get_vcpu_cpuns(current->nsproxy->cpu_ns, cpu),
+ temp);
+ }
+
+ cpumask_copy(mask, temp);
raw_spin_unlock_irqrestore(&p->pi_lock, flags);

out_unlock:
diff --git a/kernel/ucount.c b/kernel/ucount.c
index 87799e2379bd..3adb168b4a2b 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -76,6 +76,7 @@ static struct ctl_table user_table[] = {
UCOUNT_ENTRY("max_mnt_namespaces"),
UCOUNT_ENTRY("max_cgroup_namespaces"),
UCOUNT_ENTRY("max_time_namespaces"),
+ UCOUNT_ENTRY("max_cpu_namespaces"),
#ifdef CONFIG_INOTIFY_USER
UCOUNT_ENTRY("max_inotify_instances"),
UCOUNT_ENTRY("max_inotify_watches"),
--
2.31.1

2021-10-09 15:16:32

by Pratik R. Sampat

Subject: [RFC 5/5] proc/cpuns: Make procfs load stats CPU namespace aware

This commit adds support to provide a virtualized view of the
/proc/stat load statistics. The load, idle, irq and the rest of the
information for a physical CPU are now displayed against its
corresponding virtual CPU counterpart.
The procfs file only populates the virtualized view for the CPUs
allowed by the cgroup cpuset restrictions set upon the task.

Signed-off-by: Pratik R. Sampat <[email protected]>
---
fs/proc/stat.c | 50 ++++++++++++++++++++++++++++++++++++++------------
1 file changed, 38 insertions(+), 12 deletions(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 6561a06ef905..3ff39e7362bb 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -14,6 +14,7 @@
#include <linux/irqnr.h>
#include <linux/sched/cputime.h>
#include <linux/tick.h>
+#include <linux/cpu_namespace.h>

#ifndef arch_irq_stat_cpu
#define arch_irq_stat_cpu(cpu) 0
@@ -107,13 +108,14 @@ static void show_all_irqs(struct seq_file *p)

static int show_stat(struct seq_file *p, void *v)
{
- int i, j;
+ int i, j, pcpu;
u64 user, nice, system, idle, iowait, irq, softirq, steal;
u64 guest, guest_nice;
u64 sum = 0;
u64 sum_softirq = 0;
unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
struct timespec64 boottime;
+ cpumask_var_t cpu_mask;

user = nice = system = idle = iowait =
irq = softirq = steal = 0;
@@ -122,27 +124,39 @@ static int show_stat(struct seq_file *p, void *v)
/* shift boot timestamp according to the timens offset */
timens_sub_boottime(&boottime);

- for_each_possible_cpu(i) {
+#ifdef CONFIG_CPU_NS
+ if (current->nsproxy->cpu_ns == &init_cpu_ns) {
+ cpumask_copy(cpu_mask, cpu_possible_mask);
+ } else {
+ cpumask_copy(cpu_mask,
+ &current->nsproxy->cpu_ns->v_cpuset_cpus);
+ }
+#else
+ cpumask_copy(cpu_mask, cpu_possible_mask);
+#endif
+
+ for_each_cpu(i, cpu_mask) {
struct kernel_cpustat kcpustat;
u64 *cpustat = kcpustat.cpustat;

- kcpustat_cpu_fetch(&kcpustat, i);
+ pcpu = get_pcpu_cpuns(current->nsproxy->cpu_ns, i);
+ kcpustat_cpu_fetch(&kcpustat, pcpu);

user += cpustat[CPUTIME_USER];
nice += cpustat[CPUTIME_NICE];
system += cpustat[CPUTIME_SYSTEM];
- idle += get_idle_time(&kcpustat, i);
- iowait += get_iowait_time(&kcpustat, i);
+ idle += get_idle_time(&kcpustat, pcpu);
+ iowait += get_iowait_time(&kcpustat, pcpu);
irq += cpustat[CPUTIME_IRQ];
softirq += cpustat[CPUTIME_SOFTIRQ];
steal += cpustat[CPUTIME_STEAL];
guest += cpustat[CPUTIME_GUEST];
guest_nice += cpustat[CPUTIME_GUEST_NICE];
- sum += kstat_cpu_irqs_sum(i);
- sum += arch_irq_stat_cpu(i);
+ sum += kstat_cpu_irqs_sum(pcpu);
+ sum += arch_irq_stat_cpu(pcpu);

for (j = 0; j < NR_SOFTIRQS; j++) {
- unsigned int softirq_stat = kstat_softirqs_cpu(j, i);
+ unsigned int softirq_stat = kstat_softirqs_cpu(j, pcpu);

per_softirq_sums[j] += softirq_stat;
sum_softirq += softirq_stat;
@@ -162,18 +176,30 @@ static int show_stat(struct seq_file *p, void *v)
seq_put_decimal_ull(p, " ", nsec_to_clock_t(guest_nice));
seq_putc(p, '\n');

- for_each_online_cpu(i) {
+#ifdef CONFIG_CPU_NS
+ if (current->nsproxy->cpu_ns == &init_cpu_ns) {
+ cpumask_copy(cpu_mask, cpu_online_mask);
+ } else {
+ cpumask_copy(cpu_mask,
+ &current->nsproxy->cpu_ns->v_cpuset_cpus);
+ }
+#else
+ cpumask_copy(cpu_mask, cpu_online_mask);
+#endif
+ for_each_cpu(i, cpu_mask) {
struct kernel_cpustat kcpustat;
u64 *cpustat = kcpustat.cpustat;

- kcpustat_cpu_fetch(&kcpustat, i);
+ pcpu = get_pcpu_cpuns(current->nsproxy->cpu_ns, i);
+
+ kcpustat_cpu_fetch(&kcpustat, pcpu);

/* Copy values here to work around gcc-2.95.3, gcc-2.96 */
user = cpustat[CPUTIME_USER];
nice = cpustat[CPUTIME_NICE];
system = cpustat[CPUTIME_SYSTEM];
- idle = get_idle_time(&kcpustat, i);
- iowait = get_iowait_time(&kcpustat, i);
+ idle = get_idle_time(&kcpustat, pcpu);
+ iowait = get_iowait_time(&kcpustat, pcpu);
irq = cpustat[CPUTIME_IRQ];
softirq = cpustat[CPUTIME_SOFTIRQ];
steal = cpustat[CPUTIME_STEAL];
--
2.31.1

2021-10-09 15:17:14

by Pratik R. Sampat

Subject: [RFC 2/5] ns: Add scrambling functionality to CPU namespace

The commit adds functionality to map every host CPU to a virtualized CPU
in the namespace.

Every CPU namespace apart from the init namespace has its CPU map
scrambled. The CPUs are mapped in a flat hierarchy. This means that
regardless of the parent-child hierarchy of the namespaces, each
translation maps directly from the virtual namespace CPU to the
physical CPU without needing to traverse the hierarchy.

Signed-off-by: Pratik R. Sampat <[email protected]>
---
kernel/cpu_namespace.c | 49 ++++++++++++++++++++++++++++++++++++++----
1 file changed, 45 insertions(+), 4 deletions(-)

diff --git a/kernel/cpu_namespace.c b/kernel/cpu_namespace.c
index 6c700522352a..7b8b28f3d0e7 100644
--- a/kernel/cpu_namespace.c
+++ b/kernel/cpu_namespace.c
@@ -27,6 +27,33 @@ static void destroy_cpu_namespace(struct cpu_namespace *ns)
put_user_ns(ns->user_ns);
}

+/*
+ * Shuffle
+ * Arrange the N elements of ARRAY in random order.
+ * Only effective if N is much smaller than RAND_MAX;
+ * if this may not be the case, use a better random
+ * number generator. -- Ben Pfaff.
+ */
+#define RAND_MAX 32767
+void shuffle(int *array, size_t n)
+{
+ unsigned int rnd_num;
+ int i, j, t;
+
+ if (n <= 1)
+ return;
+
+ for (i = 0; i < n-1; i++) {
+ get_random_bytes(&rnd_num, sizeof(rnd_num));
+ rnd_num = rnd_num % RAND_MAX;
+
+ j = i + rnd_num / (RAND_MAX / (n - i) + 1);
+ t = array[j];
+ array[j] = array[i];
+ array[i] = t;
+ }
+}
+
static struct ucounts *inc_cpu_namespaces(struct user_namespace *ns)
{
return inc_ucount(ns, current_euid(), UCOUNT_CPU_NAMESPACES);
@@ -37,8 +64,9 @@ static struct cpu_namespace *create_cpu_namespace(struct user_namespace *user_ns
{
struct cpu_namespace *ns;
struct ucounts *ucounts;
- int err, i, cpu;
+ int err, i, cpu, n = 0;
cpumask_t temp;
+ int *cpu_arr;

err = -EINVAL;
if (!in_userns(parent_cpu_ns->user_ns, user_ns))
@@ -62,10 +90,21 @@ static struct cpu_namespace *create_cpu_namespace(struct user_namespace *user_ns
ns->parent = get_cpu_ns(parent_cpu_ns);
ns->user_ns = get_user_ns(user_ns);

- for_each_present_cpu(cpu) {
- ns->p_v_trans_map[cpu] = ns->parent->p_v_trans_map[cpu];
- ns->v_p_trans_map[cpu] = ns->parent->v_p_trans_map[cpu];
+ cpu_arr = kmalloc_array(num_possible_cpus(), sizeof(int), GFP_KERNEL);
+ if (!cpu_arr)
+ goto out_free_ns;
+
+ for_each_possible_cpu(cpu) {
+ cpu_arr[n++] = cpu;
+ }
+
+ shuffle(cpu_arr, n);
+
+ for (i = 0; i < n; i++) {
+ ns->p_v_trans_map[i] = cpu_arr[i];
+ ns->v_p_trans_map[cpu_arr[i]] = i;
}
+
cpumask_clear(&temp);
cpumask_clear(&ns->v_cpuset_cpus);

@@ -80,6 +119,8 @@ static struct cpu_namespace *create_cpu_namespace(struct user_namespace *user_ns

set_cpus_allowed_ptr(current, &temp);

+ kfree(cpu_arr);
+
return ns;

out_free_ns:
--
2.31.1

2021-10-09 15:17:17

by Pratik R. Sampat

Subject: [RFC 3/5] cpuset/cpuns: Make cgroup CPUset CPU namespace aware

When a new cgroup is created or a cpuset is updated, the mask supplied
to it is translated into its corresponding physical CPUs for the
restrictions to apply to.

The patch also updates the display interface such that tasks within a
namespace view the corresponding virtual CPU set based on their
CPU namespace context.

Signed-off-by: Pratik R. Sampat <[email protected]>
---
kernel/cgroup/cpuset.c | 57 +++++++++++++++++++++++++++++++++++++++---
1 file changed, 54 insertions(+), 3 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index adb5190c4429..eb1e950543cf 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -65,6 +65,7 @@
#include <linux/mutex.h>
#include <linux/cgroup.h>
#include <linux/wait.h>
+#include <linux/cpu_namespace.h>

DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key);
DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key);
@@ -1061,8 +1062,19 @@ static void update_tasks_cpumask(struct cpuset *cs)
struct task_struct *task;

css_task_iter_start(&cs->css, 0, &it);
- while ((task = css_task_iter_next(&it)))
+ while ((task = css_task_iter_next(&it))) {
+#ifdef CONFIG_CPU_NS
+ cpumask_t pcpus;
+ cpumask_t vcpus;
+
+ pcpus = get_pcpus_cpuns(current->nsproxy->cpu_ns, cs->effective_cpus);
+ vcpus = get_vcpus_cpuns(task->nsproxy->cpu_ns, &pcpus);
+ cpumask_copy(&task->nsproxy->cpu_ns->v_cpuset_cpus, &vcpus);
+ set_cpus_allowed_ptr(task, &pcpus);
+#else
set_cpus_allowed_ptr(task, cs->effective_cpus);
+#endif
+ }
css_task_iter_end(&it);
}

@@ -2212,8 +2224,18 @@ static void cpuset_attach(struct cgroup_taskset *tset)
* can_attach beforehand should guarantee that this doesn't
* fail. TODO: have a better way to handle failure here
*/
- WARN_ON_ONCE(set_cpus_allowed_ptr(task, cpus_attach));
+#ifdef CONFIG_CPU_NS
+ cpumask_t pcpus;
+ cpumask_t vcpus;

+ pcpus = get_pcpus_cpuns(current->nsproxy->cpu_ns, cpus_attach);
+ vcpus = get_vcpus_cpuns(task->nsproxy->cpu_ns, &pcpus);
+ cpumask_copy(&task->nsproxy->cpu_ns->v_cpuset_cpus, &vcpus);
+
+ WARN_ON_ONCE(set_cpus_allowed_ptr(task, &pcpus));
+#else
+ WARN_ON_ONCE(set_cpus_allowed_ptr(task, cpus_attach));
+#endif
cpuset_change_task_nodemask(task, &cpuset_attach_nodemask_to);
cpuset_update_task_spread_flag(cs, task);
}
@@ -2436,13 +2458,33 @@ static int cpuset_common_seq_show(struct seq_file *sf, void *v)

switch (type) {
case FILE_CPULIST:
+#ifdef CONFIG_CPU_NS
+ if (current->nsproxy->cpu_ns == &init_cpu_ns) {
+ seq_printf(sf, "%*pbl\n",
+ cpumask_pr_args(cs->cpus_allowed));
+ } else {
+ seq_printf(sf, "%*pbl\n",
+ cpumask_pr_args(&current->nsproxy->cpu_ns->v_cpuset_cpus));
+ }
+#else
seq_printf(sf, "%*pbl\n", cpumask_pr_args(cs->cpus_allowed));
+#endif
break;
case FILE_MEMLIST:
seq_printf(sf, "%*pbl\n", nodemask_pr_args(&cs->mems_allowed));
break;
case FILE_EFFECTIVE_CPULIST:
+#ifdef CONFIG_CPU_NS
+ if (current->nsproxy->cpu_ns == &init_cpu_ns) {
+ seq_printf(sf, "%*pbl\n",
+ cpumask_pr_args(cs->effective_cpus));
+ } else {
+ seq_printf(sf, "%*pbl\n",
+ cpumask_pr_args(&current->nsproxy->cpu_ns->v_cpuset_cpus));
+ }
+#else
seq_printf(sf, "%*pbl\n", cpumask_pr_args(cs->effective_cpus));
+#endif
break;
case FILE_EFFECTIVE_MEMLIST:
seq_printf(sf, "%*pbl\n", nodemask_pr_args(&cs->effective_mems));
@@ -2884,9 +2926,18 @@ static void cpuset_bind(struct cgroup_subsys_state *root_css)
*/
static void cpuset_fork(struct task_struct *task)
{
+#ifdef CONFIG_CPU_NS
+ cpumask_t vcpus;
+#endif
+
if (task_css_is_root(task, cpuset_cgrp_id))
return;
-
+#ifdef CONFIG_CPU_NS
+ if (task->nsproxy->cpu_ns != &init_cpu_ns) {
+ vcpus = get_vcpus_cpuns(task->nsproxy->cpu_ns, current->cpus_ptr);
+ cpumask_copy(&task->nsproxy->cpu_ns->v_cpuset_cpus, &vcpus);
+ }
+#endif
set_cpus_allowed_ptr(task, current->cpus_ptr);
task->mems_allowed = current->mems_allowed;
}
--
2.31.1

2021-10-09 15:18:42

by Pratik R. Sampat

Subject: [RFC 4/5] cpu/cpuns: Make sysfs CPU namespace aware

The commit adds support for sysfs files like online, offline and
present to be CPU namespace context aware. It presents virtualized CPU
information that is coherent with the cgroup cpuset restrictions set
upon the tasks.

Signed-off-by: Pratik R. Sampat <[email protected]>
---
drivers/base/cpu.c | 35 ++++++++++++++++++++++++++++++++++-
1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 5ef14db97904..1487b23e5472 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -20,6 +20,7 @@
#include <linux/tick.h>
#include <linux/pm_qos.h>
#include <linux/sched/isolation.h>
+#include <linux/cpu_namespace.h>

#include "base.h"

@@ -203,6 +204,24 @@ struct cpu_attr {
const struct cpumask *const map;
};

+#ifdef CONFIG_CPU_NS
+static ssize_t show_cpuns_cpus_attr(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct cpu_attr *ca = container_of(attr, struct cpu_attr, attr);
+
+ if (current->nsproxy->cpu_ns == &init_cpu_ns)
+ return cpumap_print_to_pagebuf(true, buf, ca->map);
+
+ return cpumap_print_to_pagebuf(true, buf,
+ &current->nsproxy->cpu_ns->v_cpuset_cpus);
+}
+
+#define _CPU_CPUNS_ATTR(name, map) \
+ { __ATTR(name, 0444, show_cpuns_cpus_attr, NULL), map }
+#endif
+
static ssize_t show_cpus_attr(struct device *dev,
struct device_attribute *attr,
char *buf)
@@ -217,9 +236,14 @@ static ssize_t show_cpus_attr(struct device *dev,

/* Keep in sync with cpu_subsys_attrs */
static struct cpu_attr cpu_attrs[] = {
+#ifdef CONFIG_CPU_NS
+ _CPU_CPUNS_ATTR(online, &__cpu_online_mask),
+ _CPU_CPUNS_ATTR(present, &__cpu_present_mask),
+#else
_CPU_ATTR(online, &__cpu_online_mask),
- _CPU_ATTR(possible, &__cpu_possible_mask),
_CPU_ATTR(present, &__cpu_present_mask),
+#endif
+ _CPU_ATTR(possible, &__cpu_possible_mask),
};

/*
@@ -244,7 +268,16 @@ static ssize_t print_cpus_offline(struct device *dev,
/* display offline cpus < nr_cpu_ids */
if (!alloc_cpumask_var(&offline, GFP_KERNEL))
return -ENOMEM;
+#ifdef CONFIG_CPU_NS
+ if (current->nsproxy->cpu_ns == &init_cpu_ns) {
+ cpumask_andnot(offline, cpu_possible_mask, cpu_online_mask);
+ } else {
+ cpumask_andnot(offline, cpu_possible_mask,
+ &current->nsproxy->cpu_ns->v_cpuset_cpus);
+ }
+#else
cpumask_andnot(offline, cpu_possible_mask, cpu_online_mask);
+#endif
len += sysfs_emit_at(buf, len, "%*pbl", cpumask_pr_args(offline));
free_cpumask_var(offline);

--
2.31.1

2021-10-09 22:45:03

by Peter Zijlstra

Subject: Re: [RFC 1/5] ns: Introduce CPU Namespace

On Sat, Oct 09, 2021 at 08:42:39PM +0530, Pratik R. Sampat wrote:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 2d9ff40f4661..0413175e6d73 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -27,6 +27,8 @@
> #include "pelt.h"
> #include "smp.h"
>
> +#include <linux/cpu_namespace.h>
> +
> /*
> * Export tracepoints that act as a bare tracehook (ie: have no trace event
> * associated with them) to allow external modules to probe them.
> @@ -7559,6 +7561,7 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
> {
> cpumask_var_t cpus_allowed, new_mask;
> struct task_struct *p;
> + cpumask_t temp;
> int retval;
>
> rcu_read_lock();

You're not supposed to put a cpumask_t on stack. Those things can be
huge.

> @@ -7682,8 +7686,9 @@ SYSCALL_DEFINE3(sched_setaffinity, pid_t, pid, unsigned int, len,
> long sched_getaffinity(pid_t pid, struct cpumask *mask)
> {
> struct task_struct *p;
> + cpumask_var_t temp;
> unsigned long flags;
> - int retval;
> + int retval, cpu;
>
> rcu_read_lock();
>
> @@ -7698,6 +7703,13 @@ long sched_getaffinity(pid_t pid, struct cpumask *mask)
>
> raw_spin_lock_irqsave(&p->pi_lock, flags);
> cpumask_and(mask, &p->cpus_mask, cpu_active_mask);
> + cpumask_clear(temp);

There's a distinct lack of allocating temp before use. Are you sure you
actually tested this?

2021-10-09 22:47:58

by Peter Zijlstra

Subject: Re: [RFC 0/5] kernel: Introduce CPU Namespace

On Sat, Oct 09, 2021 at 08:42:38PM +0530, Pratik R. Sampat wrote:

> Current shortcomings in the prototype:
> --------------------------------------
> 1. Containers also frequently use cfs period and quotas to restrict CPU
> runtime also known as millicores in modern container runtimes.
> The RFC interface currently does not account for this in
> the scheme of things.
> 2. While /proc/stat is now namespace aware and userspace programs like
> top will see the CPU utilization for their view of virtual CPUs;
> if the system or any other application outside the namespace
> bumps up the CPU utilization it will still show up in sys/user time.
> This should ideally be shown as stolen time instead.
> The current implementation plugs into the display of stats rather
> than accounting which causes incorrect reporting of stolen time.
> 3. The current implementation assumes that no hotplug operations occur
> within a container and hence the online and present cpus within a CPU
> namespace are always the same and query the same CPU namespace mask
> 4. As this is a proof of concept, currently we do not differentiate
> between cgroup cpus_allowed and effective_cpus and plugs them into
> the same virtual CPU map of the namespace
> 5. As described in a fair use implication earlier, knowledge of the
> CPU topology can potentially be taken an misused with a flood.
> While scrambling the CPUset in the namespace can help by
> obfuscation of information, the topology can still be roughly figured
> out with the use of IPI latencies to determine siblings or far away
> cores

6. completely destroys and ignores any machine topology information.

2021-10-11 13:33:31

by Christian Brauner

Subject: Re: [RFC 0/5] kernel: Introduce CPU Namespace

On Sat, Oct 09, 2021 at 08:42:38PM +0530, Pratik R. Sampat wrote:
> An early prototype of to demonstrate CPU namespace interface and its
> mechanism.
>
> The kernel provides two ways to control CPU resources for tasks
> 1. cgroup cpuset:
> A control mechanism to restrict CPUs to a task or a
> set of tasks attached to that group
> 2. syscall sched_setaffinity:
> A system call that can pin tasks to a set of CPUs
>
> The kernel also provides three ways to view the CPU resources available
> to the system:
> 1. sys/procfs:
> CPU system information is divulged through sys and proc fs, it
> exposes online, offline, present as well as load characteristics on
> the CPUs
> 2. syscall sched_getaffinity:
> A system call interface to get the cpuset affinity of tasks
> 3. cgroup cpuset:
> While cgroup is more of a control mechanism than a display mechanism,
> it can be viewed to retrieve the CPU restrictions applied on a group
> of tasks
>
> Coherency of information
> ------------------------
> The control and the display interface is fairly disjoint with each
> other. Restrictions can be set through control interfaces like cgroups,
> while many applications legacy or otherwise get the view of the system
> through sysfs/procfs and allocate resources like number of
> threads/processes, memory allocation based on that information.
>
> This can lead to unexpected running behaviors as well as have a high
> impact on performance.
>
> Existing solutions to the problem include userspace tools like LXCFS
> which can fake the sysfs information by mounting onto the sysfs online
> file to be in coherence with the limits set through cgroup cpuset.
> However, LXCFS is an external solution and needs to be explicitly setup
> for applications that require it. Another concern is also that tools
> like LXCFS don't handle all the other display mechanism like procfs load
> stats.
>
> Therefore, the need of a clean interface could be advocated for.
>
> Security and fair use implications
> ----------------------------------
> In a multi-tenant system, multiple containers may exist and information
> about the entire system, rather than just the resources that are
> restricted upon them can cause security and fair use implications such
> as:
> 1. A case where an actor can be in cognizance of the CPU node topology
> can schedule workloads and select CPUs such that the bus is flooded
> causing a Denial Of Service attack
> 2. A case wherein identifying the CPU system topology can help identify
> cores that are close to buses and peripherals such as GPUs to get an
> undue latency advantage from the rest of the workloads
>
> A survey RFD discusses other potential solutions and their concerns are
> listed here: https://lkml.org/lkml/2021/7/22/204
>
> This prototype patchset introduces a new kernel namespace mechanism --
> CPU namespace.
>
> The CPU namespace isolates CPU information by virtualizing logical CPU
> IDs and creating a scrambled virtual CPU map of the same.
> It latches onto the task_struct and is the cpu translations designed to
> be in a flat hierarchy this means that every virtual namespace CPU maps
> to a physical CPU at the creation of the namespace. The advantage of a
> flat hierarchy is that translations are O(1) and children do not need
> to traverse up the tree to retrieve a translation.
>
> This namespace then allows both control and display interfaces to be
> CPU namespace context aware, such that a task within a namespace only
> gets the view and therefore control of its and view CPU resources
> available to it via a virtual CPU map.
>
> Experiment
> ----------
> We designed an experiment to benchmark nginx configured with
> "worker_processes: auto" (which ensures that the number of processes to
> spawn will be derived from resources viewed on the system) and a
> benchmark/driver application wrk
>
> Nginx: Nginx is a web server that can also be used as a reverse proxy,
> load balancer, mail proxy and HTTP cache
> Wrk: wrk is a modern HTTP benchmarking tool capable of generating
> significant load when run on a single multi-core CPU
>
> Docker is used as the containerization platform of choice.
>
> The numbers gathered on IBM Power 9 CPU @ 2.979GHz with 176 CPUs and
> 127GB memory
> kernel: 5.14
>
> Case1: vanilla kernel - cpuset 4 cpus, no optimization
> Case2: CPU namespace kernel - cpuset 4 cpus
>
>
> +-----------------------+----------+----------+-----------------+
> | Metric | Case1 | Case2 | case2 vs case 1 |
> +-----------------------+----------+----------+-----------------+
> | PIDs | 177 | 5 | 172 PIDs |
> | mem usage (init) (MB) | 272.8 | 11.12 | 95.92% |
> | mem usage (peak) (MB) | 281.3 | 20.62 | 92.66% |
> | Latency (avg ms) | 70.91 | 25.36 | 64.23% |
> | Requests/sec | 47011.05 | 47080.98 | 0.14% |
> | Transfer/sec (MB) | 38.11 | 38.16 | 0.13% |
> +-----------------------+----------+----------+-----------------+
>
> With the CPU namespace we see the correct number of PIDs spawning
> corresponding to the cpuset limits set. The memory utilization drops
> over 92-95%, the latency reduces by 64% and the the throughput like
> requests and transfer per second is unchanged.
>
> Note: To utilize this new namespace in a container runtime like docker,
> the clone CPU namespace flag was modified to coincide with the PID
> namespace as they are the building blocks of containers and will always
> be invoked.
>
> Current shortcomings in the prototype:
> --------------------------------------
> 1. Containers also frequently use cfs period and quotas to restrict CPU
> runtime also known as millicores in modern container runtimes.
> The RFC interface currently does not account for this in
> the scheme of things.
> 2. While /proc/stat is now namespace aware and userspace programs like
> top will see the CPU utilization for their view of virtual CPUs;
> if the system or any other application outside the namespace
> bumps up the CPU utilization it will still show up in sys/user time.
> This should ideally be shown as stolen time instead.
> The current implementation plugs into the display of stats rather
> than accounting which causes incorrect reporting of stolen time.
> 3. The current implementation assumes that no hotplug operations occur
> within a container and hence the online and present cpus within a CPU
> namespace are always the same and query the same CPU namespace mask
> 4. As this is a proof of concept, currently we do not differentiate
> between cgroup cpus_allowed and effective_cpus and plugs them into
> the same virtual CPU map of the namespace
> 5. As described in a fair use implication earlier, knowledge of the
> CPU topology can potentially be taken an misused with a flood.
> While scrambling the CPUset in the namespace can help by
> obfuscation of information, the topology can still be roughly figured
> out with the use of IPI latencies to determine siblings or far away
> cores
>
> More information about the design and a video demo of the prototype can
> be found here: https://pratiksampat.github.io/cpu_namespace.html

Thank you for providing a new approach to this problem and thanks for
summarizing some of the pain points and current solutions. I do agree
that this is a problem we should tackle in some form.

I have one design comment and one process-related comment.

Fundamentally I think making this a new namespace is not the correct
approach. One core feature of a namespace is that it is an opt-in
isolation mechanism: if I do CLONE_NEW* that is when the new isolation
mechanism kicks in. The correct reporting through procfs and sysfs is
built into that and we do bugfixes whenever reported information is
wrong.

The cpu namespace would be different; a point I think you're making as
well further above:

> The control and the display interface is fairly disjoint with each
> other. Restrictions can be set through control interfaces like cgroups,

A task wouldn't really opt-in to cpu isolation with CLONE_NEWCPU it
would only affect resource reporting. So it would be one half of the
semantics of a namespace.

We do already have all the infra in place to isolate/delegate cpu
related resources in the form of cgroups. The cpu namespace would then
be a hack on top of this to fix non-virtualized resource reporting.

In all honesty, I think cpu resource reporting through procfs/sysfs as
done today without taking a task's cgroup information into account is a
bug. But the community has long agreed that fixing this would be a
regression.

I think that either we need to come up with new non-syscall based
interfaces that allow querying virtualized cpu information and buy into
the process of teaching userspace about them. This is even independent
of containers.
This is in line with proposing e.g. new procfs/sysfs files. Userspace
can then keep supplementing cpu virtualization via e.g. stuff like LXCFS
until tools have switched to read their cpu information from new
interfaces. Something that they need to be taught anyway.

Or if we really want to have this tied to a namespace then I think we
should consider extending CLONE_NEWCGROUP since cgroups are where cpu
isolation for containers is really happening. And arguably we should
restrict this to cgroup v2.

From a process perspective, I think this is something where we will
need strong guidance from the cgroup and cpu crowd. Ultimately, they
need to be the ones merging a feature like this as this is very much
into their territory.

Christian

2021-10-11 16:39:00

by Michal Koutný

Subject: Re: [RFC 0/5] kernel: Introduce CPU Namespace

On Mon, Oct 11, 2021 at 12:11:24PM +0200, Christian Brauner <[email protected]> wrote:
> Fundamentally I think making this a new namespace is not the correct
> approach.

I tend to agree.

Also, generally, this is not only a problem of cpuset but of some other
controllers as well (the original letter mentions CPU bandwidth limits;
another thing is memory limits (and I wonder whether some apps already
adjust their behavior to available IO characteristics)).

The problem as I see it is the mapping from a real dedicated HW to a
cgroup restricted environment ("container"), which can be shared. In
this instance, the virtualized view would not be able to represent a
situation when a CPU is assigned non-exclusively to multiple cpusets.

(Although, one speciality of the CPU namespace approach here is the
remapping/scrambling of the CPU topology. Not sure if good or bad.)

> I think that either we need to come up with new non-syscall based
> interfaces that allow to query virtualized cpu information and buy into
> the process of teaching userspace about them. This is even independent
> of containers.

For the reason above, I also agree with this. And I think this
interface (mostly) exists -- userspace could query the cgroup files
(cpuset.cpus.effective in this case); it would even have the liberty
to decide between querying available resources in its "container"
(root cgroup (cgroup NS)) or a further subdivision of that (the
immediately encompassing cgroup).
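
E.g. a rough sketch of such a query (assuming cgroup v2 mounted at
/sys/fs/cgroup; inside a container with a cgroup namespace the same
path would show the container's own cgroup):

  #include <stdio.h>

  int main(void)
  {
          char buf[256];
          /* effective cpuset of the task's (root) cgroup */
          FILE *f = fopen("/sys/fs/cgroup/cpuset.cpus.effective", "r");

          if (f && fgets(buf, sizeof(buf), f))
                  printf("effective cpus: %s", buf);
          if (f)
                  fclose(f);
          return 0;
  }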


On Sat, Oct 09, 2021 at 08:42:38PM +0530, "Pratik R. Sampat" <[email protected]> wrote:
> Existing solutions to the problem include userspace tools like LXCFS
> which can fake the sysfs information by mounting onto the sysfs online
> file to be in coherence with the limits set through cgroup cpuset.
> However, LXCFS is an external solution and needs to be explicitly setup
> for applications that require it. Another concern is also that tools
> like LXCFS don't handle all the other display mechanism like procfs load
> stats.
>
> Therefore, the need of a clean interface could be advocated for.

I'd like to write something in support of your approach but I'm afraid
that the problem of the mapping (dedicated vs shared) makes this most
suitable for some external/separate entity such as LXCFS.

My .02€,
Michal

2021-10-11 17:44:25

by Tejun Heo

Subject: Re: [RFC 0/5] kernel: Introduce CPU Namespace

Hello,

On Mon, Oct 11, 2021 at 04:17:37PM +0200, Michal Koutný wrote:
> The problem as I see it is the mapping from a real dedicated HW to a
> cgroup restricted environment ("container"), which can be shared. In
> this instance, the virtualized view would not be able to represent a
> situation when a CPU is assigned non-exclusively to multiple cpusets.

There is a fundamental problem with trying to represent a resource-shared
environment controlled with cgroup using system-wide interfaces, including
procfs, because the goal of much cgroup resource control includes
work-conservation, which is also one of the main reasons why containers are
more attractive in resource-intense deployments. System-level interfaces
naturally describe a discrete system, which can't express the dynamic
distribution with cgroups.

There are aspects of cgroups which are akin to hard partitioning and thus
can be represented by diddling with system level interfaces. Whether those
are worthwhile to pursue depends on how easy and useful they are; however,
there's no avoiding that each of those is gonna be a very partial and
fragmented thing, which significantly contributes to the default cons list
of such attempts.

> > Existing solutions to the problem include userspace tools like LXCFS
> > which can fake the sysfs information by mounting onto the sysfs online
> > file to be in coherence with the limits set through cgroup cpuset.
> > However, LXCFS is an external solution and needs to be explicitly setup
> > for applications that require it. Another concern is also that tools
> > like LXCFS don't handle all the other display mechanism like procfs load
> > stats.
> >
> > Therefore, the need of a clean interface could be advocated for.
>
> I'd like to write something in support of your approach but I'm afraid that the
> problem of the mapping (dedicated vs shared) makes this most suitable for some
> external/separate entity such as the LCXFS already.

This is more of a unit problem than an interface one - ie. the existing
numbers in the system interface don't really fit what needs to be
described.

One approach that we've found useful in practice is dynamically changing
resource consumption based on shortage, as measured by PSI, rather than some
number representing what's available. e.g. for a build service, building a
feedback loop which monitors its own cpu, memory and io pressures and
modulates the number of concurrent jobs.
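
A rough sketch of such a loop, reading the PSI cpu file (the thresholds
below are arbitrary and purely illustrative):

  #include <stdio.h>

  /* return the 10s "some" cpu pressure percentage, or -1 on error */
  static double cpu_some_avg10(void)
  {
          double avg10 = -1.0;
          FILE *f = fopen("/proc/pressure/cpu", "r");

          if (f) {
                  fscanf(f, "some avg10=%lf", &avg10);
                  fclose(f);
          }
          return avg10;
  }

  /* back off when stalled, ramp up when there is headroom */
  static int adjust_jobs(int jobs)
  {
          double p = cpu_some_avg10();

          if (p > 20.0 && jobs > 1)
                  return jobs - 1;
          if (p >= 0.0 && p < 5.0)
                  return jobs + 1;
          return jobs;
  }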

There are some numbers which would be fundamentally useful - e.g. the
ballpark number of threads needed to saturate the computing capacity
available to the cgroup, or the ballpark bytes of memory available without
noticeable contention. Those, I think, we definitely need to work on, but I
don't see much point in trying to bend the existing /proc numbers for them.

Thanks.

--
tejun

2021-10-12 08:47:20

by Pratik R. Sampat

[permalink] [raw]
Subject: Re: [RFC 0/5] kernel: Introduce CPU Namespace

Hello,
> Thank your for providing a new approach to this problem and thanks for
> summarizing some of the painpoints and current solutions. I do agree
> that this is a problem we should tackle in some form.
>
> I have one design comment and one process related comments.
>
> Fundamentally I think making this a new namespace is not the correct
> approach. One core feature of a namespace is that it is an opt-in
> isolation mechanism: if I do CLONE_NEW* that is when the new isolation
> mechanism kicks in. The correct reporting through procfs and sysfs is
> built into that and we do bugfixes whenever reported information is
> wrong.
>
> The cpu namespace would be different; a point I think you're making as
> well further above:
>
>> The control and the display interface is fairly disjoint with each
>> other. Restrictions can be set through control interfaces like cgroups,
> A task wouldn't really opt-in to cpu isolation with CLONE_NEWCPU it
> would only affect resource reporting. So it would be one half of the
> semantics of a namespace.
>
I completely agree with you on this, fundamentally a namespace should
isolate both the resource as well as the reporting. As you mentioned
too, cgroups handles the resource isolation while this namespace
handles the reporting and this seems to break the semantics of what a
namespace should really be.

The CPU resource is unique in that sense, at least in this context,
which makes it tricky to design an interface that presents coherent
information.

> In all honesty, I think cpu resource reporting through procfs/sysfs as
> done today without taking a task's cgroup information into account is a
> bug. But the community has long agreed that fixing this would be a
> regression.
>
> I think that either we need to come up with new non-syscall based
> interfaces that allow querying virtualized cpu information and buy into
> the process of teaching userspace about them. This is even independent
> of containers.
> This is in line with proposing e.g. new procfs/sysfs files. Userspace
> can then keep supplementing cpu virtualization via e.g. stuff like LXCFS
> until tools have switched to read their cpu information from new
> interfaces. Something that they need to be taught anyway.

I too think that having a brand new interface altogether and teaching
userspace about it is a much cleaner approach.
On the same lines, if we were to do that, we could also add more useful
metrics in that interface like ballpark number of threads to saturate
usage as well as gather more such metrics as suggested by Tejun Heo.

My only concern for this would be that if today applications aren't
modifying their code to read the existing cgroup interface and would
rather resort to using userspace side-channel solutions like LXCFS or
wrapping them up in kata containers, would it now be compelling enough
to introduce yet another interface?

While I concur with Tejun Heo's comment in the mail thread that overloading
existing interfaces of sys and proc, which were originally designed for
system-wide resources, may not be a great idea:

> There is a fundamental problem with trying to represent a resource shared
> environment controlled with cgroup using system-wide interfaces including
> procfs

A fundamental question we probably need to ascertain could be -
Today, is it incorrect for applications to look at the sys and procfs to
get resource information, regardless of their runtime environment?

Also, if an application were to only be able to view the resources
based on the restrictions set regardless of the interface - would there
be a disadvantage for them if they could only see an overloaded context
sensitive view rather than the whole system view?

> Or if we really want to have this tied to a namespace then I think we
> should consider extending CLONE_NEWCGROUP since cgroups are where cpu
> isolation for containers is really happening. And arguably we should
> restrict this to cgroup v2.

Having given it some thought, I tend to agree this could be wrapped in a
cgroup namespace. However, some more deliberation is definitely needed to
determine whether, by including CPU isolation here, we aren't breaking
another semantic set by the cgroup namespace itself, as cgroups don't
necessarily have to have restrictions on CPUs set and can also allow
mixing of restrictions from cpuset and cfs period-quota.

>
> From a process perspective, I think this is something where we will need
> strong guidance from the cgroup and cpu crowd. Ultimately, they need to
> be the ones merging a feature like this as this is very much into their
> territory.

I agree, we definitely need the guidance from the cgroups and cpu folks
in the community. We would also benefit from guidance from the
userspace community, like containers, to understand how they use the
existing interfaces so that we can arrive at a holistic view of what
everybody could benefit from.

>
> Christian

Thank you once again for all the comments. The CPU namespace is me
taking a stab at highlighting the problem itself. While not without
its flaws, having a coherent interface does seem to show benefits as
well.
Hence, if consensus builds around the right interface for solving this
problem, I would be glad to help contribute to a solution for it.

Thanks,
Pratik



2021-10-15 03:56:02

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC 0/5] kernel: Introduce CPU Namespace

Hello,

On Tue, Oct 12, 2021 at 02:12:18PM +0530, Pratik Sampat wrote:
> > > The control and the display interface is fairly disjoint with each
> > > other. Restrictions can be set through control interfaces like cgroups,
> > A task wouldn't really opt-in to cpu isolation with CLONE_NEWCPU it
> > would only affect resource reporting. So it would be one half of the
> > semantics of a namespace.
> >
> I completely agree with you on this, fundamentally a namespace should
> isolate both the resource as well as the reporting. As you mentioned
> too, cgroups handles the resource isolation while this namespace
> handles the reporting and this seems to break the semantics of what a
> namespace should really be.
>
> The CPU resource is unique in that sense, at least in this context,
> which makes it tricky to design an interface that presents coherent
> information.

It's only unique in the context that you're trying to place CPU distribution
into the namespace framework when the resource in question isn't distributed
that way. All of the three major local resources - CPU, memory and IO - are
in the same boat. Computing resources, the physical ones, don't render
themselves naturally to accounting and distributing by segmenting _name_
spaces which ultimately just shows and hides names. This direction is a
dead-end.

> I too think that having a brand new interface altogether and teaching
> userspace about it is a much cleaner approach.
> On the same lines, if we were to do that, we could also add more useful
> metrics in that interface like ballpark number of threads to saturate
> usage as well as gather more such metrics as suggested by Tejun Heo.
>
> My only concern for this would be that if today applications aren't
> modifying their code to read the existing cgroup interface and would
> rather resort to using userspace side-channel solutions like LXCFS or
> wrapping them up in kata containers, would it now be compelling enough
> to introduce yet another interface?

While I'm sympathetic to the compatibility argument, identifying available
resources was never well-defined with the existing interfaces. Most of the
available information is what hardware is available but there's no
consistent way of knowing what the software environment is like. Is the
application the only one on the system? How much memory should be set aside
for system management, monitoring and other administrative operations?

In practice, the numbers that are available can serve as the starting points
on top of which application and environment specific knowledge has to be
applied to actually determine deployable configurations, which in turn would
go through iterative adjustments unless the workload is self-sizing.

Given such variability in requirements, I'm not sure what numbers should be
baked into the "namespaced" system metrics. Some numbers, e.g., number of
CPUs, may be mapped from cpuset configuration but even that requires
quite a bit of assumptions about how cpuset is configured and the
expectations the applications would have, while other numbers - e.g.
available memory - are a total non-starter.

If we try to fake these numbers for containers, what's likely to happen is
that the service owners would end up tuning workload size against whatever
number the kernel is showing factoring in all the environmental factors
knowingly or just through iterations. And that's not *really* an interface
which provides compatibility. We're just piping new numbers which don't
really mean what they used to mean and whose meanings can change depending
on configuration through existing interfaces and letting users figure out
what to do with the new numbers.

To achieve compatibility where applications don't need to be changed, I
don't think there is a solution which doesn't involve going through
userspace. For other cases and long term, the right direction is providing
well-defined resource metrics that applications can make sense of and use to
size themselves dynamically.

> While I concur with Tejun Heo's comment in the mail thread that overloading
> existing interfaces of sys and proc, which were originally designed for
> system-wide resources, may not be a great idea:
>
> > There is a fundamental problem with trying to represent a resource shared
> > environment controlled with cgroup using system-wide interfaces including
> > procfs
>
> A fundamental question we probably need to ascertain could be -
> Today, is it incorrect for applications to look at the sys and procfs to
> get resource information, regardless of their runtime environment?

Well, it's incomplete even without containerization. Containerization just
amplifies the shortcomings. All of these problems existed well before
cgroups / namespaces. How would you know how much resource you can consume
on a system just looking at hardware resources without implicit knowledge of
what else is on the system? It's just that we are now more likely to load
systems dynamically with containerization.

> Also, if an application were to only be able to view the resources
> based on the restrictions set regardless of the interface - would there
> be a disadvantage for them if they could only see an overloaded context
> sensitive view rather than the whole system view?

Can you elaborate further? I have a hard time understanding what's being
asked.

Thanks.

--
tejun

2021-10-18 15:32:22

by Pratik R. Sampat

[permalink] [raw]
Subject: Re: [RFC 0/5] kernel: Introduce CPU Namespace



On 15/10/21 3:44 am, Tejun Heo wrote:
> Hello,
>
> On Tue, Oct 12, 2021 at 02:12:18PM +0530, Pratik Sampat wrote:
>>>> The control and the display interface is fairly disjoint with each
>>>> other. Restrictions can be set through control interfaces like cgroups,
>>> A task wouldn't really opt-in to cpu isolation with CLONE_NEWCPU it
>>> would only affect resource reporting. So it would be one half of the
>>> semantics of a namespace.
>>>
>> I completely agree with you on this, fundamentally a namespace should
>> isolate both the resource as well as the reporting. As you mentioned
>> too, cgroups handles the resource isolation while this namespace
>> handles the reporting and this seems to break the semantics of what a
>> namespace should really be.
>>
>> The CPU resource is unique in that sense, at least in this context,
>> which makes it tricky to design an interface that presents coherent
>> information.
> It's only unique in the context that you're trying to place CPU distribution
> into the namespace framework when the resource in question isn't distributed
> that way. All of the three major local resources - CPU, memory and IO - are
> in the same boat. Computing resources, the physical ones, don't render
> themselves naturally to accounting and distributing by segmenting _name_
> spaces which ultimately just shows and hides names. This direction is a
> dead-end.
>
>> I too think that having a brand new interface altogether and teaching
>> userspace about it is a much cleaner approach.
>> On the same lines, if we were to do that, we could also add more useful
>> metrics in that interface like ballpark number of threads to saturate
>> usage as well as gather more such metrics as suggested by Tejun Heo.
>>
>> My only concern for this would be that if today applications aren't
>> modifying their code to read the existing cgroup interface and would
>> rather resort to using userspace side-channel solutions like LXCFS or
>> wrapping them up in kata containers, would it now be compelling enough
>> to introduce yet another interface?
> While I'm sympathetic to the compatibility argument, identifying available
> resources was never well-defined with the existing interfaces. Most of the
> available information is what hardware is available but there's no
> consistent way of knowing what the software environment is like. Is the
> application the only one on the system? How much memory should be set aside
> for system management, monitoring and other administrative operations?
>
> In practice, the numbers that are available can serve as the starting points
> on top of which application and environment specific knowledge has to be
> applied to actually determine deployable configurations, which in turn would
> go through iterative adjustments unless the workload is self-sizing.
>
> Given such variability in requirements, I'm not sure what numbers should be
> baked into the "namespaced" system metrics. Some numbers, e.g., number of
> CPUs, may be mapped from cpuset configuration but even that requires
> quite a bit of assumptions about how cpuset is configured and the
> expectations the applications would have, while other numbers - e.g.
> available memory - are a total non-starter.
>
> If we try to fake these numbers for containers, what's likely to happen is
> that the service owners would end up tuning workload size against whatever
> number the kernel is showing factoring in all the environmental factors
> knowingly or just through iterations. And that's not *really* an interface
> which provides compatibility. We're just piping new numbers which don't
> really mean what they used to mean and whose meanings can change depending
> on configuration through existing interfaces and letting users figure out
> what to do with the new numbers.
>
> To achieve compatibility where applications don't need to be changed, I
> don't think there is a solution which doesn't involve going through
> userspace. For other cases and long term, the right direction is providing
> well-defined resource metrics that applications can make sense of and use to
> size themselves dynamically.

I agree that major local resources like CPUs and memory cannot be
distributed cleanly in a namespace semantic.
The memory resource, like CPU, faces similar coherency issues, where
/proc/meminfo can differ from what the restrictions are.

While a CPU namespace may not be the preferred way of solving
this problem, the prototype RFC is rather meant for understanding the
related problems, as well as other potential directions that we could
explore for solving them.

Also, I agree with your point about the variability of requirements.
Even if the interface we provide is in conjunction with the limits set,
it would not be useful if the applications still have to derive metrics
from it or from other kernel information regardless.
If the solution to this problem lies in userspace, then I'm all for it
as well. However, the intention is to probe whether this could
potentially be solved cleanly in the kernel.

>> While I concur with Tejun Heo's comment in the mail thread that overloading
>> existing interfaces of sys and proc, which were originally designed for
>> system-wide resources, may not be a great idea:
>>
>>> There is a fundamental problem with trying to represent a resource shared
>>> environment controlled with cgroup using system-wide interfaces including
>>> procfs
>> A fundamental question we probably need to ascertain could be -
>> Today, is it incorrect for applications to look at the sys and procfs to
>> get resource information, regardless of their runtime environment?
> Well, it's incomplete even without containerization. Containerization just
> amplifies the shortcomings. All of these problems existed well before
> cgroups / namespaces. How would you know how much resource you can consume
> on a system just looking at hardware resources without implicit knowledge of
> what else is on the system? It's just that we are now more likely to load
> systems dynamically with containerization.

Yes, these shortcomings exist even without containerization. On a
dynamically loaded multi-tenant system it becomes very difficult to
determine the maximum amount of resources that can be requested
before we hurt our own performance.
cgroups and namespace mechanics help containers give some structure to
the maximum amount of resources that they can consume. However,
applications are unable to leverage that in some cases especially if
they are more inclined to look at a more traditional system wide
interface like sys and proc.

>> Also, if an application were to only be able to view the resources
>> based on the restrictions set regardless of the interface - would there
>> be a disadvantage for them if they could only see an overloaded context
>> sensitive view rather than the whole system view?
> Can you elaborate further? I have a hard time understanding what's being
> asked.

The question that I have essentially tries to understand the
implications of overloading existing interface's definitions to be
context sensitive.
The way that the prototype works today is that it does not interfere
with the information when the system boots or even when it is run in a
new namespace.
The effects are only observed when restrictions are applied to it.
Therefore, what would potentially break if interfaces like these are
made to divulge information based on restrictions rather than the whole
system view?

Thanks
Pratik

2021-10-18 16:33:10

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC 0/5] kernel: Introduce CPU Namespace

(cc'ing Johannes for memory sizing part)

Hello,

On Mon, Oct 18, 2021 at 08:59:16PM +0530, Pratik Sampat wrote:
...
> Also, I agree with your point about the variability of requirements.
> Even if the interface we provide is in conjunction with the limits set,
> it would not be useful if the applications still have to derive metrics
> from it or from other kernel information regardless.
> If the solution to this problem lies in userspace, then I'm all for it
> as well. However, the intention is to probe whether this could
> potentially be solved cleanly in the kernel.

Just to be clear, avoiding application changes would have to involve
userspace (at least parameterization from it), and I think to set that as a
goal for kernel would be more of a distraction. Please note that we should
definitely provide metrics which actually capture what's going on in terms
of resource availability in a way which can be used to size workloads
automatically.

> Yes, these shortcomings exist even without containerization. On a
> dynamically loaded multi-tenant system it becomes very difficult to
> determine the maximum amount of resources that can be requested
> before we hurt our own performance.

As I mentioned before, feedback loop on PSI can work really well in finding
the saturation points for cpu/mem/io and regulating workload size
automatically and dynamically. While such dynamic sizing can work without
any other inputs, it sucks to have to probe the entire range each time and
it'd be really useful if the kernel can provide ballpark numbers that are
needed to estimate the saturation points.

What gets challenging is that there doesn't seem to be a good way to
consistently describe availability for each of the three resources and the
different distribution rules they may be under.

e.g. For CPU, the affinity restrictions from cpuset determine the maximum
number of threads that a workload would need to saturate the available CPUs.
However, conveying the results of cpu.max and cpu.weight controls isn't as
straightforward.
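
As a purely illustrative userspace sketch of the cpuset half (example
cgroup path, minimal error handling), that saturating thread count could
be derived like:

    #include <stdio.h>

    /* Count CPUs in a cpulist like "0-3,8,10-11". */
    static int count_cpulist(const char *path)
    {
        int count = 0, a, b;
        FILE *f = fopen(path, "r");

        if (!f)
            return -1;
        while (fscanf(f, "%d", &a) == 1) {
            b = a;
            fscanf(f, "-%d", &b);    /* optional range end */
            count += b - a + 1;
            if (fgetc(f) != ',')
                break;
        }
        fclose(f);
        return count;
    }

    int main(void)
    {
        /* example path; the cgroup would be looked up via /proc/self/cgroup */
        int n = count_cpulist("/sys/fs/cgroup/mygroup/cpuset.cpus.effective");

        if (n > 0)
            printf("ballpark saturating thread count: %d\n", n);
        return 0;
    }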

For memory, it's even trickier because in a lot of cases it's impossible to
tell how much memory is actually available without trying to use it, as the
active workingset can only be learned by trying to reclaim memory.

IO is in a somewhat similar boat as CPU in that there are both io.max and
io.weight. However, if io.cost is in use and configured according to the
hardware, we can map those two in terms of iocost.

Another thing is that the dynamic nature of these control mechanisms means
that the numbers can keep changing moment to moment and we'd need to provide
some time averaged numbers. We can probably take the same approach as PSI
and load-avgs and provide running avgs of a few time intervals.
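
As an illustration of the averaging idea only (a userspace toy, not how
the kernel would implement it), decaying averages in the spirit of
avg10/avg60/avg300 look like:

    #include <math.h>
    #include <stdio.h>

    struct run_avg {
        double avg10, avg60, avg300;
    };

    /* Fold a new instantaneous sample into the decaying averages. */
    static void run_avg_update(struct run_avg *ra, double sample, double dt)
    {
        double d10 = exp(-dt / 10.0);
        double d60 = exp(-dt / 60.0);
        double d300 = exp(-dt / 300.0);

        ra->avg10 = ra->avg10 * d10 + sample * (1.0 - d10);
        ra->avg60 = ra->avg60 * d60 + sample * (1.0 - d60);
        ra->avg300 = ra->avg300 * d300 + sample * (1.0 - d300);
    }

    int main(void)
    {
        struct run_avg ra = { 0 };
        int i;

        /* pretend availability jumps from 2.0 to 6.0 "CPUs worth" */
        for (i = 0; i < 120; i++)
            run_avg_update(&ra, i < 60 ? 2.0 : 6.0, 1.0 /* seconds */);
        printf("avg10=%.2f avg60=%.2f avg300=%.2f\n",
               ra.avg10, ra.avg60, ra.avg300);
        return 0;
    }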

> The question that I have essentially tries to understand the
> implications of overloading existing interface's definitions to be
> context sensitive.
> The way that the prototype works today is that it does not interfere
> with the information when the system boots or even when it is run in a
> new namespace.
> The effects are only observed when restrictions are applied to it.
> Therefore, what would potentially break if interfaces like these are
> made to divulge information based on restrictions rather than the whole
> system view?

I don't think the problem is that something would necessarily break by doing
that. It's more that it's a dead-end approach which won't get us far for all
the reasons that have been discussed so far. It'd be more productive to
focus on long term solutions and leave backward compatibility to the domains
where they can actually be solved by applying the necessary local knowledge
to emulate and fake whatever necessary numbers.

Thanks.

--
tejun

2021-10-20 10:47:13

by Pratik R. Sampat

[permalink] [raw]
Subject: Re: [RFC 0/5] kernel: Introduce CPU Namespace



On 18/10/21 9:59 pm, Tejun Heo wrote:
> (cc'ing Johannes for memory sizing part)
>
> Hello,
>
> On Mon, Oct 18, 2021 at 08:59:16PM +0530, Pratik Sampat wrote:
> ...
>> Also, I agree with your point about the variability of requirements.
>> Even if the interface we provide is in conjunction with the limits set,
>> it would not be useful if the applications still have to derive metrics
>> from it or from other kernel information regardless.
>> If the solution to this problem lies in userspace, then I'm all for it
>> as well. However, the intention is to probe whether this could
>> potentially be solved cleanly in the kernel.
> Just to be clear, avoiding application changes would have to involve
> userspace (at least parameterization from it), and I think to set that as a
> goal for kernel would be more of a distraction. Please note that we should
> definitely provide metrics which actually capture what's going on in terms
> of resource availability in a way which can be used to size workloads
> automatically.
>
>> Yes, these shortcomings exist even without containerization. On a
>> dynamically loaded multi-tenant system it becomes very difficult to
>> determine the maximum amount of resources that can be requested
>> before we hurt our own performance.
> As I mentioned before, feedback loop on PSI can work really well in finding
> the saturation points for cpu/mem/io and regulating workload size
> automatically and dynamically. While such dynamic sizing can work without
> any other inputs, it sucks to have to probe the entire range each time and
> it'd be really useful if the kernel can provide ballpark numbers that are
> needed to estimate the saturation points.
>
> What gets challenging is that there doesn't seem to be a good way to
> consistently describe availability for each of the three resources and the
> different distribution rules they may be under.
>
> e.g. For CPU, the affinity restrictions from cpuset determine the maximum
> number of threads that a workload would need to saturate the available CPUs.
> However, conveying the results of cpu.max and cpu.weight controls isn't as
> straightforward.
>
> For memory, it's even trickier because in a lot of cases it's impossible to
> tell how much memory is actually available without trying to use it, as the
> active workingset can only be learned by trying to reclaim memory.
>
> IO is in a somewhat similar boat as CPU in that there are both io.max and
> io.weight. However, if io.cost is in use and configured according to the
> hardware, we can map those two in terms of iocost.
>
> Another thing is that the dynamic nature of these control mechanisms means
> that the numbers can keep changing moment to moment and we'd need to provide
> some time averaged numbers. We can probably take the same approach as PSI
> and load-avgs and provide running avgs of a few time intervals.

As you have elucidated, it doesn't seem like an easy feat to
define metrics like ballpark numbers as there are many variables
involved.

For the CPU example, cpusets control the resource space whereas
period-quota control resource time. These seem like two vectors on
different axes.
Conveying these restrictions in one metric doesn't seem easy. Some
container runtimes convert the period-quota time dimension to X CPUs
worth of runtime space dimension. However, we need to carefully model
what a ballpark metric in this sense would be and provide clearer
constraints as both of these restrictions can be active at a given
point in time and can influence how something is run.
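
As an illustration of that conversion (example cgroup path only),
reading cgroup2's cpu.max ("<quota> <period>" or "max <period>") and
dividing quota by period gives the "CPUs worth" figure:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        char quota[32];
        long period;
        FILE *f = fopen("/sys/fs/cgroup/mygroup/cpu.max", "r");

        if (!f || fscanf(f, "%31s %ld", quota, &period) != 2)
            return 1;
        fclose(f);
        if (!strcmp(quota, "max"))
            printf("no bandwidth limit set\n");
        else
            printf("~%.2f CPUs worth of runtime\n", atof(quota) / period);
        return 0;
    }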

Restrictions for memory are even more complicated to model as you have
pointed out as well.

I would also like to use this mail thread to ask whether there are
more such metrics which would be useful to expose from the kernel.
This would probably not solve the coherency problem, but maybe it could
help entice userspace applications to look at the cgroup interface,
as there could be more relevant metrics there that would help them tune
for performance.

>
>> The question that I have essentially tries to understand the
>> implications of overloading existing interface's definitions to be
>> context sensitive.
>> The way that the prototype works today is that it does not interfere
>> with the information when the system boots or even when it is run in a
>> new namespace.
>> The effects are only observed when restrictions are applied to it.
>> Therefore, what would potentially break if interfaces like these are
>> made to divulge information based on restrictions rather than the whole
>> system view?
> I don't think the problem is that something would necessarily break by doing
> that. It's more that it's a dead-end approach which won't get us far for all
> the reasons that have been discussed so far. It'd be more productive to
> focus on long term solutions and leave backward compatibility to the domains
> where they can actually be solved by applying the necessary local knowledge
> to emulate and fake whatever necessary numbers.

Sure, understood. If the only goal is backward compatibility then it's
best to let existing solutions help emulate and/or fake this
information to the applications.

Thank you again for all the feedback.

2021-10-20 16:38:47

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC 0/5] kernel: Introduce CPU Namespace

Hello,

On Wed, Oct 20, 2021 at 04:14:25PM +0530, Pratik Sampat wrote:
> As you have elucidated, it doesn't seem like an easy feat to
> define metrics like ballpark numbers as there are many variables
> involved.

Yeah, it gets tricky and we want to get the basics right from the get go.

> For the CPU example, cpusets control the resource space whereas
> period-quota control resource time. These seem like two vectors on
> different axes.
> Conveying these restrictions in one metric doesn't seem easy. Some
> container runtimes convert the period-quota time dimension to X CPUs
> worth of runtime space dimension. However, we need to carefully model
> what a ballpark metric in this sense would be and provide clearer
> constraints as both of these restrictions can be active at a given
> point in time and can influence how something is run.

So, for CPU, the important functional number is the number of threads needed
to saturate available resources and that one is pretty easy. The other
metric would be the maximum available fractions of CPUs available to the
cgroup subtree if the cgroup stays saturating. This number is trickier as it
has to consider how much others are using but would be determined by the
smaller of what would be available through cpu.weight and cpu.max.
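
The cpu.weight half of that could be approximated from userspace,
assuming all siblings saturate, by summing the sibling weights; a rough
sketch with example paths and minimal error handling:

    #include <dirent.h>
    #include <stdio.h>

    static long read_long(const char *path)
    {
        long v = -1;
        FILE *f = fopen(path, "r");

        if (f) {
            if (fscanf(f, "%ld", &v) != 1)
                v = -1;
            fclose(f);
        }
        return v;
    }

    int main(void)
    {
        const char *parent = "/sys/fs/cgroup/workloads.slice"; /* example */
        long mine = read_long("/sys/fs/cgroup/workloads.slice/job.scope/cpu.weight");
        long sum = 0;
        char path[512];
        struct dirent *de;
        DIR *d = opendir(parent);

        if (!d || mine < 0)
            return 1;
        while ((de = readdir(d)) != NULL) {
            long w;

            if (de->d_name[0] == '.')
                continue;
            snprintf(path, sizeof(path), "%s/%s/cpu.weight", parent, de->d_name);
            w = read_long(path);    /* only child cgroups have this file */
            if (w > 0)
                sum += w;
        }
        closedir(d);
        if (sum > 0)
            printf("worst-case CPU share if all siblings saturate: %.1f%%\n",
                   100.0 * mine / sum);
        return 0;
    }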

IO likely is in a similar boat. We can calculate metrics showing the
rbps/riops/wbps/wiops available to a given cgroup subtree. This would factor
in the limits from io.max and the resulting distribution from io.weight in
iocost's case (iocost will give a % number but we can translate that to
bps/iops numbers).

> Restrictions for memory are even more complicated to model as you have
> pointed out as well.

Yeah, this one is the most challenging.

Thanks.

--
tejun

2021-10-21 07:46:47

by Pratik R. Sampat

[permalink] [raw]
Subject: Re: [RFC 0/5] kernel: Introduce CPU Namespace

Hello Tejun,


On 20/10/21 10:05 pm, Tejun Heo wrote:
> Hello,
>
> On Wed, Oct 20, 2021 at 04:14:25PM +0530, Pratik Sampat wrote:
>> As you have elucidated, it doesn't seem like an easy feat to
>> define metrics like ballpark numbers as there are many variables
>> involved.
> Yeah, it gets tricky and we want to get the basics right from the get go.
>
>> For the CPU example, cpusets control the resource space whereas
>> period-quota control resource time. These seem like two vectors on
>> different axes.
>> Conveying these restrictions in one metric doesn't seem easy. Some
>> container runtimes convert the period-quota time dimension to X CPUs
>> worth of runtime space dimension. However, we need to carefully model
>> what a ballpark metric in this sense would be and provide clearer
>> constraints as both of these restrictions can be active at a given
>> point in time and can influence how something is run.
> So, for CPU, the important functional number is the number of threads needed
> to saturate available resources and that one is pretty easy.

I'm speculating, and please correct me if I'm wrong; suggesting
an optimal number of threads to spawn to saturate the available
resources can get convoluted, right?

In the nginx example illustrated in the cover patch, it worked best
when the thread count was N+1 (N worker threads 1 master thread),
however different applications can work better with a different
configuration of threads spawned based on its usecase and
multi-threading requirements.

Eventually, looking at the load, we may be able to suggest more/fewer
threads to spawn, but initially we may have to suggest threads
to spawn as a direct function of N CPUs available or N CPUs worth of
runtime available?

> The other
> metric would be the maximum available fractions of CPUs available to the
> cgroup subtree if the cgroup stays saturating. This number is trickier as it
> has to consider how much others are using but would be determined by the
> smaller of what would be available through cpu.weight and cpu.max.

I agree, this would be a very useful metric to have. Knowing
how much further we can scale when we're saturating our limits,
keeping in mind the other running applications, can be really
useful not just for the applications themselves but also for the
container orchestrators.

> IO likely is in a similar boat. We can calculate metrics showing the
> rbps/riops/wbps/wiops available to a given cgroup subtree. This would factor
> in the limits from io.max and the resulting distribution from io.weight in
> iocost's case (iocost will give a % number but we can translate that to
> bps/iops numbers).

Yes, that's a useful metric to expose this way as well.

>> Restrictions for memory are even more complicated to model as you have
>> pointed out as well.
> Yeah, this one is the most challenging.
>
> Thanks.
>
Thank you,
Pratik

2021-10-21 17:11:07

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC 0/5] kernel: Introduce CPU Namespace

Hello,

On Thu, Oct 21, 2021 at 01:14:10PM +0530, Pratik Sampat wrote:
> I'm speculating, and please correct me if I'm wrong; suggesting
> an optimal number of threads to spawn to saturate the available
> resources can get convoluted, right?
>
> In the nginx example illustrated in the cover patch, it worked best
> when the thread count was N+1 (N worker threads 1 master thread),
> however different applications can work better with a different
> configuration of threads spawned based on its usecase and
> multi-threading requirements.

Yeah, I mean, the number would have to be based on ideal conditions - ie.
the cgroup needs N always-runnable threads to saturate all the available
CPUs and then applications can do what they need to do based on that
information. Note that this is equivalent to making these decisions based on
number of CPUs.

> Eventually, looking at the load, we may be able to suggest more/fewer
> threads to spawn, but initially we may have to suggest threads
> to spawn as a direct function of N CPUs available or N CPUs worth of
> runtime available?

That kind of dynamic tuning is best done with PSI which can reliably
indicate saturation and the degree of contention.

> > The other
> > metric would be the maximum available fractions of CPUs available to the
> > cgroup subtree if the cgroup stays saturating. This number is trickier as it
> > has to consider how much others are using but would be determined by the
> > smaller of what would be available through cpu.weight and cpu.max.
>
> I agree, this would be a very useful metric to have. Knowing
> how much further we can scale when we're saturating our limits,
> keeping in mind the other running applications, can be really
> useful not just for the applications themselves but also for the
> container orchestrators.

Similarly, availability metrics would be useful in ballpark sizing so that
applications don't have to dynamically tune across the entire range; the
actual adjustments to stay saturated are likely best done through PSI, which
is the direct metric indicating resource saturation.

Thanks.

--
tejun

2021-10-21 17:20:06

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC 0/5] kernel: Introduce CPU Namespace

Pratik Sampat <[email protected]> writes:

> On 18/10/21 9:59 pm, Tejun Heo wrote:
>> (cc'ing Johannes for memory sizing part)
>>
>> For memory, it's even trickier because in a lot of cases it's impossible to
>> tell how much memory is actually available without trying to use it, as the
>> active workingset can only be learned by trying to reclaim memory.
>
> Restrictions for memory are even more complicated to model as you have
> pointed out as well.

For memory sizing we currently have MemAvailable in /proc/meminfo which
makes a global guess at that.

We still need roughly that same approximation from an application's
perspective that takes cgroups into account.
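
A rough userspace sketch of such an approximation (example cgroup path;
memory.high, swap and workingset effects ignored) could take the smaller
of MemAvailable and the room left under memory.max:

    #include <stdio.h>
    #include <string.h>

    /* Return the value of a /proc/meminfo key such as "MemAvailable:", in kB. */
    static long long meminfo_kb(const char *key)
    {
        char line[256];
        long long v = -1;
        FILE *f = fopen("/proc/meminfo", "r");

        if (!f)
            return -1;
        while (fgets(line, sizeof(line), f)) {
            if (!strncmp(line, key, strlen(key))) {
                sscanf(line + strlen(key), "%lld", &v);
                break;
            }
        }
        fclose(f);
        return v;
    }

    /* Read a cgroup2 memory file; -1 means error or "max" (no limit). */
    static long long cgroup_bytes(const char *path)
    {
        char buf[32];
        long long v = -1;
        FILE *f = fopen(path, "r");

        if (!f)
            return -1;
        if (fscanf(f, "%31s", buf) == 1 && strcmp(buf, "max"))
            sscanf(buf, "%lld", &v);
        fclose(f);
        return v;
    }

    int main(void)
    {
        long long avail = meminfo_kb("MemAvailable:") * 1024;
        long long max = cgroup_bytes("/sys/fs/cgroup/mygroup/memory.max");
        long long cur = cgroup_bytes("/sys/fs/cgroup/mygroup/memory.current");

        if (avail < 0)
            return 1;
        if (max > 0 && cur >= 0 && max - cur < avail)
            avail = max - cur;
        printf("approx. available memory: %lld MiB\n", avail >> 20);
        return 0;
    }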

There was another conversation not too long ago and it was tentatively
agreed that it could make sense to provide such a number. However, it
was very much requested that an application that would actually use
that number be found so it would be possible to tell what makes a
difference in practice rather than what makes a difference in theory.

Eric