2021-03-02 20:41:20

by Vipin Sharma

[permalink] [raw]
Subject: [RFC v2 0/2] cgroup: New misc cgroup controller

Hello

This patch series is creating a new misc cgroup controller for limiting
and tracking of resources which are not abstract like other cgroup
controllers.

This controller was initially proposed as encryption_id but after
the feedbacks, it is now changed to misc cgroup.
https://lore.kernel.org/lkml/[email protected]/

Changes in RFC v2:
1. Documentation fixes.
2. Added kernel log messages.
3. Changed charge API to treat misc_cg as input parameter.
4. Added helper APIs to get and release references on the cgroup.

[1] https://lore.kernel.org/lkml/[email protected]
Vipin Sharma (2):
cgroup: sev: Add misc cgroup controller
cgroup: sev: Miscellaneous cgroup documentation.

Documentation/admin-guide/cgroup-v1/index.rst | 1 +
Documentation/admin-guide/cgroup-v1/misc.rst | 4 +
Documentation/admin-guide/cgroup-v2.rst | 69 ++-
arch/x86/kvm/svm/sev.c | 65 ++-
arch/x86/kvm/svm/svm.h | 1 +
include/linux/cgroup_subsys.h | 4 +
include/linux/misc_cgroup.h | 122 +++++
init/Kconfig | 14 +
kernel/cgroup/Makefile | 1 +
kernel/cgroup/misc.c | 423 ++++++++++++++++++
10 files changed, 692 insertions(+), 12 deletions(-)
create mode 100644 Documentation/admin-guide/cgroup-v1/misc.rst
create mode 100644 include/linux/misc_cgroup.h
create mode 100644 kernel/cgroup/misc.c

--
2.30.1.766.gb4fecdf3b7-goog


2021-03-04 06:02:42

by Vipin Sharma

[permalink] [raw]
Subject: [RFC v2 1/2] cgroup: sev: Add misc cgroup controller

The Miscellaneous cgroup provides the resource limiting and tracking
mechanism for the scalar resources which cannot be abstracted like the
other cgroup resources. Controller is enabled by the CONFIG_CGROUP_MISC
config option.

The first two resources added to the miscellaneous controller are Secure
Encrypted Virtualization (SEV) ASIDs and SEV - Encrypted State (SEV-ES)
ASIDs. These limited ASIDs are used for encrypting virtual machines
memory on the AMD platform

Miscellaneous controller provides 3 interface files:

misc.capacity
A read-only flat-keyed file shown only in the root cgroup. It shows
miscellaneous scalar resources available on the platform along with
their quantities::

$ cat misc.capacity
sev 50
sev_es 10

misc.current
A read-only flat-keyed file shown in the non-root cgroups. It shows
the current usage of the resources in the cgroup and its children::

$ cat misc.current
sev 3
sev_es 0

misc.max
A read-write flat-keyed file shown in the non root cgroups. Allowed
maximum usage of the resources in the cgroup and its children.::

$ cat misc.max
sev max
sev_es 4

Limit can be set by::

# echo sev 1 > misc.max

Limit can be set to max by::

# echo sev max > misc.max

Limits can be set more than the capacity value in the misc.capacity
file.

Signed-off-by: Vipin Sharma <[email protected]>
Reviewed-by: David Rientjes <[email protected]>
---
arch/x86/kvm/svm/sev.c | 65 +++++-
arch/x86/kvm/svm/svm.h | 1 +
include/linux/cgroup_subsys.h | 4 +
include/linux/misc_cgroup.h | 122 ++++++++++
init/Kconfig | 14 ++
kernel/cgroup/Makefile | 1 +
kernel/cgroup/misc.c | 423 ++++++++++++++++++++++++++++++++++
7 files changed, 620 insertions(+), 10 deletions(-)
create mode 100644 include/linux/misc_cgroup.h
create mode 100644 kernel/cgroup/misc.c

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 48017fef1cd9..dd05a1522862 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -14,6 +14,7 @@
#include <linux/psp-sev.h>
#include <linux/pagemap.h>
#include <linux/swap.h>
+#include <linux/misc_cgroup.h>
#include <linux/processor.h>
#include <linux/trace_events.h>
#include <asm/fpu/internal.h>
@@ -27,6 +28,21 @@

#define __ex(x) __kvm_handle_fault_on_reboot(x)

+#ifndef CONFIG_KVM_AMD_SEV
+/*
+ * When this config is not defined, SEV feature is not supported and APIs in
+ * this file are not used but this file still gets compiled into the KVM AMD
+ * module.
+ *
+ * We will not have MISC_CG_RES_SEV and MISC_CG_RES_SEV_ES entries in the enum
+ * misc_res_type {} defined in linux/misc_cgroup.h.
+ *
+ * Below macros allow compilation to succeed.
+ */
+#define MISC_CG_RES_SEV MISC_CG_RES_TYPES
+#define MISC_CG_RES_SEV_ES MISC_CG_RES_TYPES
+#endif
+
static u8 sev_enc_bit;
static int sev_flush_asids(void);
static DECLARE_RWSEM(sev_deactivate_lock);
@@ -88,8 +104,17 @@ static bool __sev_recycle_asids(int min_asid, int max_asid)

static int sev_asid_new(struct kvm_sev_info *sev)
{
- int pos, min_asid, max_asid;
+ int pos, min_asid, max_asid, ret;
bool retry = true;
+ enum misc_res_type type;
+
+ type = sev->es_active ? MISC_CG_RES_SEV_ES : MISC_CG_RES_SEV;
+ sev->misc_cg = get_current_misc_cg();
+ ret = misc_cg_try_charge(type, sev->misc_cg, 1);
+ if (ret) {
+ put_misc_cg(sev->misc_cg);
+ return ret;
+ }

mutex_lock(&sev_bitmap_lock);

@@ -107,7 +132,8 @@ static int sev_asid_new(struct kvm_sev_info *sev)
goto again;
}
mutex_unlock(&sev_bitmap_lock);
- return -EBUSY;
+ ret = -EBUSY;
+ goto e_uncharge;
}

__set_bit(pos, sev_asid_bitmap);
@@ -115,6 +141,10 @@ static int sev_asid_new(struct kvm_sev_info *sev)
mutex_unlock(&sev_bitmap_lock);

return pos + 1;
+e_uncharge:
+ misc_cg_uncharge(type, sev->misc_cg, 1);
+ put_misc_cg(sev->misc_cg);
+ return ret;
}

static int sev_get_asid(struct kvm *kvm)
@@ -124,14 +154,15 @@ static int sev_get_asid(struct kvm *kvm)
return sev->asid;
}

-static void sev_asid_free(int asid)
+static void sev_asid_free(struct kvm_sev_info *sev)
{
struct svm_cpu_data *sd;
int cpu, pos;
+ enum misc_res_type type;

mutex_lock(&sev_bitmap_lock);

- pos = asid - 1;
+ pos = sev->asid - 1;
__set_bit(pos, sev_reclaim_asid_bitmap);

for_each_possible_cpu(cpu) {
@@ -140,6 +171,10 @@ static void sev_asid_free(int asid)
}

mutex_unlock(&sev_bitmap_lock);
+
+ type = sev->es_active ? MISC_CG_RES_SEV_ES : MISC_CG_RES_SEV;
+ misc_cg_uncharge(type, sev->misc_cg, 1);
+ put_misc_cg(sev->misc_cg);
}

static void sev_unbind_asid(struct kvm *kvm, unsigned int handle)
@@ -187,19 +222,19 @@ static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp)
asid = sev_asid_new(sev);
if (asid < 0)
return ret;
+ sev->asid = asid;

ret = sev_platform_init(&argp->error);
if (ret)
goto e_free;

sev->active = true;
- sev->asid = asid;
INIT_LIST_HEAD(&sev->regions_list);

return 0;

e_free:
- sev_asid_free(asid);
+ sev_asid_free(sev);
return ret;
}

@@ -1243,12 +1278,12 @@ void sev_vm_destroy(struct kvm *kvm)
mutex_unlock(&kvm->lock);

sev_unbind_asid(kvm, sev->handle);
- sev_asid_free(sev->asid);
+ sev_asid_free(sev);
}

void __init sev_hardware_setup(void)
{
- unsigned int eax, ebx, ecx, edx;
+ unsigned int eax, ebx, ecx, edx, sev_asid_count, sev_es_asid_count;
bool sev_es_supported = false;
bool sev_supported = false;

@@ -1280,7 +1315,11 @@ void __init sev_hardware_setup(void)
if (!sev_reclaim_asid_bitmap)
goto out;

- pr_info("SEV supported: %u ASIDs\n", max_sev_asid - min_sev_asid + 1);
+ sev_asid_count = max_sev_asid - min_sev_asid + 1;
+ if (misc_cg_set_capacity(MISC_CG_RES_SEV, sev_asid_count))
+ goto out;
+
+ pr_info("SEV supported: %u ASIDs\n", sev_asid_count);
sev_supported = true;

/* SEV-ES support requested? */
@@ -1295,7 +1334,11 @@ void __init sev_hardware_setup(void)
if (min_sev_asid == 1)
goto out;

- pr_info("SEV-ES supported: %u ASIDs\n", min_sev_asid - 1);
+ sev_es_asid_count = min_sev_asid - 1;
+ if (misc_cg_set_capacity(MISC_CG_RES_SEV_ES, sev_es_asid_count))
+ goto out;
+
+ pr_info("SEV-ES supported: %u ASIDs\n", sev_es_asid_count);
sev_es_supported = true;

out:
@@ -1310,6 +1353,8 @@ void sev_hardware_teardown(void)

bitmap_free(sev_asid_bitmap);
bitmap_free(sev_reclaim_asid_bitmap);
+ misc_cg_set_capacity(MISC_CG_RES_SEV, 0);
+ misc_cg_set_capacity(MISC_CG_RES_SEV_ES, 0);

sev_flush_asids();
}
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 6e7d070f8b86..8ed6ebf47885 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -79,6 +79,7 @@ struct kvm_sev_info {
unsigned long pages_locked; /* Number of pages locked */
struct list_head regions_list; /* List of registered regions */
u64 ap_jump_table; /* SEV-ES AP Jump Table address */
+ struct misc_cg *misc_cg; /* For misc cgroup accounting */
};

struct kvm_svm {
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index acb77dcff3b4..445235487230 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -61,6 +61,10 @@ SUBSYS(pids)
SUBSYS(rdma)
#endif

+#if IS_ENABLED(CONFIG_CGROUP_MISC)
+SUBSYS(misc)
+#endif
+
/*
* The following subsystems are not supported on the default hierarchy.
*/
diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
new file mode 100644
index 000000000000..6761bcbb3161
--- /dev/null
+++ b/include/linux/misc_cgroup.h
@@ -0,0 +1,122 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Miscellaneous cgroup controller.
+ *
+ * Copyright 2020 Google LLC
+ * Author: Vipin Sharma <[email protected]>
+ */
+#ifndef _MISC_CGROUP_H_
+#define _MISC_CGROUP_H_
+
+/**
+ * Types of misc cgroup entries supported by the host.
+ */
+enum misc_res_type {
+#ifdef CONFIG_KVM_AMD_SEV
+ /* AMD SEV ASIDs resource */
+ MISC_CG_RES_SEV,
+ /* AMD SEV-ES ASIDs resource */
+ MISC_CG_RES_SEV_ES,
+#endif
+ MISC_CG_RES_TYPES
+};
+
+struct misc_cg;
+
+#ifdef CONFIG_CGROUP_MISC
+
+/**
+ * struct misc_res: Per cgroup per misc type resource
+ * @max: Maximum count of the resource.
+ * @usage: Current usage of the resource.
+ */
+struct misc_res {
+ unsigned int max;
+ atomic_t usage;
+};
+
+/**
+ * struct misc_cg - Miscellaneous controller's cgroup structure.
+ * @css: cgroup subsys state object.
+ * @res: Array of misc resources usage in the cgroup.
+ */
+struct misc_cg {
+ struct cgroup_subsys_state css;
+ struct misc_res res[MISC_CG_RES_TYPES];
+};
+
+int misc_cg_set_capacity(enum misc_res_type type, unsigned int capacity);
+int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg,
+ unsigned int amount);
+void misc_cg_uncharge(enum misc_res_type type, struct misc_cg *cg,
+ unsigned int amount);
+
+/**
+ * css_misc() - Get misc cgroup from the css.
+ * @css: cgroup subsys state object.
+ *
+ * Context: Any context.
+ * Return:
+ * * %NULL - If @css is null.
+ * * struct misc_cg* - misc cgroup pointer of the passed css.
+ */
+static inline struct misc_cg *css_misc(struct cgroup_subsys_state *css)
+{
+ return css ? container_of(css, struct misc_cg, css) : NULL;
+}
+
+/*
+ * get_current_misc_cg() - Finds and get the misc cgroup of current task.
+ *
+ * Returned cgroup has its ref count increased by 1. Caller must call
+ * put_misc_cg() to return the reference.
+ *
+ * Return: Misc cgroup to which current task belongs to.
+ */
+static inline struct misc_cg *get_current_misc_cg(void)
+{
+ return css_misc(task_get_css(current, misc_cgrp_id));
+}
+
+/*
+ * put_misc_cg() - Put the misc cgroup and reduce its ref count.
+ * @cg - cgroup to put.
+ */
+static inline void put_misc_cg(struct misc_cg *cg)
+{
+ if (cg)
+ css_put(&cg->css);
+}
+
+#else /* !CONFIG_CGROUP_MISC */
+
+static inline int misc_cg_set_capacity(enum misc_res_type type,
+ unsigned int capacity)
+{
+ return 0;
+}
+
+static inline int misc_cg_try_charge(enum misc_res_type type,
+ struct misc_cg *cg,
+ unsigned int amount)
+{
+ return 0;
+}
+
+static inline void misc_cg_uncharge(enum misc_res_type type,
+ struct misc_cg *cg,
+ unsigned int amount)
+{
+}
+
+static inline struct misc_cg *get_current_misc_cg(void)
+{
+ return NULL;
+}
+
+static inline void put_misc_cg(struct misc_cg *cg)
+{
+}
+
+#endif /* CONFIG_CGROUP_MISC */
+#endif /* _MISC_CGROUP_H_ */
diff --git a/init/Kconfig b/init/Kconfig
index 29ad68325028..0b392135e555 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1105,6 +1105,20 @@ config CGROUP_BPF
BPF_CGROUP_INET_INGRESS will be executed on the ingress path of
inet sockets.

+config CGROUP_MISC
+ bool "Misc resource controller"
+ default n
+ help
+ Provides a controller for miscellaneous resources on a host.
+
+ Miscellaneous scalar resources are the resources on the host system
+ which cannot be abstracted like the other cgroups. This controller
+ tracks and limits the miscellaneous resources used by a process
+ attached to a cgroup hierarchy.
+
+ For more information, please check misc cgroup section in
+ /Documentation/admin-guide/cgroup-v2.rst.
+
config CGROUP_DEBUG
bool "Debug controller"
default n
diff --git a/kernel/cgroup/Makefile b/kernel/cgroup/Makefile
index 5d7a76bfbbb7..12f8457ad1f9 100644
--- a/kernel/cgroup/Makefile
+++ b/kernel/cgroup/Makefile
@@ -5,4 +5,5 @@ obj-$(CONFIG_CGROUP_FREEZER) += legacy_freezer.o
obj-$(CONFIG_CGROUP_PIDS) += pids.o
obj-$(CONFIG_CGROUP_RDMA) += rdma.o
obj-$(CONFIG_CPUSETS) += cpuset.o
+obj-$(CONFIG_CGROUP_MISC) += misc.o
obj-$(CONFIG_CGROUP_DEBUG) += debug.o
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
new file mode 100644
index 000000000000..c92730e0832f
--- /dev/null
+++ b/kernel/cgroup/misc.c
@@ -0,0 +1,423 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Miscellaneous cgroup controller
+ *
+ * Copyright 2020 Google LLC
+ * Author: Vipin Sharma <[email protected]>
+ */
+
+#include <linux/limits.h>
+#include <linux/cgroup.h>
+#include <linux/errno.h>
+#include <linux/atomic.h>
+#include <linux/slab.h>
+#include <linux/misc_cgroup.h>
+
+#define MAX_STR "max"
+#define MAX_NUM UINT_MAX
+
+/* Miscellaneous res name, keep it in sync with enum misc_res_type */
+static const char *const misc_res_name[] = {
+#ifdef CONFIG_KVM_AMD_SEV
+ "sev",
+ "sev_es",
+#endif
+};
+
+/* Root misc cgroup */
+static struct misc_cg root_cg;
+
+/*
+ * Miscellaneous resources capacity for the entire machine. 0 capacity means
+ * resource is not initialized or not present in the host.
+ *
+ * root_cg.max and capacity are independent of each other. root_cg.max can be
+ * more than the actual capacity. We are using Limits resource distribution
+ * model of cgroup for miscellaneous controller. However, root_cg.current for a
+ * resource will never exceeds the resource capacity.
+ */
+static unsigned int misc_res_capacity[MISC_CG_RES_TYPES];
+
+/**
+ * parent_misc() - Get the parent of the passed misc cgroup.
+ * @cgroup: cgroup whose parent needs to be fetched.
+ *
+ * Context: Any context.
+ * Return:
+ * * struct misc_cg* - Parent of the @cgroup.
+ * * %NULL - If @cgroup is null or the passed cgroup does not have a parent.
+ */
+static struct misc_cg *parent_misc(struct misc_cg *cgroup)
+{
+ return cgroup ? css_misc(cgroup->css.parent) : NULL;
+}
+
+/**
+ * valid_type() - Check if @type is valid or not.
+ * @type: misc res type.
+ *
+ * Context: Any context.
+ * Return:
+ * * true - If valid type.
+ * * false - If not valid type.
+ */
+static inline bool valid_type(enum misc_res_type type)
+{
+ return type >= 0 && type < MISC_CG_RES_TYPES;
+}
+
+/**
+ * misc_cg_set_capacity() - Set the capacity of the misc cgroup res.
+ * @type: Type of the misc res.
+ * @capacity: Supported capacity of the misc res on the host.
+ *
+ * If capacity is 0 then the charging a misc cgroup fails for that type.
+ *
+ * Caller must:
+ * 1. Serialize the invocations on the same resource.
+ * 2. Make sure that the usage is 0 before deactivating the resource by setting
+ * its capacity to 0.
+ *
+ * Context: Process context.
+ * Return:
+ * * %0 - Successfully registered the capacity.
+ * * %-EINVAL - If @type is invalid.
+ * * %-EBUSY - If current usage is more than the capacity.
+ */
+int misc_cg_set_capacity(enum misc_res_type type, unsigned int capacity)
+{
+ if (!valid_type(type))
+ return -EINVAL;
+
+ for (;;) {
+ int usage;
+ unsigned int old;
+
+ /*
+ * Update the capacity while making sure that it's not below
+ * the concurrently-changing usage value.
+ *
+ * The xchg implies two full memory barriers before and after,
+ * so the read-swap-read is ordered and ensures coherency with
+ * misc_cg_try_charge(): that function modifies the usage
+ * before checking the capacity, so if it sees the old
+ * capacity, we see the modified usage and retry.
+ */
+ usage = atomic_read(&root_cg.res[type].usage);
+
+ if (usage > capacity)
+ return -EBUSY;
+
+ old = xchg(&misc_res_capacity[type], capacity);
+
+ if (atomic_read(&root_cg.res[type].usage) <= usage)
+ return 0;
+
+ misc_res_capacity[type] = old;
+ cond_resched();
+ }
+}
+EXPORT_SYMBOL(misc_cg_set_capacity);
+
+/**
+ * misc_cg_reduce_charge() - Reduce the charge from misc cgroup.
+ * @type: Misc res type in misc cg to reduce the charge from.
+ * @cg: Misc cgroup to reduce charge from.
+ * @amount: Amount to reduce.
+ *
+ * Context: Any context.
+ */
+static void misc_cg_reduce_charge(enum misc_res_type type, struct misc_cg *cg,
+ unsigned int amount)
+{
+ WARN_ONCE(atomic_add_negative(-amount, &cg->res[type].usage),
+ "misc cgroup resource %s became less than 0",
+ misc_res_name[type]);
+}
+
+/**
+ * misc_cg_try_charge() - Try charging misc cgroup.
+ * @type: misc res type to charge.
+ * @cg: Misc cgroup which will be charged, out parameter.
+ * @amount: Amount to charge.
+ *
+ * Charge @amount to the cgroup to which the current task belongs to. Charged
+ * cgroup will be pointed by @cg. Caller must use the same cgroup during
+ * uncharge call.
+ *
+ * Context: Any context.
+ * Return:
+ * * %0 - If successfully charged.
+ * * -EINVAL - If @type is invalid or misc res has 0 capacity.
+ * * -EBUSY - If max limit will be crossed or total usage will be more than the
+ * capacity.
+ */
+int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg,
+ unsigned int amount)
+{
+ struct misc_cg *i, *j;
+ int ret;
+ struct misc_res *res;
+ int new_usage;
+
+ if (!(valid_type(type) && cg && misc_res_capacity[type]))
+ return -EINVAL;
+
+ if (!amount)
+ return 0;
+
+ for (i = cg; i; i = parent_misc(i)) {
+ res = &i->res[type];
+
+ /*
+ * The atomic_long_add_return() implies a full memory barrier
+ * between incrementing the count and reading the capacity.
+ * When racing with misc_cg_set_capacity(), we either see the
+ * new capacity or the setter sees the counter has changed and
+ * retries.
+ */
+ new_usage = atomic_add_return(amount, &res->usage);
+ if (new_usage > res->max ||
+ new_usage > misc_res_capacity[type]) {
+ pr_info("cgroup: charge rejected by misc controller for %s resource in ",
+ misc_res_name[type]);
+ pr_cont_cgroup_path(i->css.cgroup);
+ pr_cont("\n");
+ ret = -EBUSY;
+ goto err_charge;
+ }
+ }
+ return 0;
+
+err_charge:
+ for (j = cg; j != i; j = parent_misc(j))
+ misc_cg_reduce_charge(type, j, amount);
+ misc_cg_reduce_charge(type, i, amount);
+ return ret;
+}
+EXPORT_SYMBOL(misc_cg_try_charge);
+
+/**
+ * misc_cg_uncharge() - Uncharge the misc cgroup.
+ * @type: Misc res type which was charged.
+ * @cg: Misc cgroup which will be uncharged.
+ * @amount: Charged amount.
+ *
+ * Context: Any context.
+ */
+void misc_cg_uncharge(enum misc_res_type type, struct misc_cg *cg,
+ unsigned int amount)
+{
+ struct misc_cg *i;
+
+ if (!(amount && valid_type(type) && cg))
+ return;
+
+ for (i = cg; i; i = parent_misc(i))
+ misc_cg_reduce_charge(type, i, amount);
+}
+EXPORT_SYMBOL(misc_cg_uncharge);
+
+/**
+ * misc_cg_max_show() - Show misc cgroup max limit.
+ * @sf: Interface file
+ * @v: Arguments passed
+ *
+ * Context: Any context.
+ * Return: 0 to denote successful print.
+ */
+static int misc_cg_max_show(struct seq_file *sf, void *v)
+{
+ int i;
+ struct misc_cg *cg = css_misc(seq_css(sf));
+
+ for (i = 0; i < MISC_CG_RES_TYPES; i++) {
+ if (misc_res_capacity[i]) {
+ if (cg->res[i].max == MAX_NUM)
+ seq_printf(sf, "%s max\n", misc_res_name[i]);
+ else
+ seq_printf(sf, "%s %u\n", misc_res_name[i],
+ cg->res[i].max);
+ }
+ }
+
+ return 0;
+}
+
+/**
+ * misc_cg_max_write() - Update the maximum limit of the cgroup.
+ * @of: Handler for the file.
+ * @buf: Data from the user. It should be either "max", 0, or a positive
+ * integer.
+ * @nbytes: Number of bytes of the data.
+ * @off: Offset in the file.
+ *
+ * User can pass data like:
+ * echo sev 23 > misc.max, OR
+ * echo sev max > misc.max
+ *
+ * Context: Any context.
+ * Return:
+ * * >= 0 - Number of bytes processed in the input.
+ * * -EINVAL - If buf is not valid.
+ * * -ERANGE - If number is bigger than unsigned int capacity.
+ */
+static ssize_t misc_cg_max_write(struct kernfs_open_file *of, char *buf,
+ size_t nbytes, loff_t off)
+{
+ struct misc_cg *cg;
+ unsigned int max;
+ int ret = 0, i;
+ enum misc_res_type type = MISC_CG_RES_TYPES;
+ char *token;
+
+ buf = strstrip(buf);
+ token = strsep(&buf, " ");
+
+ if (!token || !buf)
+ return -EINVAL;
+
+ for (i = 0; i < MISC_CG_RES_TYPES; i++) {
+ if (!strcmp(misc_res_name[i], token)) {
+ type = i;
+ break;
+ }
+ }
+
+ if (type == MISC_CG_RES_TYPES)
+ return -EINVAL;
+
+ if (!strcmp(MAX_STR, buf)) {
+ max = UINT_MAX;
+ } else {
+ ret = kstrtouint(buf, 0, &max);
+ if (ret)
+ return ret;
+ }
+
+ cg = css_misc(of_css(of));
+
+ if (misc_res_capacity[type])
+ cg->res[type].max = max;
+ else
+ ret = -EINVAL;
+
+ return ret ? ret : nbytes;
+}
+
+/**
+ * misc_cg_current_show() - Show current usage of the misc cgroup.
+ * @sf: Interface file
+ * @v: Arguments passed
+ *
+ * Context: Any context.
+ * Return: 0 to denote successful print.
+ */
+static int misc_cg_current_show(struct seq_file *sf, void *v)
+{
+ int i;
+ struct misc_cg *cg = css_misc(seq_css(sf));
+
+ for (i = 0; i < MISC_CG_RES_TYPES; i++) {
+ if (misc_res_capacity[i])
+ seq_printf(sf, "%s %u\n", misc_res_name[i],
+ atomic_read(&cg->res[i].usage));
+ }
+
+ return 0;
+}
+
+/**
+ * misc_cg_capacity_show() - Show the total capacity of misc res on the host.
+ * @sf: Interface file
+ * @v: Arguments passed
+ *
+ * Only present in the root cgroup directory.
+ *
+ * Context: Any context.
+ * Return: 0 to denote successful print.
+ */
+static int misc_cg_capacity_show(struct seq_file *sf, void *v)
+{
+ int i;
+ unsigned int cap;
+
+ for (i = 0; i < MISC_CG_RES_TYPES; i++) {
+ cap = READ_ONCE(misc_res_capacity[i]);
+ if (cap)
+ seq_printf(sf, "%s %u\n", misc_res_name[i], cap);
+ }
+
+ return 0;
+}
+
+/* Misc cgroup interface files */
+static struct cftype misc_cg_files[] = {
+ {
+ .name = "max",
+ .write = misc_cg_max_write,
+ .seq_show = misc_cg_max_show,
+ .flags = CFTYPE_NOT_ON_ROOT,
+ },
+ {
+ .name = "current",
+ .seq_show = misc_cg_current_show,
+ .flags = CFTYPE_NOT_ON_ROOT,
+ },
+ {
+ .name = "capacity",
+ .seq_show = misc_cg_capacity_show,
+ .flags = CFTYPE_ONLY_ON_ROOT,
+ },
+ {}
+};
+
+/**
+ * misc_cg_alloc() - Allocate misc cgroup.
+ * @parent_css: Parent cgroup.
+ *
+ * Context: Process context.
+ * Return:
+ * * struct cgroup_subsys_state* - css of the allocated cgroup.
+ * * ERR_PTR(-ENOMEM) - No memory available to allocate.
+ */
+static struct cgroup_subsys_state *
+misc_cg_alloc(struct cgroup_subsys_state *parent_css)
+{
+ enum misc_res_type i;
+ struct misc_cg *cg;
+
+ if (!parent_css) {
+ cg = &root_cg;
+ } else {
+ cg = kzalloc(sizeof(*cg), GFP_KERNEL);
+ if (!cg)
+ return ERR_PTR(-ENOMEM);
+ }
+
+ for (i = 0; i < MISC_CG_RES_TYPES; i++) {
+ cg->res[i].max = MAX_NUM;
+ atomic_set(&cg->res[i].usage, 0);
+ }
+
+ return &cg->css;
+}
+
+/**
+ * misc_cg_free() - Free the misc cgroup.
+ * @css: cgroup subsys object.
+ *
+ * Context: Any context.
+ */
+static void misc_cg_free(struct cgroup_subsys_state *css)
+{
+ kfree(css_misc(css));
+}
+
+/* Cgroup controller callbacks */
+struct cgroup_subsys misc_cgrp_subsys = {
+ .css_alloc = misc_cg_alloc,
+ .css_free = misc_cg_free,
+ .legacy_cftypes = misc_cg_files,
+ .dfl_cftypes = misc_cg_files,
+};
--
2.30.1.766.gb4fecdf3b7-goog

2021-03-04 06:02:47

by Vipin Sharma

[permalink] [raw]
Subject: [RFC v2 2/2] cgroup: sev: Miscellaneous cgroup documentation.

Documentation of miscellaneous cgroup controller. This new controller is
used to track and limit the usage of scalar resources.

Signed-off-by: Vipin Sharma <[email protected]>
Reviewed-by: David Rientjes <[email protected]>
---
Documentation/admin-guide/cgroup-v1/index.rst | 1 +
Documentation/admin-guide/cgroup-v1/misc.rst | 4 ++
Documentation/admin-guide/cgroup-v2.rst | 69 ++++++++++++++++++-
3 files changed, 72 insertions(+), 2 deletions(-)
create mode 100644 Documentation/admin-guide/cgroup-v1/misc.rst

diff --git a/Documentation/admin-guide/cgroup-v1/index.rst b/Documentation/admin-guide/cgroup-v1/index.rst
index 226f64473e8e..99fbc8a64ba9 100644
--- a/Documentation/admin-guide/cgroup-v1/index.rst
+++ b/Documentation/admin-guide/cgroup-v1/index.rst
@@ -17,6 +17,7 @@ Control Groups version 1
hugetlb
memcg_test
memory
+ misc
net_cls
net_prio
pids
diff --git a/Documentation/admin-guide/cgroup-v1/misc.rst b/Documentation/admin-guide/cgroup-v1/misc.rst
new file mode 100644
index 000000000000..661614c24df3
--- /dev/null
+++ b/Documentation/admin-guide/cgroup-v1/misc.rst
@@ -0,0 +1,4 @@
+===============
+Misc controller
+===============
+Please refer "Misc" documentation in Documentation/admin-guide/cgroup-v2.rst
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 1de8695c264b..74777323b7fd 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -63,8 +63,11 @@ v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgrou
5-7-1. RDMA Interface Files
5-8. HugeTLB
5.8-1. HugeTLB Interface Files
- 5-8. Misc
- 5-8-1. perf_event
+ 5-9. Misc
+ 5.9-1 Miscellaneous cgroup Interface Files
+ 5.9-2 Migration and Ownership
+ 5-10. Others
+ 5-10-1. perf_event
5-N. Non-normative information
5-N-1. CPU controller root cgroup process behaviour
5-N-2. IO controller root cgroup process behaviour
@@ -2163,6 +2166,68 @@ HugeTLB Interface Files
Misc
----

+The Miscellaneous cgroup provides the resource limiting and tracking
+mechanism for the scalar resources which cannot be abstracted like the other
+cgroup resources. Controller is enabled by the CONFIG_CGROUP_MISC config
+option.
+
+The first two resources added to the miscellaneous controller are Secure
+Encrypted Virtualization (SEV) ASIDs and SEV - Encrypted State (SEV-ES) ASIDs.
+These limited ASIDs are used for encrypting virtual machines memory on the AMD
+platform.
+
+Misc Interface Files
+~~~~~~~~~~~~~~~~~~~~
+
+Miscellaneous controller provides 3 interface files:
+
+ misc.capacity
+ A read-only flat-keyed file shown only in the root cgroup. It shows
+ miscellaneous scalar resources available on the platform along with
+ their quantities::
+
+ $ cat misc.capacity
+ sev 50
+ sev_es 10
+
+ misc.current
+ A read-only flat-keyed file shown in the non-root cgroups. It shows
+ the current usage of the resources in the cgroup and its children.::
+
+ $ cat misc.current
+ sev 3
+ sev_es 0
+
+ misc.max
+ A read-write flat-keyed file shown in the non root cgroups. Allowed
+ maximum usage of the resources in the cgroup and its children.::
+
+ $ cat misc.max
+ sev max
+ sev_es 4
+
+ Limit can be set by::
+
+ # echo sev 1 > misc.max
+
+ Limit can be set to max by::
+
+ # echo sev max > misc.max
+
+ Limits can be set higher than the capacity value in the misc.capacity
+ file.
+
+Migration and Ownership
+~~~~~~~~~~~~~~~~~~~~~~~
+
+A miscellaneous scalar resource is charged to the cgroup in which it is used
+first, and stays charged to that cgroup until that resource is freed. Migrating
+a process to a different cgroup does not move the charge to the destination
+cgroup where the process has moved.
+
+Others
+------
+
perf_event
~~~~~~~~~~

--
2.30.1.766.gb4fecdf3b7-goog

2021-03-04 15:13:16

by Vipin Sharma

[permalink] [raw]
Subject: Re: [RFC v2 2/2] cgroup: sev: Miscellaneous cgroup documentation.

On Wed, Mar 03, 2021 at 06:55:13PM -0800, Jacob Pan wrote:
> Hi Vipin,
>
> On Tue, 2 Mar 2021 00:17:05 -0800, Vipin Sharma <[email protected]> wrote:
>
> > +Migration and Ownership
> > +~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +A miscellaneous scalar resource is charged to the cgroup in which it is
> > used +first, and stays charged to that cgroup until that resource is
> > freed. Migrating +a process to a different cgroup does not move the
> > charge to the destination +cgroup where the process has moved.
> > +
> I am trying to see if IOASIDs cgroup can also fit in this misc controller
> as yet another resource type.
> https://lore.kernel.org/linux-iommu/20210303131726.7a8cb169@jacob-builder/T/#u
> However, unlike sev IOASIDs need to be migrated if the process is moved to
> another cgroup. i.e. charge the destination and uncharge the source.
>
> Do you think this behavior can be achieved by differentiating resource
> types? i.e. add attach callbacks for certain types. Having a single misc
> interface seems cleaner than creating another controller.

I think it makes sense to add support for migration for the resources
which need it. Resources like SEV, SEV-ES will not participate in
migration and won't stop can_attach() to succeed, other resources which
need migration will allow or stop based on their limits and capacity in
the destination.

2021-03-04 15:40:55

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC v2 2/2] cgroup: sev: Miscellaneous cgroup documentation.

Hello,

On Wed, Mar 03, 2021 at 10:22:03PM -0800, Vipin Sharma wrote:
> > I am trying to see if IOASIDs cgroup can also fit in this misc controller
> > as yet another resource type.
> > https://lore.kernel.org/linux-iommu/20210303131726.7a8cb169@jacob-builder/T/#u
> > However, unlike sev IOASIDs need to be migrated if the process is moved to
> > another cgroup. i.e. charge the destination and uncharge the source.
> >
> > Do you think this behavior can be achieved by differentiating resource
> > types? i.e. add attach callbacks for certain types. Having a single misc
> > interface seems cleaner than creating another controller.
>
> I think it makes sense to add support for migration for the resources
> which need it. Resources like SEV, SEV-ES will not participate in
> migration and won't stop can_attach() to succeed, other resources which
> need migration will allow or stop based on their limits and capacity in
> the destination.

Please note that cgroup2 by and large don't really like or support charge
migration or even migrations themselves. We tried that w/ memcg on cgroup1
and it turned out horrible. The expected usage model as decribed in the doc
is using migration to seed a cgroup (or even better, use the new clone call
to start in the target cgroup) and then stay there until exit. All existing
controllers assume this usage model and I'm likely to nack deviation unless
there are some super strong justifications.

Thanks.

--
tejun

2021-03-04 15:45:01

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC v2 1/2] cgroup: sev: Add misc cgroup controller

Hello,

On Wed, Mar 03, 2021 at 10:12:32PM -0800, Vipin Sharma wrote:
> Right now there is no API for the caller to know total usage, unless they
> keep their own tally, I was thinking it will be useful to add one more API
>
> unsigned long misc_cg_res_total_usage(enum misc_res_type type)
>
> It will return root_cg usage for "type" resource.
> Will it be fine?

Yeah, that sounds fine.

Thanks.

--
tejun

2021-03-04 23:24:30

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC v2 1/2] cgroup: sev: Add misc cgroup controller

Hello,

On Tue, Mar 02, 2021 at 12:17:04AM -0800, Vipin Sharma wrote:
> +/**
> + * struct misc_res: Per cgroup per misc type resource
> + * @max: Maximum count of the resource.
> + * @usage: Current usage of the resource.
> + */
> +struct misc_res {
> + unsigned int max;
> + atomic_t usage;
> +};

Can we do 64bits so that something which counts memory can use this too?

> +/*
> + * Miscellaneous resources capacity for the entire machine. 0 capacity means
> + * resource is not initialized or not present in the host.
> + *
> + * root_cg.max and capacity are independent of each other. root_cg.max can be
> + * more than the actual capacity. We are using Limits resource distribution
> + * model of cgroup for miscellaneous controller. However, root_cg.current for a
> + * resource will never exceeds the resource capacity.
^
typo

> +int misc_cg_set_capacity(enum misc_res_type type, unsigned int capacity)
> +{
> + if (!valid_type(type))
> + return -EINVAL;
> +
> + for (;;) {
> + int usage;
> + unsigned int old;
> +
> + /*
> + * Update the capacity while making sure that it's not below
> + * the concurrently-changing usage value.
> + *
> + * The xchg implies two full memory barriers before and after,
> + * so the read-swap-read is ordered and ensures coherency with
> + * misc_cg_try_charge(): that function modifies the usage
> + * before checking the capacity, so if it sees the old
> + * capacity, we see the modified usage and retry.
> + */
> + usage = atomic_read(&root_cg.res[type].usage);
> +
> + if (usage > capacity)
> + return -EBUSY;

I'd rather go with allowing bringing down capacity below usage so that the
users can set it to a lower value to drain existing usages while denying new
ones. It's not like it's difficult to check the current total usage from the
caller side, so I'm not sure it's very useful to shift the condition check
here.

> +int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg,
> + unsigned int amount)
> +{
...
> + for (i = cg; i; i = parent_misc(i)) {
> + res = &i->res[type];
> +
> + /*
> + * The atomic_long_add_return() implies a full memory barrier
> + * between incrementing the count and reading the capacity.
> + * When racing with misc_cg_set_capacity(), we either see the
> + * new capacity or the setter sees the counter has changed and
> + * retries.
> + */
> + new_usage = atomic_add_return(amount, &res->usage);
> + if (new_usage > res->max ||
> + new_usage > misc_res_capacity[type]) {
> + pr_info("cgroup: charge rejected by misc controller for %s resource in ",
> + misc_res_name[type]);
> + pr_cont_cgroup_path(i->css.cgroup);
> + pr_cont("\n");

Should have commented on this in the priv thread but don't print something
on every rejection. This often becomes a nuisance and can make an easy DoS
vector at worst. If you wanna do it, print it once per cgroup or sth like
that.

Otherwise, looks good to me.

Thanks.

--
tejun

2021-03-05 00:00:41

by Jacob Pan

[permalink] [raw]
Subject: Re: [RFC v2 2/2] cgroup: sev: Miscellaneous cgroup documentation.

Hi Vipin,

On Tue, 2 Mar 2021 00:17:05 -0800, Vipin Sharma <[email protected]> wrote:

> +Migration and Ownership
> +~~~~~~~~~~~~~~~~~~~~~~~
> +
> +A miscellaneous scalar resource is charged to the cgroup in which it is
> used +first, and stays charged to that cgroup until that resource is
> freed. Migrating +a process to a different cgroup does not move the
> charge to the destination +cgroup where the process has moved.
> +
I am trying to see if IOASIDs cgroup can also fit in this misc controller
as yet another resource type.
https://lore.kernel.org/linux-iommu/20210303131726.7a8cb169@jacob-builder/T/#u
However, unlike sev IOASIDs need to be migrated if the process is moved to
another cgroup. i.e. charge the destination and uncharge the source.

Do you think this behavior can be achieved by differentiating resource
types? i.e. add attach callbacks for certain types. Having a single misc
interface seems cleaner than creating another controller.

Thanks,

Jacob

2021-03-05 00:03:18

by Vipin Sharma

[permalink] [raw]
Subject: Re: [RFC v2 1/2] cgroup: sev: Add misc cgroup controller

On Wed, Mar 03, 2021 at 10:42:37AM -0500, Tejun Heo wrote:
> > + atomic_t usage;
> > +};
>
> Can we do 64bits so that something which counts memory can use this too?
>
Sure.

> > +
> > + if (usage > capacity)
> > + return -EBUSY;
>
> I'd rather go with allowing bringing down capacity below usage so that the
> users can set it to a lower value to drain existing usages while denying new
> ones. It's not like it's difficult to check the current total usage from the
> caller side, so I'm not sure it's very useful to shift the condition check
> here.
>

Okay, I will change the code to set new capacity unconditionally.

Right now there is no API for the caller to know total usage, unless they
keep their own tally, I was thinking it will be useful to add one more API

unsigned long misc_cg_res_total_usage(enum misc_res_type type)

It will return root_cg usage for "type" resource.
Will it be fine?

> > + pr_info("cgroup: charge rejected by misc controller for %s resource in ",
> > + misc_res_name[type]);
> > + pr_cont_cgroup_path(i->css.cgroup);
> > + pr_cont("\n");
>
> Should have commented on this in the priv thread but don't print something
> on every rejection. This often becomes a nuisance and can make an easy DoS
> vector at worst. If you wanna do it, print it once per cgroup or sth like
> that.

I didn't think in that way. Thanks, I will print it once per cgroup.

Thanks
Vipin

2021-03-12 20:57:58

by Jacob Pan

[permalink] [raw]
Subject: Re: [RFC v2 2/2] cgroup: sev: Miscellaneous cgroup documentation.

Hi Vipin & Tejun,

Sorry for the late reply, I sent from a different email address than I
intended. Please see my comments inline.


On Thu, 4 Mar 2021 03:51:16 -0500, Tejun Heo <[email protected]> wrote:

> Hello,
>
> On Wed, Mar 03, 2021 at 10:22:03PM -0800, Vipin Sharma wrote:
> > > I am trying to see if IOASIDs cgroup can also fit in this misc
> > > controller as yet another resource type.
> > > https://lore.kernel.org/linux-iommu/20210303131726.7a8cb169@jacob-builder/T/#u
> > > However, unlike sev IOASIDs need to be migrated if the process is
> > > moved to another cgroup. i.e. charge the destination and uncharge the
> > > source.
> > >
> > > Do you think this behavior can be achieved by differentiating resource
> > > types? i.e. add attach callbacks for certain types. Having a single
> > > misc interface seems cleaner than creating another controller.
> >
> > I think it makes sense to add support for migration for the resources
> > which need it. Resources like SEV, SEV-ES will not participate in
> > migration and won't stop can_attach() to succeed, other resources which
> > need migration will allow or stop based on their limits and capacity in
> > the destination.
>
Sounds good. Perhaps some capability/feature flags for each resource such
that different behavior can be accommodated?
Could you please include me in your future posting? I will rebase on yours.

> Please note that cgroup2 by and large don't really like or support charge
> migration or even migrations themselves. We tried that w/ memcg on cgroup1
> and it turned out horrible. The expected usage model as decribed in the
> doc is using migration to seed a cgroup (or even better, use the new
> clone call to start in the target cgroup) and then stay there until exit.
> All existing controllers assume this usage model and I'm likely to nack
> deviation unless there are some super strong justifications.
>
Thank you so much for the pointers. Just to be clear, you meant
1. Use clone3 CLONE_INTO_CGROUP to put the child into a different cgroup.
2. Do not support migration of the parent (existing proc)

Thanks,

Jacob

2021-03-12 21:19:26

by Vipin Sharma

[permalink] [raw]
Subject: Re: [RFC v2 2/2] cgroup: sev: Miscellaneous cgroup documentation.

On Fri, Mar 12, 2021 at 12:58:21PM -0800, Jacob Pan wrote:
> Hi Vipin & Tejun,
>
> Sorry for the late reply, I sent from a different email address than I
> intended. Please see my comments inline.
>
>
> On Thu, 4 Mar 2021 03:51:16 -0500, Tejun Heo <[email protected]> wrote:
>
> > Hello,
> >
> > On Wed, Mar 03, 2021 at 10:22:03PM -0800, Vipin Sharma wrote:
> > > > I am trying to see if IOASIDs cgroup can also fit in this misc
> > > > controller as yet another resource type.
> > > > https://lore.kernel.org/linux-iommu/20210303131726.7a8cb169@jacob-builder/T/#u
> > > > However, unlike sev IOASIDs need to be migrated if the process is
> > > > moved to another cgroup. i.e. charge the destination and uncharge the
> > > > source.
> > > >
> > > > Do you think this behavior can be achieved by differentiating resource
> > > > types? i.e. add attach callbacks for certain types. Having a single
> > > > misc interface seems cleaner than creating another controller.
> > >
> > > I think it makes sense to add support for migration for the resources
> > > which need it. Resources like SEV, SEV-ES will not participate in
> > > migration and won't stop can_attach() to succeed, other resources which
> > > need migration will allow or stop based on their limits and capacity in
> > > the destination.
> >
> Sounds good. Perhaps some capability/feature flags for each resource such
> that different behavior can be accommodated?
> Could you please include me in your future posting? I will rebase on yours.

Hi Jacob

Based on Tejun's response, I will not add charge migration support in
misc controller.

I can definitly add you in my future posting, if you still wanna use it
without charge migration support.

Thanks
Vipin

2021-03-12 22:58:40

by Jacob Pan

[permalink] [raw]
Subject: Re: [RFC v2 2/2] cgroup: sev: Miscellaneous cgroup documentation.

Hi Vipin,

On Fri, 12 Mar 2021 13:15:14 -0800, Vipin Sharma <[email protected]> wrote:

> On Fri, Mar 12, 2021 at 12:58:21PM -0800, Jacob Pan wrote:
> > Hi Vipin & Tejun,
> >
> > Sorry for the late reply, I sent from a different email address than I
> > intended. Please see my comments inline.
> >
> >
> > On Thu, 4 Mar 2021 03:51:16 -0500, Tejun Heo <[email protected]> wrote:
> >
> > > Hello,
> > >
> > > On Wed, Mar 03, 2021 at 10:22:03PM -0800, Vipin Sharma wrote:
> > > > > I am trying to see if IOASIDs cgroup can also fit in this misc
> > > > > controller as yet another resource type.
> > > > > https://lore.kernel.org/linux-iommu/20210303131726.7a8cb169@jacob-builder/T/#u
> > > > > However, unlike sev IOASIDs need to be migrated if the process is
> > > > > moved to another cgroup. i.e. charge the destination and uncharge
> > > > > the source.
> > > > >
> > > > > Do you think this behavior can be achieved by differentiating
> > > > > resource types? i.e. add attach callbacks for certain types.
> > > > > Having a single misc interface seems cleaner than creating
> > > > > another controller.
> > > >
> > > > I think it makes sense to add support for migration for the
> > > > resources which need it. Resources like SEV, SEV-ES will not
> > > > participate in migration and won't stop can_attach() to succeed,
> > > > other resources which need migration will allow or stop based on
> > > > their limits and capacity in the destination.
> > >
> > Sounds good. Perhaps some capability/feature flags for each resource
> > such that different behavior can be accommodated?
> > Could you please include me in your future posting? I will rebase on
> > yours.
>
> Hi Jacob
>
> Based on Tejun's response, I will not add charge migration support in
> misc controller.
>
Sounds good. I need some confirmation on whether migration is a must have
for VMs allocated IOASIDs.
Our primary goal is to limit the amount of IOASIDs that VMs can allocate.
If a VM is migrated to a different cgroup, I think we need to
charge/uncharge the destination/source cgroup in order enforce the limit. I
am not an expert here, any feedback would be appreciated.

> I can definitly add you in my future posting, if you still wanna use it
> without charge migration support.
>
Yes, please. I got your v3 already, so just future patches.

> Thanks
> Vipin


Thanks,

Jacob

2021-03-13 10:37:24

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC v2 2/2] cgroup: sev: Miscellaneous cgroup documentation.

On Fri, Mar 12, 2021 at 02:59:04PM -0800, Jacob Pan wrote:
> Our primary goal is to limit the amount of IOASIDs that VMs can allocate.
> If a VM is migrated to a different cgroup, I think we need to
> charge/uncharge the destination/source cgroup in order enforce the limit. I
> am not an expert here, any feedback would be appreciated.

That simply isn't a supported usage model. None of other resources will get
tracked if you do that.

Thanks.

--
tejun

2021-03-13 16:56:36

by Jacob Pan

[permalink] [raw]
Subject: Re: [RFC v2 2/2] cgroup: sev: Miscellaneous cgroup documentation.

Hi Tejun,

On Sat, 13 Mar 2021 05:20:39 -0500, Tejun Heo <[email protected]> wrote:

> On Fri, Mar 12, 2021 at 02:59:04PM -0800, Jacob Pan wrote:
> > Our primary goal is to limit the amount of IOASIDs that VMs can
> > allocate. If a VM is migrated to a different cgroup, I think we need to
> > charge/uncharge the destination/source cgroup in order enforce the
> > limit. I am not an expert here, any feedback would be appreciated.
>
> That simply isn't a supported usage model. None of other resources will
> get tracked if you do that.
Isn't PIDs controller doing the charge/uncharge? I was under the impression
that each resource can be independently charged/uncharged, why it affects
other resources? Sorry for the basic question.

I also didn't quite get the limitation on cgroup v2 migration, this is much
simpler than memcg. Could you give me some pointers?

BTW, since the IOASIDs are used to tag DMA and bound with guest process(mm)
for shared virtual addressing. fork() cannot be supported, so I guess clone
is not a solution here.

Thanks,

Jacob

2021-03-13 18:11:40

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC v2 2/2] cgroup: sev: Miscellaneous cgroup documentation.

Hello,

On Sat, Mar 13, 2021 at 08:57:01AM -0800, Jacob Pan wrote:
> Isn't PIDs controller doing the charge/uncharge? I was under the impression
> that each resource can be independently charged/uncharged, why it affects
> other resources? Sorry for the basic question.

Yeah, PID is an exception as we needed the initial migration to seed new
cgroups and it gets really confusing with other ways to observe the
processes - e.g. if you follow the original way of creating a cgroup,
forking and then moving the seed process into the target cgroup, if we don't
migrate the pid charge together, the numbers wouldn't agree and the seeder
cgroup may end up running out of pids if there are any restrictions.

> I also didn't quite get the limitation on cgroup v2 migration, this is much
> simpler than memcg. Could you give me some pointers?

Migration itself doesn't have restrictions but all resources are distributed
on the same hierarchy, so the controllers are supposed to follow the same
conventions that can be implemented by all controllers.

> BTW, since the IOASIDs are used to tag DMA and bound with guest process(mm)
> for shared virtual addressing. fork() cannot be supported, so I guess clone
> is not a solution here.

Can you please elaborate what wouldn't work? The new spawning into a new
cgroup w/ clone doesn't really change the usage model. It's just a neater
way to seed a new cgroup. If you're saying that the overall usage model
doesn't fit the needs of IOASIDs, it likely shouldn't be a cgroup
controller.

Thanks.

--
tejun

2021-03-15 22:16:20

by Jacob Pan

[permalink] [raw]
Subject: Re: [RFC v2 2/2] cgroup: sev: Miscellaneous cgroup documentation.

Hi Tejun,

On Sat, 13 Mar 2021 13:05:36 -0500, Tejun Heo <[email protected]> wrote:

> Hello,
>
> On Sat, Mar 13, 2021 at 08:57:01AM -0800, Jacob Pan wrote:
> > Isn't PIDs controller doing the charge/uncharge? I was under the
> > impression that each resource can be independently charged/uncharged,
> > why it affects other resources? Sorry for the basic question.
>
> Yeah, PID is an exception as we needed the initial migration to seed new
> cgroups and it gets really confusing with other ways to observe the
> processes - e.g. if you follow the original way of creating a cgroup,
> forking and then moving the seed process into the target cgroup, if we
> don't migrate the pid charge together, the numbers wouldn't agree and the
> seeder cgroup may end up running out of pids if there are any
> restrictions.
>
Thanks for explaining. Unfortunately, it seems IOASIDs has a similar needs
in terms of migrating the charge.

> > I also didn't quite get the limitation on cgroup v2 migration, this is
> > much simpler than memcg. Could you give me some pointers?
>
> Migration itself doesn't have restrictions but all resources are
> distributed on the same hierarchy, so the controllers are supposed to
> follow the same conventions that can be implemented by all controllers.
>
Got it, I guess that is the behavior required by the unified hierarchy.
Cgroup v1 would be ok? But I am guessing we are not extending on v1?

> > BTW, since the IOASIDs are used to tag DMA and bound with guest
> > process(mm) for shared virtual addressing. fork() cannot be supported,
> > so I guess clone is not a solution here.
>
> Can you please elaborate what wouldn't work? The new spawning into a new
> cgroup w/ clone doesn't really change the usage model. It's just a neater
> way to seed a new cgroup. If you're saying that the overall usage model
> doesn't fit the needs of IOASIDs, it likely shouldn't be a cgroup
> controller.
>
The IOASIDs are programmed into devices to generate DMA requests tagged
with them. The IOMMU has a per device IOASID table with each entry has two
pointers:
- the PGD of the guest process.
- the PGD of the host process

The result of this 2 stage/nested translation is that we can share virtual
address (SVA) between guest process and DMA. The host process needs to
allocate multiple IOASIDs since one IOASID is needed for each guest process
who wants SVA.

The DMA binding among device-IOMMU-process is setup via a series of user
APIs (e.g. via VFIO).

If a process calls fork(), the children does not inherit the IOASIDs and
their bindings. Children who wish to use SVA has to call those APIs to
establish the binding for themselves.

Therefore, if a host process allocates 10 IOASIDs then does a
fork()/clone(), it cannot charge 10 IOASIDs in the new cgroup. i.e. the 10
IOASIDs stays with the process wherever it goes.

I feel this fit in the domain model, true?

> Thanks.
>


Thanks,

Jacob

2021-03-15 22:21:03

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC v2 2/2] cgroup: sev: Miscellaneous cgroup documentation.

Hello,

On Mon, Mar 15, 2021 at 03:11:55PM -0700, Jacob Pan wrote:
> > Migration itself doesn't have restrictions but all resources are
> > distributed on the same hierarchy, so the controllers are supposed to
> > follow the same conventions that can be implemented by all controllers.
> >
> Got it, I guess that is the behavior required by the unified hierarchy.
> Cgroup v1 would be ok? But I am guessing we are not extending on v1?

A new cgroup1 only controller is unlikely to be accpeted.

> The IOASIDs are programmed into devices to generate DMA requests tagged
> with them. The IOMMU has a per device IOASID table with each entry has two
> pointers:
> - the PGD of the guest process.
> - the PGD of the host process
>
> The result of this 2 stage/nested translation is that we can share virtual
> address (SVA) between guest process and DMA. The host process needs to
> allocate multiple IOASIDs since one IOASID is needed for each guest process
> who wants SVA.
>
> The DMA binding among device-IOMMU-process is setup via a series of user
> APIs (e.g. via VFIO).
>
> If a process calls fork(), the children does not inherit the IOASIDs and
> their bindings. Children who wish to use SVA has to call those APIs to
> establish the binding for themselves.
>
> Therefore, if a host process allocates 10 IOASIDs then does a
> fork()/clone(), it cannot charge 10 IOASIDs in the new cgroup. i.e. the 10
> IOASIDs stays with the process wherever it goes.
>
> I feel this fit in the domain model, true?

I still don't get where migration is coming into the picture. Who's
migrating where?

Thanks.

--
tejun

2021-03-16 06:26:53

by Jacob Pan

[permalink] [raw]
Subject: Re: [RFC v2 2/2] cgroup: sev: Miscellaneous cgroup documentation.

Hi Tejun,

On Mon, 15 Mar 2021 18:19:35 -0400, Tejun Heo <[email protected]> wrote:

> Hello,
>
> On Mon, Mar 15, 2021 at 03:11:55PM -0700, Jacob Pan wrote:
> > > Migration itself doesn't have restrictions but all resources are
> > > distributed on the same hierarchy, so the controllers are supposed to
> > > follow the same conventions that can be implemented by all
> > > controllers.
> > Got it, I guess that is the behavior required by the unified hierarchy.
> > Cgroup v1 would be ok? But I am guessing we are not extending on v1?
>
> A new cgroup1 only controller is unlikely to be accpeted.
>
> > The IOASIDs are programmed into devices to generate DMA requests tagged
> > with them. The IOMMU has a per device IOASID table with each entry has
> > two pointers:
> > - the PGD of the guest process.
> > - the PGD of the host process
> >
> > The result of this 2 stage/nested translation is that we can share
> > virtual address (SVA) between guest process and DMA. The host process
> > needs to allocate multiple IOASIDs since one IOASID is needed for each
> > guest process who wants SVA.
> >
> > The DMA binding among device-IOMMU-process is setup via a series of user
> > APIs (e.g. via VFIO).
> >
> > If a process calls fork(), the children does not inherit the IOASIDs and
> > their bindings. Children who wish to use SVA has to call those APIs to
> > establish the binding for themselves.
> >
> > Therefore, if a host process allocates 10 IOASIDs then does a
> > fork()/clone(), it cannot charge 10 IOASIDs in the new cgroup. i.e. the
> > 10 IOASIDs stays with the process wherever it goes.
> >
> > I feel this fit in the domain model, true?
>
> I still don't get where migration is coming into the picture. Who's
> migrating where?
>
Sorry, perhaps I can explain by an example.

There are two cgroups: cg_A and cg_B with limit set to 20 for both. Process1
is in cg_A. The initial state is:
cg_A/ioasid.current=0, cg_A/ioasid.max=20
cg_B/ioasid.current=0, cg_B/ioasid.max=20

Now, consider the following steps:

1. Process1 allocated 10 IOASIDs,
cg_A/ioasid.current=10,
cg_B/ioasid.current=0

2. then we want to move/migrate Process1 to cg_B. so we need uncharge 10 of
cg_A, charge 10 of cg_B

3. After the migration, I expect
cg_A/ioasid.current=0,
cg_B/ioasid.current=10

We don't enforce the limit during this organizational change since we can't
force free IOASIDs. But any new allocations will be subject to the limit
set in ioasid.max.

> Thanks.
>


Thanks,

Jacob

2021-03-16 07:07:00

by Jacob Pan

[permalink] [raw]
Subject: Re: [RFC v2 2/2] cgroup: sev: Miscellaneous cgroup documentation.

Hi Tejun,

On Mon, 15 Mar 2021 19:54:36 -0400, Tejun Heo <[email protected]> wrote:

> Hello,
>
> On Mon, Mar 15, 2021 at 04:40:12PM -0700, Jacob Pan wrote:
> > 2. then we want to move/migrate Process1 to cg_B. so we need uncharge
> > 10 of cg_A, charge 10 of cg_B
>
> So, what I don't get is why this migration is necessary. This isn't
> supported as a usage pattern and no one, at least in terms of wide-spread
> usage, does this. Why is this a requirement for your use case?
>
I don't know if this is required. I thought utilities such as cgclassify
need to be supported.
" cgclassify - move running task(s) to given cgroups "
If no such use case, I am fine with dropping the migration support. Just
enforce limit on allocations.

> Thanks.
>


Thanks,

Jacob

2021-03-16 11:11:59

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC v2 2/2] cgroup: sev: Miscellaneous cgroup documentation.

Hello,

On Mon, Mar 15, 2021 at 04:40:12PM -0700, Jacob Pan wrote:
> 2. then we want to move/migrate Process1 to cg_B. so we need uncharge 10 of
> cg_A, charge 10 of cg_B

So, what I don't get is why this migration is necessary. This isn't
supported as a usage pattern and no one, at least in terms of wide-spread
usage, does this. Why is this a requirement for your use case?

Thanks.

--
tejun

2021-03-16 11:20:15

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC v2 2/2] cgroup: sev: Miscellaneous cgroup documentation.

On Mon, Mar 15, 2021 at 06:30:30PM -0700, Jacob Pan wrote:
> I don't know if this is required. I thought utilities such as cgclassify
> need to be supported.
> " cgclassify - move running task(s) to given cgroups "
> If no such use case, I am fine with dropping the migration support. Just
> enforce limit on allocations.

Yeah, that's what all other controllers do. Please read the in-tree cgroup2
doc.

Thanks.

--
tejun

2021-03-16 21:21:58

by Jacob Pan

[permalink] [raw]
Subject: Re: [RFC v2 2/2] cgroup: sev: Miscellaneous cgroup documentation.

Hi Tejun,

On Mon, 15 Mar 2021 22:22:12 -0400, Tejun Heo <[email protected]> wrote:

> On Mon, Mar 15, 2021 at 06:30:30PM -0700, Jacob Pan wrote:
> > I don't know if this is required. I thought utilities such as cgclassify
> > need to be supported.
> > " cgclassify - move running task(s) to given cgroups "
> > If no such use case, I am fine with dropping the migration support. Just
> > enforce limit on allocations.
>
> Yeah, that's what all other controllers do. Please read the in-tree
> cgroup2 doc.
>
Thanks for your patience and guidance, will try to merge with misc
controller and go from there.

Thanks,

Jacob