Rstat currently only supports the default hierarchy in cgroup2. In
order to replace memcg's private stats infrastructure - used in both
cgroup1 and cgroup2 - with rstat, the latter needs to support cgroup1.
The initialization and destruction callbacks for regular cgroups are
already in place. Remove the cgroup_on_dfl() guards to handle cgroup1.
The initialization of the root cgroup is currently hardcoded to only
handle cgrp_dfl_root.cgrp. Move those callbacks to cgroup_setup_root()
and cgroup_destroy_root() to handle the default root as well as the
various cgroup1 roots we may set up during mounting.
The linking of css to cgroups happens in code shared between cgroup1
and cgroup2 as well. Simply remove the cgroup_on_dfl() guard.
Linkage of the root css to the root cgroup is a bit trickier: by
default, the root css of a subsystem controller belongs to the default
hierarchy (i.e. the cgroup2 root). When a controller is mounted in its
cgroup1 version, the root css is stolen and moved to the cgroup1 root;
on unmount, the css moves back to the default hierarchy. Annotate
rebind_subsystems() to move the root css linkage along between roots.
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Tejun Heo <[email protected]>
---
kernel/cgroup/cgroup.c | 34 +++++++++++++++++++++-------------
kernel/cgroup/rstat.c | 2 --
2 files changed, 21 insertions(+), 15 deletions(-)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 9153b20e5cc6..e049edd66776 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1339,6 +1339,7 @@ static void cgroup_destroy_root(struct cgroup_root *root)
mutex_unlock(&cgroup_mutex);
+ cgroup_rstat_exit(cgrp);
kernfs_destroy_root(root->kf_root);
cgroup_free_root(root);
}
@@ -1751,6 +1752,12 @@ int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask)
&dcgrp->e_csets[ss->id]);
spin_unlock_irq(&css_set_lock);
+ if (ss->css_rstat_flush) {
+ list_del_rcu(&css->rstat_css_node);
+ list_add_rcu(&css->rstat_css_node,
+ &dcgrp->rstat_css_list);
+ }
+
/* default hierarchy doesn't enable controllers by default */
dst_root->subsys_mask |= 1 << ssid;
if (dst_root == &cgrp_dfl_root) {
@@ -1971,10 +1978,14 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask)
if (ret)
goto destroy_root;
- ret = rebind_subsystems(root, ss_mask);
+ ret = cgroup_rstat_init(root_cgrp);
if (ret)
goto destroy_root;
+ ret = rebind_subsystems(root, ss_mask);
+ if (ret)
+ goto exit_stats;
+
ret = cgroup_bpf_inherit(root_cgrp);
WARN_ON_ONCE(ret);
@@ -2006,6 +2017,8 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask)
ret = 0;
goto out;
+exit_stats:
+ cgroup_rstat_exit(root_cgrp);
destroy_root:
kernfs_destroy_root(root->kf_root);
root->kf_root = NULL;
@@ -4934,8 +4947,7 @@ static void css_free_rwork_fn(struct work_struct *work)
cgroup_put(cgroup_parent(cgrp));
kernfs_put(cgrp->kn);
psi_cgroup_free(cgrp);
- if (cgroup_on_dfl(cgrp))
- cgroup_rstat_exit(cgrp);
+ cgroup_rstat_exit(cgrp);
kfree(cgrp);
} else {
/*
@@ -4976,8 +4988,7 @@ static void css_release_work_fn(struct work_struct *work)
/* cgroup release path */
TRACE_CGROUP_PATH(release, cgrp);
- if (cgroup_on_dfl(cgrp))
- cgroup_rstat_flush(cgrp);
+ cgroup_rstat_flush(cgrp);
spin_lock_irq(&css_set_lock);
for (tcgrp = cgroup_parent(cgrp); tcgrp;
@@ -5034,7 +5045,7 @@ static void init_and_link_css(struct cgroup_subsys_state *css,
css_get(css->parent);
}
- if (cgroup_on_dfl(cgrp) && ss->css_rstat_flush)
+ if (ss->css_rstat_flush)
list_add_rcu(&css->rstat_css_node, &cgrp->rstat_css_list);
BUG_ON(cgroup_css(cgrp, ss));
@@ -5159,11 +5170,9 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
if (ret)
goto out_free_cgrp;
- if (cgroup_on_dfl(parent)) {
- ret = cgroup_rstat_init(cgrp);
- if (ret)
- goto out_cancel_ref;
- }
+ ret = cgroup_rstat_init(cgrp);
+ if (ret)
+ goto out_cancel_ref;
/* create the directory */
kn = kernfs_create_dir(parent->kn, name, mode, cgrp);
@@ -5250,8 +5259,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
out_kernfs_remove:
kernfs_remove(cgrp->kn);
out_stat_exit:
- if (cgroup_on_dfl(parent))
- cgroup_rstat_exit(cgrp);
+ cgroup_rstat_exit(cgrp);
out_cancel_ref:
percpu_ref_exit(&cgrp->self.refcnt);
out_free_cgrp:
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index d51175cedfca..faa767a870ba 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -285,8 +285,6 @@ void __init cgroup_rstat_boot(void)
for_each_possible_cpu(cpu)
raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu));
-
- BUG_ON(cgroup_rstat_init(&cgrp_dfl_root.cgrp));
}
/*
--
2.30.0
On Wed, Feb 17, 2021 at 06:42:32PM +0100, Michal Koutný wrote:
> Hello.
>
> On Tue, Feb 09, 2021 at 11:33:00AM -0500, Johannes Weiner <[email protected]> wrote:
> > @@ -1971,10 +1978,14 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask)
> > if (ret)
> > goto destroy_root;
> >
> > - ret = rebind_subsystems(root, ss_mask);
> > + ret = cgroup_rstat_init(root_cgrp);
> Would it make sense to do cgroup_rstat_init() only if there's a subsys
> in ss_mask that makes use of rstat?
> (On legacy systems there could be individual hierarchy for each
> controller so the rstat space can be saved.)
It's possible, but I don't think worth the trouble.
It would have to be done from rebind_subsystems(), as remount can add
more subsystems to an existing cgroup1 root. That in turn means we'd
have to have separate init paths for cgroup1 and cgroup2.
While we split cgroup1 and cgroup2 paths where necessary in the code,
it's a significant maintenance burden and a not unlikely source of
subtle errors (see the recent 'fix swap undercounting in cgroup2').
In this case, we're talking about a relatively small data structure
and the overhead is per mountpoint. Comparatively, we're allocating
the full vmstats structures for cgroup1 groups which barely use them,
and cgroup1 softlimit tree structures for each cgroup2 group.
So I don't think it's a good tradeoff. Subtle bugs that require kernel
patches are more disruptive to the user experience than the amount of
memory in question here.
> > @@ -285,8 +285,6 @@ void __init cgroup_rstat_boot(void)
> >
> > for_each_possible_cpu(cpu)
> > raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu));
> > -
> > - BUG_ON(cgroup_rstat_init(&cgrp_dfl_root.cgrp));
> > }
> Regardless of the suggestion above, this removal obsoletes the comment
> cgroup_rstat_init:
>
> int cpu;
>
> - /* the root cgrp has rstat_cpu preallocated */
> if (!cgrp->rstat_cpu) {
> cgrp->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu);
Oh, I'm not removing the init call, I'm merely moving it from
cgroup_rstat_boot() to cgroup_setup_root().
The default root group has statically preallocated percpu data before
and after this patch. See cgroup.c:
static DEFINE_PER_CPU(struct cgroup_rstat_cpu, cgrp_dfl_root_rstat_cpu);
/* the default hierarchy */
struct cgroup_root cgrp_dfl_root = { .cgrp.rstat_cpu = &cgrp_dfl_root_rstat_cpu };
EXPORT_SYMBOL_GPL(cgrp_dfl_root);
Hello.
On Tue, Feb 09, 2021 at 11:33:00AM -0500, Johannes Weiner <[email protected]> wrote:
> @@ -1971,10 +1978,14 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask)
> if (ret)
> goto destroy_root;
>
> - ret = rebind_subsystems(root, ss_mask);
> + ret = cgroup_rstat_init(root_cgrp);
Would it make sense to do cgroup_rstat_init() only if there's a subsys
in ss_mask that makes use of rstat?
(On legacy systems there could be individual hierarchy for each
controller so the rstat space can be saved.)
> @@ -5159,11 +5170,9 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
> if (ret)
> goto out_free_cgrp;
>
> - if (cgroup_on_dfl(parent)) {
> - ret = cgroup_rstat_init(cgrp);
> - if (ret)
> - goto out_cancel_ref;
> - }
> + ret = cgroup_rstat_init(cgrp);
And here do cgroup_rstat_init() only when parent has it.
> @@ -285,8 +285,6 @@ void __init cgroup_rstat_boot(void)
>
> for_each_possible_cpu(cpu)
> raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu));
> -
> - BUG_ON(cgroup_rstat_init(&cgrp_dfl_root.cgrp));
> }
Regardless of the suggestion above, this removal obsoletes the comment
cgroup_rstat_init:
int cpu;
- /* the root cgrp has rstat_cpu preallocated */
if (!cgrp->rstat_cpu) {
cgrp->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu);
Regards,
Michal
On Wed, Feb 17, 2021 at 03:52:59PM -0500, Johannes Weiner <[email protected]> wrote:
> It's possible, but I don't think worth the trouble.
You're right. I gave it a deeper look and what would be saved on data,
would be paid in code complexity.
> In this case, we're talking about a relatively small data structure
> and the overhead is per mountpoint.
IIUC, it is per each mountpoint's number of cgroups. But I still accept
the argument above. Furthermore, this can be changed later.
> The default root group has statically preallocated percpu data before
> and after this patch. See cgroup.c:
I stand corrected, the comment is still valid.
Therefore,
Reviewed-by: Michal Koutný <[email protected]>
On Thu, Feb 18, 2021 at 04:45:11PM +0100, Michal Koutný wrote:
> On Wed, Feb 17, 2021 at 03:52:59PM -0500, Johannes Weiner <[email protected]> wrote:
> > In this case, we're talking about a relatively small data structure
> > and the overhead is per mountpoint.
> IIUC, it is per each mountpoint's number of cgroups. But I still accept
> the argument above. Furthermore, this can be changed later.
Oops, you're right of course.
> > The default root group has statically preallocated percpu data before
> > and after this patch. See cgroup.c:
> I stand corrected, the comment is still valid.
>
> Therefore,
> Reviewed-by: Michal Koutný <[email protected]>
Thanks for your reviews, Michal!