2022-07-15 04:41:50

by Tejun Heo

Subject: [PATCH 3/3 cgroup/for-5.20] cgroup: Make !percpu threadgroup_rwsem operations optional

3942a9bd7b58 ("locking, rcu, cgroup: Avoid synchronize_sched() in
__cgroup_procs_write()") disabled percpu operations on threadgroup_rwsem
because the implied synchronize_rcu() on write locking was pushing up the
latencies too much for Android, which constantly moves processes between
cgroups.

This makes the hotter paths - fork and exit - slower as they're always
forced into the slow path. There is no reason to force this on everyone,
especially given that the more common static usage pattern can now completely
avoid write-locking the rwsem. Write-locking is elided when turning on and
off controllers on empty sub-trees and CLONE_INTO_CGROUP enables seeding a
cgroup without grabbing the rwsem.
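
For reference, the static pattern from userspace looks roughly like the
following - a minimal sketch only, not part of this patch; the cgroup path
and helper name are made up for illustration:

  #include <fcntl.h>
  #include <signal.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/sched.h>	/* struct clone_args, CLONE_INTO_CGROUP */

  static pid_t fork_into_cgroup(const char *cgrp_path)
  {
	int cgrp_fd = open(cgrp_path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
	struct clone_args args;

	if (cgrp_fd < 0)
		return -1;

	/* seed the target cgroup directly, no migration afterwards */
	args = (struct clone_args){
		.flags		= CLONE_INTO_CGROUP,
		.exit_signal	= SIGCHLD,
		.cgroup		= cgrp_fd,
	};

	/* raw clone3(); glibc does not provide a wrapper */
	return syscall(SYS_clone3, &args, sizeof(args));
  }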

Restore the default percpu operations and introduce the mount option
"favordynmods" and config option CGROUP_FAVOR_DYNMODS for users who need
lower latencies for the dynamic operations.
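
Usage sketch (illustrative only - the mount point is just an example): the
option is per hierarchy and can also be toggled with a remount, e.g. via
mount(2):

  #include <stdio.h>
  #include <sys/mount.h>

  int main(void)
  {
	/* fresh cgroup2 mount favoring dynamic modifications,
	 * equivalent to "mount -t cgroup2 -o favordynmods ..." */
	if (mount("none", "/sys/fs/cgroup", "cgroup2", 0, "favordynmods"))
		perror("mount");

	/* later, switch back to favoring the fork/exit hot paths */
	if (mount(NULL, "/sys/fs/cgroup", NULL, MS_REMOUNT, "nofavordynmods"))
		perror("remount");

	return 0;
  }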

Signed-off-by: Tejun Heo <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Michal Koutný <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Dmitry Shmidt <[email protected]>
Cc: Oleg Nesterov <[email protected]>
---
Documentation/admin-guide/cgroup-v2.rst | 8 +++++
kernel/cgroup/cgroup-v1.c | 17 +++++++++++-
kernel/cgroup/cgroup.c | 43 ++++++++++++++++++++++++++------
3 files changed, 60 insertions(+), 8 deletions(-)

--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -184,6 +184,14 @@ cgroup v2 currently supports the followi
ignored on non-init namespace mounts. Please refer to the
Delegation section for details.

+ [no]favordynmods
+ Reduce the latencies of dynamic cgroup modifications such as
+ task migrations and controller on/offs at the cost of making
+ hot path operations such as forks and exits more expensive.
+ The static usage pattern of creating a cgroup, enabling
+ controllers, and then seeding it with CLONE_INTO_CGROUP is
+ not affected by this option.
+
memory_[no]localevents
Only populate memory.events with data for the current cgroup,
and not any subtrees. This is legacy behaviour, the default
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -875,6 +875,8 @@ static int cgroup1_show_options(struct s
seq_puts(seq, ",xattr");
if (root->flags & CGRP_ROOT_CPUSET_V2_MODE)
seq_puts(seq, ",cpuset_v2_mode");
+ if (root->flags & CGRP_ROOT_FAVOR_DYNMODS)
+ seq_puts(seq, ",favordynmods");

spin_lock(&release_agent_path_lock);
if (strlen(root->release_agent_path))
@@ -898,6 +900,8 @@ enum cgroup1_param {
Opt_noprefix,
Opt_release_agent,
Opt_xattr,
+ Opt_favordynmods,
+ Opt_nofavordynmods,
};

const struct fs_parameter_spec cgroup1_fs_parameters[] = {
@@ -909,6 +913,8 @@ const struct fs_parameter_spec cgroup1_f
fsparam_flag ("noprefix", Opt_noprefix),
fsparam_string("release_agent", Opt_release_agent),
fsparam_flag ("xattr", Opt_xattr),
+ fsparam_flag ("favordynmods", Opt_favordynmods),
+ fsparam_flag ("nofavordynmods", Opt_nofavordynmods),
{}
};

@@ -960,6 +966,12 @@ int cgroup1_parse_param(struct fs_contex
case Opt_xattr:
ctx->flags |= CGRP_ROOT_XATTR;
break;
+ case Opt_favordynmods:
+ ctx->flags |= CGRP_ROOT_FAVOR_DYNMODS;
+ break;
+ case Opt_nofavordynmods:
+ ctx->flags &= ~CGRP_ROOT_FAVOR_DYNMODS;
+ break;
case Opt_release_agent:
/* Specifying two release agents is forbidden */
if (ctx->release_agent)
@@ -1211,8 +1223,11 @@ static int cgroup1_root_to_use(struct fs
init_cgroup_root(ctx);

ret = cgroup_setup_root(root, ctx->subsys_mask);
- if (ret)
+ if (!ret)
+ cgroup_favor_dynmods(root, ctx->flags & CGRP_ROOT_FAVOR_DYNMODS);
+ else
cgroup_free_root(root);
+
return ret;
}

--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1305,6 +1305,20 @@ struct cgroup_root *cgroup_root_from_kf(
return root_cgrp->root;
}

+void cgroup_favor_dynmods(struct cgroup_root *root, bool favor)
+{
+ bool favoring = root->flags & CGRP_ROOT_FAVOR_DYNMODS;
+
+ /* see the comment above CGRP_ROOT_FAVOR_DYNMODS definition */
+ if (favor && !favoring) {
+ rcu_sync_enter(&cgroup_threadgroup_rwsem.rss);
+ root->flags |= CGRP_ROOT_FAVOR_DYNMODS;
+ } else if (!favor && favoring) {
+ rcu_sync_exit(&cgroup_threadgroup_rwsem.rss);
+ root->flags &= ~CGRP_ROOT_FAVOR_DYNMODS;
+ }
+}
+
static int cgroup_init_root_id(struct cgroup_root *root)
{
int id;
@@ -1365,6 +1379,7 @@ static void cgroup_destroy_root(struct c
cgroup_root_count--;
}

+ cgroup_favor_dynmods(root, false);
cgroup_exit_root_id(root);

mutex_unlock(&cgroup_mutex);
@@ -1858,6 +1873,7 @@ int cgroup_show_path(struct seq_file *sf

enum cgroup2_param {
Opt_nsdelegate, Opt_nonsdelegate,
+ Opt_favordynmods, Opt_nofavordynmods,
Opt_memory_localevents, Opt_memory_nolocalevents,
Opt_memory_recursiveprot, Opt_memory_norecursiveprot,
nr__cgroup2_params
@@ -1866,6 +1882,8 @@ enum cgroup2_param {
static const struct fs_parameter_spec cgroup2_fs_parameters[] = {
fsparam_flag("nsdelegate", Opt_nsdelegate),
fsparam_flag("nonsdelegate", Opt_nonsdelegate),
+ fsparam_flag("favordynmods", Opt_favordynmods),
+ fsparam_flag("nofavordynmods", Opt_nofavordynmods),
fsparam_flag("memory_localevents", Opt_memory_localevents),
fsparam_flag("memory_nolocalevents", Opt_memory_nolocalevents),
fsparam_flag("memory_recursiveprot", Opt_memory_recursiveprot),
@@ -1890,6 +1908,12 @@ static int cgroup2_parse_param(struct fs
case Opt_nonsdelegate:
ctx->flags &= ~CGRP_ROOT_NS_DELEGATE;
return 0;
+ case Opt_favordynmods:
+ ctx->flags |= CGRP_ROOT_FAVOR_DYNMODS;
+ return 0;
+ case Opt_nofavordynmods:
+ ctx->flags &= ~CGRP_ROOT_FAVOR_DYNMODS;
+ return 0;
case Opt_memory_localevents:
ctx->flags |= CGRP_ROOT_MEMORY_LOCAL_EVENTS;
return 0;
@@ -1914,6 +1938,9 @@ static void apply_cgroup_root_flags(unsi
else
cgrp_dfl_root.flags &= ~CGRP_ROOT_NS_DELEGATE;

+ cgroup_favor_dynmods(&cgrp_dfl_root,
+ root_flags & CGRP_ROOT_FAVOR_DYNMODS);
+
if (root_flags & CGRP_ROOT_MEMORY_LOCAL_EVENTS)
cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_LOCAL_EVENTS;
else
@@ -1930,6 +1957,8 @@ static int cgroup_show_options(struct se
{
if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
seq_puts(seq, ",nsdelegate");
+ if (cgrp_dfl_root.flags & CGRP_ROOT_FAVOR_DYNMODS)
+ seq_puts(seq, ",favordynmods");
if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_LOCAL_EVENTS)
seq_puts(seq, ",memory_localevents");
if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT)
@@ -1980,7 +2009,8 @@ void init_cgroup_root(struct cgroup_fs_c
cgrp->root = root;
init_cgroup_housekeeping(cgrp);

- root->flags = ctx->flags;
+ /* DYNMODS must be modified through cgroup_favor_dynmods() */
+ root->flags = ctx->flags & ~CGRP_ROOT_FAVOR_DYNMODS;
if (ctx->release_agent)
strscpy(root->release_agent_path, ctx->release_agent, PATH_MAX);
if (ctx->name)
@@ -2202,6 +2232,10 @@ static int cgroup_init_fs_context(struct
put_user_ns(fc->user_ns);
fc->user_ns = get_user_ns(ctx->ns->user_ns);
fc->global = true;
+
+#ifdef CONFIG_CGROUP_FAVOR_DYNMODS
+ ctx->flags |= CGRP_ROOT_FAVOR_DYNMODS;
+#endif
return 0;
}

@@ -5854,12 +5888,6 @@ int __init cgroup_init(void)

cgroup_rstat_boot();

- /*
- * The latency of the synchronize_rcu() is too high for cgroups,
- * avoid it at the cost of forcing all readers into the slow path.
- */
- rcu_sync_enter_start(&cgroup_threadgroup_rwsem.rss);
-
get_user_ns(init_cgroup_ns.user_ns);

mutex_lock(&cgroup_mutex);
@@ -6771,6 +6799,7 @@ static ssize_t features_show(struct kobj
{
return snprintf(buf, PAGE_SIZE,
"nsdelegate\n"
+ "favordynmods\n"
"memory_localevents\n"
"memory_recursiveprot\n");
}


2022-07-23 05:17:58

by Tejun Heo

Subject: Re: [PATCH 3/3 cgroup/for-5.20] cgroup: Make !percpu threadgroup_rwsem operations optional

Applying 1-3 to cgroup/for-5.20.

Thanks.

--
tejun

2022-07-23 15:13:17

by Tejun Heo

Subject: [PATCH RESEND 3/3 cgroup/for-5.20] cgroup: Make !percpu threadgroup_rwsem operations optional

3942a9bd7b58 ("locking, rcu, cgroup: Avoid synchronize_sched() in
__cgroup_procs_write()") disabled percpu operations on threadgroup_rwsem
because the implied synchronize_rcu() on write locking was pushing up the
latencies too much for Android, which constantly moves processes between
cgroups.

This makes the hotter paths - fork and exit - slower as they're always
forced into the slow path. There is no reason to force this on everyone,
especially given that the more common static usage pattern can now completely
avoid write-locking the rwsem. Write-locking is elided when turning on and
off controllers on empty sub-trees and CLONE_INTO_CGROUP enables seeding a
cgroup without grabbing the rwsem.

Restore the default percpu operations and introduce the mount option
"favordynmods" and config option CGROUP_FAVOR_DYNMODS for users who need
lower latencies for the dynamic operations.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Michal Koutný <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Dmitry Shmidt <[email protected]>
Cc: Oleg Nesterov <[email protected]>
---
I messed up patch generation and the patch was missing diffs for a few
files.

Thanks.

Documentation/admin-guide/cgroup-v2.rst | 8 +++++
include/linux/cgroup-defs.h | 19 +++++++++++---
init/Kconfig | 10 +++++++
kernel/cgroup/cgroup-internal.h | 1
kernel/cgroup/cgroup-v1.c | 17 +++++++++++-
kernel/cgroup/cgroup.c | 43 ++++++++++++++++++++++++++------
6 files changed, 87 insertions(+), 11 deletions(-)

--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -184,6 +184,14 @@ cgroup v2 currently supports the followi
ignored on non-init namespace mounts. Please refer to the
Delegation section for details.

+ [no]favordynmods
+ Reduce the latencies of dynamic cgroup modifications such as
+ task migrations and controller on/offs at the cost of making
+ hot path operations such as forks and exits more expensive.
+ The static usage pattern of creating a cgroup, enabling
+ controllers, and then seeding it with CLONE_INTO_CGROUP is
+ not affected by this option.
+
memory_[no]localevents
Only populate memory.events with data for the current cgroup,
and not any subtrees. This is legacy behaviour, the default
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -89,19 +89,32 @@ enum {
CGRP_ROOT_NS_DELEGATE = (1 << 3),

/*
+ * Reduce latencies on dynamic cgroup modifications such as task
+ * migrations and controller on/offs by disabling percpu operation on
+ * cgroup_threadgroup_rwsem. This makes hot path operations such as
+ * forks and exits into the slow path and more expensive.
+ *
+ * The static usage pattern of creating a cgroup, enabling controllers,
+ * and then seeding it with CLONE_INTO_CGROUP doesn't require write
+ * locking cgroup_threadgroup_rwsem and thus doesn't benefit from
+ * favordynmod.
+ */
+ CGRP_ROOT_FAVOR_DYNMODS = (1 << 4),
+
+ /*
* Enable cpuset controller in v1 cgroup to use v2 behavior.
*/
- CGRP_ROOT_CPUSET_V2_MODE = (1 << 4),
+ CGRP_ROOT_CPUSET_V2_MODE = (1 << 16),

/*
* Enable legacy local memory.events.
*/
- CGRP_ROOT_MEMORY_LOCAL_EVENTS = (1 << 5),
+ CGRP_ROOT_MEMORY_LOCAL_EVENTS = (1 << 17),

/*
* Enable recursive subtree protection
*/
- CGRP_ROOT_MEMORY_RECURSIVE_PROT = (1 << 6),
+ CGRP_ROOT_MEMORY_RECURSIVE_PROT = (1 << 18),
};

/* cftype->flags */
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -936,6 +936,16 @@ if CGROUPS
config PAGE_COUNTER
bool

+config CGROUP_FAVOR_DYNMODS
+ bool "Favor dynamic modification latency reduction by default"
+ help
+ This option enables the "favordynmods" mount option by default
+ which reduces the latencies of dynamic cgroup modifications such
+ as task migrations and controller on/offs at the cost of making
+ hot path operations such as forks and exits more expensive.
+
+ Say N if unsure.
+
config MEMCG
bool "Memory controller"
select PAGE_COUNTER
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -233,6 +233,7 @@ void cgroup_kn_unlock(struct kernfs_node
int cgroup_path_ns_locked(struct cgroup *cgrp, char *buf, size_t buflen,
struct cgroup_namespace *ns);

+void cgroup_favor_dynmods(struct cgroup_root *root, bool favor);
void cgroup_free_root(struct cgroup_root *root);
void init_cgroup_root(struct cgroup_fs_context *ctx);
int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask);
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -875,6 +875,8 @@ static int cgroup1_show_options(struct s
seq_puts(seq, ",xattr");
if (root->flags & CGRP_ROOT_CPUSET_V2_MODE)
seq_puts(seq, ",cpuset_v2_mode");
+ if (root->flags & CGRP_ROOT_FAVOR_DYNMODS)
+ seq_puts(seq, ",favordynmods");

spin_lock(&release_agent_path_lock);
if (strlen(root->release_agent_path))
@@ -898,6 +900,8 @@ enum cgroup1_param {
Opt_noprefix,
Opt_release_agent,
Opt_xattr,
+ Opt_favordynmods,
+ Opt_nofavordynmods,
};

const struct fs_parameter_spec cgroup1_fs_parameters[] = {
@@ -909,6 +913,8 @@ const struct fs_parameter_spec cgroup1_f
fsparam_flag ("noprefix", Opt_noprefix),
fsparam_string("release_agent", Opt_release_agent),
fsparam_flag ("xattr", Opt_xattr),
+ fsparam_flag ("favordynmods", Opt_favordynmods),
+ fsparam_flag ("nofavordynmods", Opt_nofavordynmods),
{}
};

@@ -960,6 +966,12 @@ int cgroup1_parse_param(struct fs_contex
case Opt_xattr:
ctx->flags |= CGRP_ROOT_XATTR;
break;
+ case Opt_favordynmods:
+ ctx->flags |= CGRP_ROOT_FAVOR_DYNMODS;
+ break;
+ case Opt_nofavordynmods:
+ ctx->flags &= ~CGRP_ROOT_FAVOR_DYNMODS;
+ break;
case Opt_release_agent:
/* Specifying two release agents is forbidden */
if (ctx->release_agent)
@@ -1211,8 +1223,11 @@ static int cgroup1_root_to_use(struct fs
init_cgroup_root(ctx);

ret = cgroup_setup_root(root, ctx->subsys_mask);
- if (ret)
+ if (!ret)
+ cgroup_favor_dynmods(root, ctx->flags & CGRP_ROOT_FAVOR_DYNMODS);
+ else
cgroup_free_root(root);
+
return ret;
}

--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1305,6 +1305,20 @@ struct cgroup_root *cgroup_root_from_kf(
return root_cgrp->root;
}

+void cgroup_favor_dynmods(struct cgroup_root *root, bool favor)
+{
+ bool favoring = root->flags & CGRP_ROOT_FAVOR_DYNMODS;
+
+ /* see the comment above CGRP_ROOT_FAVOR_DYNMODS definition */
+ if (favor && !favoring) {
+ rcu_sync_enter(&cgroup_threadgroup_rwsem.rss);
+ root->flags |= CGRP_ROOT_FAVOR_DYNMODS;
+ } else if (!favor && favoring) {
+ rcu_sync_exit(&cgroup_threadgroup_rwsem.rss);
+ root->flags &= ~CGRP_ROOT_FAVOR_DYNMODS;
+ }
+}
+
static int cgroup_init_root_id(struct cgroup_root *root)
{
int id;
@@ -1365,6 +1379,7 @@ static void cgroup_destroy_root(struct c
cgroup_root_count--;
}

+ cgroup_favor_dynmods(root, false);
cgroup_exit_root_id(root);

mutex_unlock(&cgroup_mutex);
@@ -1858,6 +1873,7 @@ int cgroup_show_path(struct seq_file *sf

enum cgroup2_param {
Opt_nsdelegate, Opt_nonsdelegate,
+ Opt_favordynmods, Opt_nofavordynmods,
Opt_memory_localevents, Opt_memory_nolocalevents,
Opt_memory_recursiveprot, Opt_memory_norecursiveprot,
nr__cgroup2_params
@@ -1866,6 +1882,8 @@ enum cgroup2_param {
static const struct fs_parameter_spec cgroup2_fs_parameters[] = {
fsparam_flag("nsdelegate", Opt_nsdelegate),
fsparam_flag("nonsdelegate", Opt_nonsdelegate),
+ fsparam_flag("favordynmods", Opt_favordynmods),
+ fsparam_flag("nofavordynmods", Opt_nofavordynmods),
fsparam_flag("memory_localevents", Opt_memory_localevents),
fsparam_flag("memory_nolocalevents", Opt_memory_nolocalevents),
fsparam_flag("memory_recursiveprot", Opt_memory_recursiveprot),
@@ -1890,6 +1908,12 @@ static int cgroup2_parse_param(struct fs
case Opt_nonsdelegate:
ctx->flags &= ~CGRP_ROOT_NS_DELEGATE;
return 0;
+ case Opt_favordynmods:
+ ctx->flags |= CGRP_ROOT_FAVOR_DYNMODS;
+ return 0;
+ case Opt_nofavordynmods:
+ ctx->flags &= ~CGRP_ROOT_FAVOR_DYNMODS;
+ return 0;
case Opt_memory_localevents:
ctx->flags |= CGRP_ROOT_MEMORY_LOCAL_EVENTS;
return 0;
@@ -1914,6 +1938,9 @@ static void apply_cgroup_root_flags(unsi
else
cgrp_dfl_root.flags &= ~CGRP_ROOT_NS_DELEGATE;

+ cgroup_favor_dynmods(&cgrp_dfl_root,
+ root_flags & CGRP_ROOT_FAVOR_DYNMODS);
+
if (root_flags & CGRP_ROOT_MEMORY_LOCAL_EVENTS)
cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_LOCAL_EVENTS;
else
@@ -1930,6 +1957,8 @@ static int cgroup_show_options(struct se
{
if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
seq_puts(seq, ",nsdelegate");
+ if (cgrp_dfl_root.flags & CGRP_ROOT_FAVOR_DYNMODS)
+ seq_puts(seq, ",favordynmods");
if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_LOCAL_EVENTS)
seq_puts(seq, ",memory_localevents");
if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT)
@@ -1980,7 +2009,8 @@ void init_cgroup_root(struct cgroup_fs_c
cgrp->root = root;
init_cgroup_housekeeping(cgrp);

- root->flags = ctx->flags;
+ /* DYNMODS must be modified through cgroup_favor_dynmods() */
+ root->flags = ctx->flags & ~CGRP_ROOT_FAVOR_DYNMODS;
if (ctx->release_agent)
strscpy(root->release_agent_path, ctx->release_agent, PATH_MAX);
if (ctx->name)
@@ -2202,6 +2232,10 @@ static int cgroup_init_fs_context(struct
put_user_ns(fc->user_ns);
fc->user_ns = get_user_ns(ctx->ns->user_ns);
fc->global = true;
+
+#ifdef CONFIG_CGROUP_FAVOR_DYNMODS
+ ctx->flags |= CGRP_ROOT_FAVOR_DYNMODS;
+#endif
return 0;
}

@@ -5854,12 +5888,6 @@ int __init cgroup_init(void)

cgroup_rstat_boot();

- /*
- * The latency of the synchronize_rcu() is too high for cgroups,
- * avoid it at the cost of forcing all readers into the slow path.
- */
- rcu_sync_enter_start(&cgroup_threadgroup_rwsem.rss);
-
get_user_ns(init_cgroup_ns.user_ns);

mutex_lock(&cgroup_mutex);
@@ -6771,6 +6799,7 @@ static ssize_t features_show(struct kobj
{
return snprintf(buf, PAGE_SIZE,
"nsdelegate\n"
+ "favordynmods\n"
"memory_localevents\n"
"memory_recursiveprot\n");
}

2022-07-25 12:17:11

by Oleg Nesterov

Subject: Re: [PATCH RESEND 3/3 cgroup/for-5.20] cgroup: Make !percpu threadgroup_rwsem operations optional

On 07/23, Tejun Heo wrote:
>
> +void cgroup_favor_dynmods(struct cgroup_root *root, bool favor)
> +{
> + bool favoring = root->flags & CGRP_ROOT_FAVOR_DYNMODS;
> +
> + /* see the comment above CGRP_ROOT_FAVOR_DYNMODS definition */
> + if (favor && !favoring) {
> + rcu_sync_enter(&cgroup_threadgroup_rwsem.rss);
> + root->flags |= CGRP_ROOT_FAVOR_DYNMODS;
> + } else if (!favor && favoring) {
> + rcu_sync_exit(&cgroup_threadgroup_rwsem.rss);
> + root->flags &= ~CGRP_ROOT_FAVOR_DYNMODS;
> + }
> +}

I see no problems in this patch. But just for the record, we do not need
synchronize_rcu() in the "favor && !favoring" case, so we can probably
do something like

--- a/kernel/rcu/sync.c
+++ b/kernel/rcu/sync.c
@@ -118,7 +118,7 @@ static void rcu_sync_func(struct rcu_head *rhp)
* optimize away the grace-period wait via a state machine implemented
* by rcu_sync_enter(), rcu_sync_exit(), and rcu_sync_func().
*/
-void rcu_sync_enter(struct rcu_sync *rsp)
+void __rcu_sync_enter(struct rcu_sync *rsp, bool wait)
{
int gp_state;

@@ -146,13 +146,20 @@ void rcu_sync_enter(struct rcu_sync *rsp)
* See the comment above, this simply does the "synchronous"
* call_rcu(rcu_sync_func) which does GP_ENTER -> GP_PASSED.
*/
- synchronize_rcu();
- rcu_sync_func(&rsp->cb_head);
- /* Not really needed, wait_event() would see GP_PASSED. */
- return;
+ if (wait) {
+ synchronize_rcu();
+ rcu_sync_func(&rsp->cb_head);
+ } else {
+ rcu_sync_call(rsp);
+ }
+ } else if (wait) {
+ wait_event(rsp->gp_wait, READ_ONCE(rsp->gp_state) >= GP_PASSED);
}
+}

- wait_event(rsp->gp_wait, READ_ONCE(rsp->gp_state) >= GP_PASSED);
+void rcu_sync_enter(struct rcu_sync *rsp)
+{
+ __rcu_sync_enter(rsp, true);
}

/**

later.

__rcu_sync_enter(rsp, false) works just like rcu_sync_enter_start() but it can
be safely called at any moment.
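
E.g., a sketch only, assuming the __rcu_sync_enter() above were merged -
cgroup_favor_dynmods() could then enable the mode without waiting for a
grace period:

  void cgroup_favor_dynmods(struct cgroup_root *root, bool favor)
  {
	bool favoring = root->flags & CGRP_ROOT_FAVOR_DYNMODS;

	if (favor && !favoring) {
		/* no GP wait needed here; readers just move to the slow path */
		__rcu_sync_enter(&cgroup_threadgroup_rwsem.rss, false);
		root->flags |= CGRP_ROOT_FAVOR_DYNMODS;
	} else if (!favor && favoring) {
		rcu_sync_exit(&cgroup_threadgroup_rwsem.rss);
		root->flags &= ~CGRP_ROOT_FAVOR_DYNMODS;
	}
  }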

And can't resist, off-topic question... Say, cgroup_attach_task_all() does

mutex_lock(&cgroup_mutex);
percpu_down_write(&cgroup_threadgroup_rwsem);

and this means that synchronize_rcu() can be called with cgroup_mutex held.
Perhaps it makes sense to change this code to do

rcu_sync_enter(&cgroup_threadgroup_rwsem.rss);
mutex_lock(&cgroup_mutex);
percpu_down_write(&cgroup_threadgroup_rwsem);
...
percpu_up_write(&cgroup_threadgroup_rwsem);
mutex_unlock(&cgroup_mutex);
rcu_sync_exit(&cgroup_threadgroup_rwsem.rss);

? Just curious.

> - /*
> - * The latency of the synchronize_rcu() is too high for cgroups,
> - * avoid it at the cost of forcing all readers into the slow path.
> - */
> - rcu_sync_enter_start(&cgroup_threadgroup_rwsem.rss);

Note that it doesn't have any other users, so you can probably kill it.

Oleg.

2022-07-25 14:44:21

by Christian Brauner

Subject: Re: [PATCH RESEND 3/3 cgroup/for-5.20] cgroup: Make !percpu threadgroup_rwsem operations optional

On Sat, Jul 23, 2022 at 04:28:28AM -1000, Tejun Heo wrote:
> 3942a9bd7b58 ("locking, rcu, cgroup: Avoid synchronize_sched() in
> __cgroup_procs_write()") disabled percpu operations on threadgroup_rwsem
> because the implied synchronize_rcu() on write locking was pushing up the
> latencies too much for Android, which constantly moves processes between
> cgroups.
>
> This makes the hotter paths - fork and exit - slower as they're always
> forced into the slow path. There is no reason to force this on everyone,
> especially given that the more common static usage pattern can now completely
> avoid write-locking the rwsem. Write-locking is elided when turning on and
> off controllers on empty sub-trees and CLONE_INTO_CGROUP enables seeding a
> cgroup without grabbing the rwsem.
>
> Restore the default percpu operations and introduce the mount option
> "favordynmods" and config option CGROUP_FAVOR_DYNMODS for users who need
> lower latencies for the dynamic operations.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Christian Brauner <[email protected]>
> Cc: Michal Koutný <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: John Stultz <[email protected]>
> Cc: Dmitry Shmidt <[email protected]>
> Cc: Oleg Nesterov <[email protected]>
> ---

Seems sane,
Acked-by: Christian Brauner (Microsoft) <[email protected]>

2022-07-26 15:08:43

by Michal Koutný

Subject: Re: [PATCH RESEND 3/3 cgroup/for-5.20] cgroup: Make !percpu threadgroup_rwsem operations optional

On Sat, Jul 23, 2022 at 04:28:28AM -1000, Tejun Heo <[email protected]> wrote:
> This makes the hotter paths - fork and exit - slower as they're always
> forced into the slow path. There is no reason to force this on everyone,
> especially given that the more common static usage pattern can now completely
> avoid write-locking the rwsem. Write-locking is elided when turning on and
> off controllers on empty sub-trees and CLONE_INTO_CGROUP enables seeding a
> cgroup without grabbing the rwsem.

Just a practical note that CLONE_INTO_CGROUP may not be so widespread
yet [1][2].
But generally, the change makes sense to me.


> + CGRP_ROOT_FAVOR_DYNMODS = (1 << 4),
> +
> + /*
> * Enable cpuset controller in v1 cgroup to use v2 behavior.
> */
> - CGRP_ROOT_CPUSET_V2_MODE = (1 << 4),
> + CGRP_ROOT_CPUSET_V2_MODE = (1 << 16),
>
> /*
> * Enable legacy local memory.events.
> */
> - CGRP_ROOT_MEMORY_LOCAL_EVENTS = (1 << 5),
> + CGRP_ROOT_MEMORY_LOCAL_EVENTS = (1 << 17),
>
> /*
> * Enable recursive subtree protection
> */
> - CGRP_ROOT_MEMORY_RECURSIVE_PROT = (1 << 6),
> + CGRP_ROOT_MEMORY_RECURSIVE_PROT = (1 << 18),

Why this new gap in flag bits?

[1] https://github.com/systemd/systemd/pull/16706
[2] https://github.com/search?q=org%3Aopencontainers+CLONE_INTO_CGROUP&type=all (empty)



2022-07-26 17:40:14

by Tejun Heo

Subject: Re: [PATCH RESEND 3/3 cgroup/for-5.20] cgroup: Make !percpu threadgroup_rwsem operations optional

Hello,

On Tue, Jul 26, 2022 at 04:32:57PM +0200, Michal Koutný wrote:
> On Sat, Jul 23, 2022 at 04:28:28AM -1000, Tejun Heo <[email protected]> wrote:
> > This makes the hotter paths - fork and exit - slower as they're always
> > forced into the slow path. There is no reason to force this on everyone,
> > especially given that the more common static usage pattern can now completely
> > avoid write-locking the rwsem. Write-locking is elided when turning on and
> > off controllers on empty sub-trees and CLONE_INTO_CGROUP enables seeding a
> > cgroup without grabbing the rwsem.
>
> Just a practical note that CLONE_INTO_CGROUP may not be so widespread
> yet [1][2].
> But generally, the change makes sense to me.

Yeah, I was disappointed that it wasn't being used by systemd already. It'd
be great if the glibc situation could be rectified soon because this is a much
better interface.

> > + CGRP_ROOT_FAVOR_DYNMODS = (1 << 4),
> > +
> > + /*
> > * Enable cpuset controller in v1 cgroup to use v2 behavior.
> > */
> > - CGRP_ROOT_CPUSET_V2_MODE = (1 << 4),
> > + CGRP_ROOT_CPUSET_V2_MODE = (1 << 16),
> >
> > /*
> > * Enable legacy local memory.events.
> > */
> > - CGRP_ROOT_MEMORY_LOCAL_EVENTS = (1 << 5),
> > + CGRP_ROOT_MEMORY_LOCAL_EVENTS = (1 << 17),
> >
> > /*
> > * Enable recursive subtree protection
> > */
> > - CGRP_ROOT_MEMORY_RECURSIVE_PROT = (1 << 6),
> > + CGRP_ROOT_MEMORY_RECURSIVE_PROT = (1 << 18),
>
> Why this new gap in flag bits?

To distinguish core and per-controller flags.

Thanks.

--
tejun

2022-07-26 23:24:41

by Tejun Heo

Subject: Re: [PATCH RESEND 3/3 cgroup/for-5.20] cgroup: Make !percpu threadgroup_rwsem operations optional

Hello, Oleg.

On Mon, Jul 25, 2022 at 02:12:09PM +0200, Oleg Nesterov wrote:
> I see no problems in this patch. But just for the record, we do not need
> synchronize_rcu() in the "favor && !favoring" case, so we can probably
> do something like
>
> @@ -146,13 +146,20 @@ void rcu_sync_enter(struct rcu_sync *rsp)
> * See the comment above, this simply does the "synchronous"
> * call_rcu(rcu_sync_func) which does GP_ENTER -> GP_PASSED.
> */
> + if (wait) {
> + synchronize_rcu();
> + rcu_sync_func(&rsp->cb_head);
> + } else {
> + rcu_sync_call(rsp);
> + }
> + } else if (wait) {
> + wait_event(rsp->gp_wait, READ_ONCE(rsp->gp_state) >= GP_PASSED);
...
> later.
>
> __rcu_sync_enter(rsp, false) works just like rcu_sync_enter_start() but it can
> be safely called at any moment.

Yeah, I originally used rcu_sync_enter_start() but quickly found out that it
can't be reverted reliably. Given how cold the option switching path is, I
think it's fine to pay an extra synchronize_rcu() there rather than adding
more complexity to rcu_sync_enter() unless this will be useful somewhere
else too.

> And can't resist, off-topic question... Say, cgroup_attach_task_all() does
>
> mutex_lock(&cgroup_mutex);
> percpu_down_write(&cgroup_threadgroup_rwsem);
>
> and this means that synchronize_rcu() can be called with cgroup_mutex held.
> Perhaps it makes sense to change this code to do
>
> rcu_sync_enter(&cgroup_threadgroup_rwsem.rss);
> mutex_lock(&cgroup_mutex);
> percpu_down_write(&cgroup_threadgroup_rwsem);
> ...
> percpu_up_write(&cgroup_threadgroup_rwsem);
> mutex_unlock(&cgroup_mutex);
> rcu_sync_exit(&cgroup_threadgroup_rwsem.rss);
>
> ? Just curious.

I'm not quite following. Are you saying that if we switch the rwsem into
slow mode before grabbing the locks, we can avoid inducing latencies on
other users? Hmm... assuming that I'm understanding you correctly, one
problem with that approach is that everyone would be doing synchronize_rcu()
whether or not they want to change the favoring state. In the vast majority
of cases, users won't care about this flag, but most users will end up
mounting cgroup and doing the rcu_sync_enter(), so we'd end up adding a
grace period wait in most boot scenarios. It's not a lot in itself but seems
less desirable than making the users who want to change the mode pay at the
time of changing.

> > - /*
> > - * The latency of the synchronize_rcu() is too high for cgroups,
> > - * avoid it at the cost of forcing all readers into the slow path.
> > - */
> > - rcu_sync_enter_start(&cgroup_threadgroup_rwsem.rss);
>
> Note that it doesn't have any other users, so you can probably kill it.

Ah, nice, will do that.

Thanks.

--
tejun

2022-07-27 18:55:30

by Oleg Nesterov

Subject: Re: [PATCH RESEND 3/3 cgroup/for-5.20] cgroup: Make !percpu threadgroup_rwsem operations optional

Hi Tejun,

On 07/26, Tejun Heo wrote:
>
> > __rcu_sync_enter(rsp, false) works just like rcu_sync_enter_start() but it can
> > be safely called at any moment.
>
> Yeah, I originally used rcu_sync_enter_start() but quickly found out that it
> can't be reverted reliably. Given how cold the option switching path is, I
> think it's fine to pay an extra synchronize_rcu() there rather than adding
> more complexity to rcu_sync_enter() unless this will be useful somewhere
> else too.

Yes, agreed. As I said, this is just for the record, so that I can find this (simple)
patch on lkml if we have another user of __rcu_sync_enter(rsp, bool wait).

> > And can't resist, off-topic question... Say, cgroup_attach_task_all() does
> >
> > mutex_lock(&cgroup_mutex);
> > percpu_down_write(&cgroup_threadgroup_rwsem);
> >
> > and this means that synchronize_rcu() can be called with cgroup_mutex held.
> > Perhaps it makes sense to change this code to do
> >
> > rcu_sync_enter(&cgroup_threadgroup_rwsem.rss);
> > mutex_lock(&cgroup_mutex);
> > percpu_down_write(&cgroup_threadgroup_rwsem);
> > ...
> > percpu_up_write(&cgroup_threadgroup_rwsem);
> > mutex_unlock(&cgroup_mutex);
> > rcu_sync_exit(&cgroup_threadgroup_rwsem.rss);
> >
> > ? Just curious.
>
> I'm not quite following.

Me too ;)

> Are you saying that if we switch the rwsem into
> slow mode before grabbing the locks, we can avoid inducing latencies on
> other users?

Well yes, in that another mutex_lock(&cgroup_mutex) won't sleep until
synchronize_rcu() (called under cgroup_mutex) completes.

> Hmm... assuming that I'm understanding you correctly, one
> problem with that approach is that everyone would be doing synchronize_rcu()
> whether or not they want to change the favoring state.

Hmm... I didn't mean changing the favoring state... And in any case,
this won't cause any additional synchronize_rcu().

Nevermind, please forget, this probably makes no sense.

Oleg.