Hi,
This series adds cgroup awareness to unbound workqueues.
This is the second version since Bandan Das's post from a few years ago[1].
The design is completely new, but the code is still in development and I'm
posting early to get feedback on the design. Is this a good direction?
Thanks,
Daniel
Summary
-------
Cgroup controllers don't throttle workqueue workers for the most part, so
resource-intensive works run unchecked. Fix it by adding a new type of work
that causes the assigned worker to attach to the given cgroup.
Motivation
----------
Workqueue workers are currently always attached to root cgroups. If a task in
a child cgroup queues a resource-intensive work, the resource limits of the
child cgroup generally don't throttle the worker, with some exceptions such as
writeback.
My use case for this work is kernel multithreading, the series formerly known
as ktask[2] that I'm now trying to combine with padata according to feedback
from the last post. Helper threads in a multithreaded job may consume lots of
resources that aren't properly accounted to the cgroup of the task that started
the job.
Basic Idea
----------
I know of two basic ways to fix this, with other ideas welcome. They both use
the existing cgroup migration path to move workers to different cgroups.
#1 Maintain per-cgroup worker pools and queue works on these pools. A
worker in the pool is migrated once to the pool's assigned cgroup when
the worker is first created.
These days, users can have hundreds or thousands of cgroups on their systems,
which means that #1 could cause as many workers to be created across the pools,
bringing back the problems of MT workqueues[3]. The concurrency level could be
managed across the pools, but I don't see how to avoid thrashing on worker
creation and destruction with even demand for workers across cgroups. So #1
doesn't seem like the right way forward.
#2 Migrate a worker to the desired cgroup before it runs the work.
Worker pools are shared across cgroups, and workers migrate to
different cgroups as needed.
#2 has some issues of its own, namely cgroup_mutex and
cgroup_threadgroup_rwsem. Both prevent concurrent worker migrations, so for
this approach to scale, those locks need to be fixed. css_set_lock and
controller-specific locks may then also be a problem. Nevertheless, #2 keeps
the total number of workers low to accommodate systems with many cgroups.
This RFC implements #2. If the design looks good, I can start working on
fixing the locks, and I'd be thrilled if others wanted to help with this.
A third alternative, which takes inspiration from proxy execution (where a
task's scheduling context and execution context are treated separately[4]),
arose late in the development of this series. The idea is to allow a proxy task
to temporarily assume the cgroup characteristics of another task so that it can
use the other task's cgroup-related task_struct fields. The worker avoids the
performance and scalability cost of the migration path, but it also doesn't run
the attach callbacks, so controllers wouldn't work as designed without adding
special logic in various places to account for this situation. That doesn't
sound immediately appealing, but I haven't thought about it for very long.
Data Structures
---------------
Cgroup awareness is implemented per work with a new type of work item:
    struct cgroup_work {
            struct work_struct work;
    #ifdef CONFIG_CGROUPS
            struct cgroup *cgroup;
    #endif
    };
The cgroup field contains the cgroup to which the assigned worker should
attach. A new type is used so only those users who want cgroup awareness incur
the space overhead of the cgroup pointer. This feature is supported only for
cgroups on the default hierarchy, so one cgroup pointer is sufficient.
Workqueues may be created with the WQ_CGROUP flag. The flag means that cgroup
works, and only cgroup works, may be queued on this workqueue. Cgroup works
aren't allowed to be queued on !WQ_CGROUP workqueues.
This separation avoids the issues that come from cgroup_works and regular works
being queued together, such as distinguishing between the two on a worklist
(which probably means adding a new work data bit, increasing memory usage from
higher pool_workqueue alignment) or keeping multiple worklists and dealing
fairly with the question of which worklist to pick from next.
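For reference, here's a rough usage sketch built from the interfaces that the
test module later in the series exercises (alloc_workqueue() with WQ_CGROUP,
INIT_CGROUP_WORK_ONSTACK(), queue_cgroup_work()). Error handling is omitted,
and "example_wq", example_work_fn(), and "/example" are made-up names:

    static void example_work_fn(struct work_struct *work)
    {
            /* Runs in a kworker attached to the cgroup passed below. */
    }

    static int example(void)
    {
            struct workqueue_struct *wq;
            struct cgroup *cgrp;
            struct cgroup_work cwork;

            wq = alloc_workqueue("example_wq", WQ_UNBOUND | WQ_CGROUP, 0);
            cgrp = cgroup_get_from_path("/example");  /* v2 hierarchy path */

            INIT_CGROUP_WORK_ONSTACK(&cwork, example_work_fn);
            queue_cgroup_work(wq, &cwork, cgrp);  /* worker attaches to cgrp */
            flush_work(&cwork.work);

            cgroup_put(cgrp);
            destroy_workqueue(wq);
            return 0;
    }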
Migrating Kernel Threads
------------------------
Migrated worker pids appear in cgroup.procs and cgroup.threads, and they block
cgroup destruction (cgroup_rmdir) just as user tasks do. To alleviate this
somewhat, workers that have finished their work migrate themselves back to the
root cgroup before sleeping.
In addition, it's probably best to allow userland to destroy a cgroup when only
kernel threads remain (no user tasks left), with destruction finishing in the
background once all kernel threads have been migrated out. This is consistent
with current cgroup behavior, in which management apps, libraries, etc. may
expect destruction to succeed once all the tasks they know about have been
moved out. So that's tentatively on my TODO, but I'm curious what people think.
Task migration can fail for several reasons. On failure, the worker tries
migrating itself to the root cgroup. If _this_ fails too, the code currently
just warns, but it seems best to design this so that migrating a kernel thread
to the root can't fail. Otherwise, when both migrations fail, the work is
accounted to an unrelated, random cgroup.
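In pseudocode, the flow I have in mind is roughly this (illustrative only;
cgroup_attach_kthread() is a hypothetical stand-in for the kthread migration
interface that patch 1 adds):

    /* Sketch, not the actual code from this series. */
    static void worker_attach_to_work_cgroup(struct cgroup *cgrp)
    {
            if (!cgroup_attach_kthread(cgrp))  /* hypothetical name */
                    return;                    /* attached to work's cgroup */

            /*
             * Fall back to the root cgroup so the work isn't accounted to
             * whatever cgroup the worker happened to be left in.
             */
            WARN_ON_ONCE(cgroup_attach_kthread(&cgrp_dfl_root.cgrp));
    }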
Priority Inversion
------------------
One concern with cgroup-aware workers may be priority inversion[5]. I couldn't
find where this was discussed in detail, but it seems the issue is that a
worker could be throttled by some resource limit from its attached cgroup,
causing other work items' execution to be delayed a long time.
However, this doesn't seem to be a problem because of how worker pools are
managed. There's an invariant that at least one idle worker should exist in a
pool before a worker begins processing works, so that there will be at least
one worker per work item, avoiding the inversion.
It's possible that works from a large number of different resource-constrained
cgroups could cause as many workers to be created, with creation eventually
failing due, for example, to pid exhaustion, but in that extreme case workqueue
will retry repeatedly with a CREATE_COOLDOWN timeout. This seems good enough,
but I'm open to other ideas.
Testing
-------
A little, not a lot. I've sanity-checked that some controllers throttle
workers as expected (memory, cpu, pids), "believe" rdma should work, haven't
looked at io yet, and know cpuset is broken. For cpuset, I need to fix
->can_attach() for bound kthreads and reconcile the controller's cpumasks with
worker cpumasks.
In one experiment on a large Xeon server, a kernel thread was migrated 2M times
back and forth between two cgroups. The mean time per migration was 1 usec, so
cgroup-aware work items should take much longer than that for the migration to
be worth it.
TODO
----
- scale cgroup_mutex and cgroup_threadgroup_rwsem
- support the cpuset controller, and reconcile that with workqueue NUMA
awareness and worker_pool cpumasks
- support the io controller
- make kernel thread migration to the root cgroup always succeed
- maybe allow userland to destroy a cgroup with only kernel threads
Dependencies
------------
This series is against 5.1 plus some kernel multithreading patches (formerly
ktask). A branch with everything is available at
git://oss.oracle.com/git/linux-dmjordan.git cauwq-rfc-v2
The multithreading patches don't incorporate some of the feedback from the last
post[2] (yet) because I'm in the process of addressing the larger design
comments.
[1] http://lkml.kernel.org/r/[email protected]
[2] https://lore.kernel.org/linux-mm/[email protected]/
[3] https://lore.kernel.org/lkml/[email protected]/
[4] https://lore.kernel.org/lkml/[email protected]/
[5] https://lore.kernel.org/netdev/[email protected]/
Daniel Jordan (5):
cgroup: add cgroup v2 interfaces to migrate kernel threads
workqueue, cgroup: add cgroup-aware workqueues
workqueue, memcontrol: make memcg throttle workqueue workers
workqueue, cgroup: add test module
ktask, cgroup: attach helper threads to the master thread's cgroup
include/linux/cgroup.h | 43 +++
include/linux/workqueue.h | 85 +++++
kernel/cgroup/cgroup-internal.h | 1 -
kernel/cgroup/cgroup.c | 48 ++-
kernel/ktask.c | 32 +-
kernel/workqueue.c | 263 +++++++++++++-
kernel/workqueue_internal.h | 50 +++
lib/Kconfig.debug | 12 +
lib/Makefile | 1 +
lib/test_cgroup_workqueue.c | 325 ++++++++++++++++++
mm/memcontrol.c | 26 +-
.../selftests/cgroup_workqueue/Makefile | 9 +
.../testing/selftests/cgroup_workqueue/config | 1 +
.../cgroup_workqueue/test_cgroup_workqueue.sh | 104 ++++++
14 files changed, 963 insertions(+), 37 deletions(-)
create mode 100644 lib/test_cgroup_workqueue.c
create mode 100644 tools/testing/selftests/cgroup_workqueue/Makefile
create mode 100644 tools/testing/selftests/cgroup_workqueue/config
create mode 100755 tools/testing/selftests/cgroup_workqueue/test_cgroup_workqueue.sh
base-commit: e93c9c99a629c61837d5a7fc2120cd2b6c70dbdd
prerequisite-patch-id: 253830d9ec7ed8f9d10127c1bc61f2489c40f042
prerequisite-patch-id: 0fa4fe0d879ae76f8e16d15982d799d84f67bee3
prerequisite-patch-id: e2e8229b9d1a1efa75262910a597902e136a6214
prerequisite-patch-id: f67900739fe811de1d4d06e19a5aa180b46396d8
prerequisite-patch-id: 7349e563091d065d4ace565f3daca139d7d470ad
prerequisite-patch-id: 6cd37c09bb0902519d574080b5c56d61755935fe
prerequisite-patch-id: a07d6676fbb5ed07486a3160e6eced91ecef1842
prerequisite-patch-id: 17baa0481806fc48dcb32caffb973120ac599c8d
prerequisite-patch-id: 6e629bbeb6efdd69aa733731bc91690346f26f21
prerequisite-patch-id: 59630f99305aa371e024b652e754d67614481c29
prerequisite-patch-id: 46c713fed894a530e9a0a83ca2a7c1ae2c787a5f
prerequisite-patch-id: 4b233711a8bdd1ef9fa82e364f07595526036ff4
prerequisite-patch-id: 9caefc8d5d48d6ec42bad4336fa38a28118296da
prerequisite-patch-id: 04d008e4ffbe499ebc5b466b7777dabff9a6c77e
prerequisite-patch-id: 396a88dad48473d4b54f539fadeb7d601020098d
--
2.21.0
Test the basic functionality of cgroup-aware workqueues. Inspired by
Matthew Wilcox's test_xarray.c.
Signed-off-by: Daniel Jordan <[email protected]>
---
kernel/workqueue.c | 1 +
lib/Kconfig.debug | 12 +
lib/Makefile | 1 +
lib/test_cgroup_workqueue.c | 325 ++++++++++++++++++
.../selftests/cgroup_workqueue/Makefile | 9 +
.../testing/selftests/cgroup_workqueue/config | 1 +
.../cgroup_workqueue/test_cgroup_workqueue.sh | 104 ++++++
7 files changed, 453 insertions(+)
create mode 100644 lib/test_cgroup_workqueue.c
create mode 100644 tools/testing/selftests/cgroup_workqueue/Makefile
create mode 100644 tools/testing/selftests/cgroup_workqueue/config
create mode 100755 tools/testing/selftests/cgroup_workqueue/test_cgroup_workqueue.sh
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index c8cc69e296c0..15459b5bb0bf 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1825,6 +1825,7 @@ bool queue_cgroup_work_node(int node, struct workqueue_struct *wq,
local_irq_restore(flags);
return ret;
}
+EXPORT_SYMBOL_GPL(queue_cgroup_work_node);
static inline bool worker_in_child_cgroup(struct worker *worker)
{
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index d5a4a4036d2f..9909a306c142 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2010,6 +2010,18 @@ config TEST_STACKINIT
If unsure, say N.
+config TEST_CGROUP_WQ
+ tristate "Test cgroup-aware workqueues at runtime"
+ depends on CGROUPS
+ help
+ Test cgroup-aware workqueues, in which workers attach to the
+ cgroup specified by the queueing task. Basic test coverage
+ for whether workers attach to the expected cgroup, both for
+ cgroup-aware and unaware works, and whether workers are
+ throttled by the memory controller.
+
+ If unsure, say N.
+
endif # RUNTIME_TESTING_MENU
config MEMTEST
diff --git a/lib/Makefile b/lib/Makefile
index 18c2be516ab4..d08b4a50bfd1 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -92,6 +92,7 @@ obj-$(CONFIG_TEST_OBJAGG) += test_objagg.o
obj-$(CONFIG_TEST_STACKINIT) += test_stackinit.o
obj-$(CONFIG_TEST_LIVEPATCH) += livepatch/
+obj-$(CONFIG_TEST_CGROUP_WQ) += test_cgroup_workqueue.o
ifeq ($(CONFIG_DEBUG_KOBJECT),y)
CFLAGS_kobject.o += -DDEBUG
diff --git a/lib/test_cgroup_workqueue.c b/lib/test_cgroup_workqueue.c
new file mode 100644
index 000000000000..466ec4e6e55b
--- /dev/null
+++ b/lib/test_cgroup_workqueue.c
@@ -0,0 +1,325 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * test_cgroup_workqueue.c: Test cgroup-aware workqueues
+ * Copyright (c) 2019 Oracle and/or its affiliates. All rights reserved.
+ * Author: Daniel Jordan <[email protected]>
+ *
+ * Inspired by Matthew Wilcox's test_xarray.c.
+ */
+
+#include <linux/cgroup.h>
+#include <linux/kconfig.h>
+#include <linux/module.h>
+#include <linux/random.h>
+#include <linux/rcupdate.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+#include <linux/vmalloc.h>
+#include <linux/workqueue.h>
+
+static atomic_long_t cwq_tests_run = ATOMIC_LONG_INIT(0);
+static atomic_long_t cwq_tests_passed = ATOMIC_LONG_INIT(0);
+
+static struct workqueue_struct *cwq_cgroup_aware_wq;
+static struct workqueue_struct *cwq_cgroup_unaware_wq;
+
+/* cgroup v2 hierarchy mountpoint */
+static char *cwq_cgrp_root_path = "/test_cgroup_workqueue";
+module_param(cwq_cgrp_root_path, charp, 0444);
+static struct cgroup *cwq_cgrp_root;
+
+static char *cwq_cgrp_1_path = "/cwq_1";
+module_param(cwq_cgrp_1_path, charp, 0444);
+static struct cgroup *cwq_cgrp_1;
+
+static char *cwq_cgrp_2_path = "/cwq_2";
+module_param(cwq_cgrp_2_path, charp, 0444);
+static struct cgroup *cwq_cgrp_2;
+
+static char *cwq_cgrp_3_path = "/cwq_3";
+module_param(cwq_cgrp_3_path, charp, 0444);
+static struct cgroup *cwq_cgrp_3;
+
+static unsigned long cwq_memcg_max = 1ul << 20;
+module_param(cwq_memcg_max, ulong, 0444);
+
+#define CWQ_BUG(x, test_name) ({ \
+ int __ret_cwq_bug = !!(x); \
+ atomic_long_inc(&cwq_tests_run); \
+ if (__ret_cwq_bug) { \
+ pr_warn("BUG at %s:%d\n", __func__, __LINE__); \
+ pr_warn("%s\n", (test_name)); \
+ dump_stack(); \
+ } else { \
+ atomic_long_inc(&cwq_tests_passed); \
+ } \
+ __ret_cwq_bug; \
+})
+
+#define CWQ_BUG_ON(x) CWQ_BUG((x), "<no test name>")
+
+struct cwq_data {
+ struct cgroup_work cwork;
+ struct cgroup *expected_cgroup;
+ const char *test_name;
+};
+
+#define CWQ_INIT_DATA(data, func, expected_cgrp) do { \
+	INIT_WORK_ONSTACK(&(data)->cwork.work, (func)); \
+ (data)->expected_cgroup = (expected_cgrp); \
+} while (0)
+
+static void cwq_verify_worker_cgroup(struct work_struct *work)
+{
+ struct cgroup_work *cwork = container_of(work, struct cgroup_work,
+ work);
+ struct cwq_data *data = container_of(cwork, struct cwq_data, cwork);
+ struct cgroup *worker_cgroup;
+
+ CWQ_BUG(!(current->flags & PF_WQ_WORKER), data->test_name);
+
+ rcu_read_lock();
+ worker_cgroup = task_dfl_cgroup(current);
+ rcu_read_unlock();
+
+ CWQ_BUG(worker_cgroup != data->expected_cgroup, data->test_name);
+}
+
+static noinline void cwq_test_reg_work_on_cgrp_unaware_wq(void)
+{
+ struct cwq_data data;
+
+ data.expected_cgroup = cwq_cgrp_root;
+ data.test_name = __func__;
+ INIT_WORK_ONSTACK(&data.cwork.work, cwq_verify_worker_cgroup);
+
+ CWQ_BUG_ON(!queue_work(cwq_cgroup_unaware_wq, &data.cwork.work));
+ flush_work(&data.cwork.work);
+}
+
+static noinline void cwq_test_cgrp_work_on_cgrp_aware_wq(void)
+{
+ struct cwq_data data;
+
+ data.expected_cgroup = cwq_cgrp_1;
+ data.test_name = __func__;
+ INIT_CGROUP_WORK_ONSTACK(&data.cwork, cwq_verify_worker_cgroup);
+
+ CWQ_BUG_ON(!queue_cgroup_work(cwq_cgroup_aware_wq, &data.cwork,
+ cwq_cgrp_1));
+ flush_work(&data.cwork.work);
+}
+
+static struct cgroup *cwq_get_random_cgroup(void)
+{
+ switch (prandom_u32_max(4)) {
+ case 1: return cwq_cgrp_1;
+ case 2: return cwq_cgrp_2;
+ case 3: return cwq_cgrp_3;
+ default: return cwq_cgrp_root;
+ }
+}
+
+#define CWQ_NWORK 256
+static noinline void cwq_test_many_cgrp_works_on_cgrp_aware_wq(void)
+{
+ int i;
+ struct cwq_data *data_array = kmalloc_array(CWQ_NWORK,
+ sizeof(struct cwq_data),
+ GFP_KERNEL);
+ if (CWQ_BUG_ON(!data_array))
+ return;
+
+ for (i = 0; i < CWQ_NWORK; ++i) {
+ struct cgroup *cgrp = cwq_get_random_cgroup();
+
+ data_array[i].expected_cgroup = cgrp;
+ data_array[i].test_name = __func__;
+ INIT_CGROUP_WORK(&data_array[i].cwork,
+ cwq_verify_worker_cgroup);
+ CWQ_BUG_ON(!queue_cgroup_work(cwq_cgroup_aware_wq,
+ &data_array[i].cwork,
+ cgrp));
+ }
+
+ for (i = 0; i < CWQ_NWORK; ++i)
+ flush_work(&data_array[i].cwork.work);
+
+ kfree(data_array);
+}
+
+static void cwq_verify_worker_obeys_memcg(struct work_struct *work)
+{
+ struct cgroup_work *cwork = container_of(work, struct cgroup_work,
+ work);
+ struct cwq_data *data = container_of(cwork, struct cwq_data, cwork);
+ struct cgroup *worker_cgroup;
+ void *mem;
+
+ CWQ_BUG(!(current->flags & PF_WQ_WORKER), data->test_name);
+
+ rcu_read_lock();
+ worker_cgroup = task_dfl_cgroup(current);
+ rcu_read_unlock();
+
+ CWQ_BUG(worker_cgroup != data->expected_cgroup, data->test_name);
+
+ mem = __vmalloc(cwq_memcg_max * 2, __GFP_ACCOUNT | __GFP_NOWARN,
+ PAGE_KERNEL);
+ if (data->expected_cgroup == cwq_cgrp_2) {
+ /*
+ * cwq_cgrp_2 has its memory.max set to cwq_memcg_max, so the
+ * allocation should fail.
+ */
+ CWQ_BUG(mem, data->test_name);
+ } else {
+ /*
+ * Other cgroups don't have a memory.max limit, so the
+ * allocation should succeed.
+ */
+ CWQ_BUG(!mem, data->test_name);
+ }
+ vfree(mem);
+}
+
+static noinline void cwq_test_reg_work_is_not_throttled_by_memcg(void)
+{
+ struct cwq_data data;
+
+ if (!IS_ENABLED(CONFIG_MEMCG_KMEM) || !cwq_memcg_max)
+ return;
+
+ data.expected_cgroup = cwq_cgrp_root;
+ data.test_name = __func__;
+ INIT_WORK_ONSTACK(&data.cwork.work, cwq_verify_worker_obeys_memcg);
+ CWQ_BUG_ON(!queue_work(cwq_cgroup_unaware_wq, &data.cwork.work));
+ flush_work(&data.cwork.work);
+}
+
+static noinline void cwq_test_cgrp_work_is_throttled_by_memcg(void)
+{
+ struct cwq_data data;
+
+ if (!IS_ENABLED(CONFIG_MEMCG_KMEM) || !cwq_memcg_max)
+ return;
+
+ /*
+ * The kselftest shell script enables the memory controller in
+ * cwq_cgrp_2 and sets memory.max to cwq_memcg_max.
+ */
+ data.expected_cgroup = cwq_cgrp_2;
+ data.test_name = __func__;
+ INIT_CGROUP_WORK_ONSTACK(&data.cwork, cwq_verify_worker_obeys_memcg);
+
+ CWQ_BUG_ON(!queue_cgroup_work(cwq_cgroup_aware_wq, &data.cwork,
+ cwq_cgrp_2));
+ flush_work(&data.cwork.work);
+}
+
+static noinline void cwq_test_cgrp_work_is_not_throttled_by_memcg(void)
+{
+ struct cwq_data data;
+
+ if (!IS_ENABLED(CONFIG_MEMCG_KMEM) || !cwq_memcg_max)
+ return;
+
+ /*
+ * The kselftest shell script doesn't set a memory limit for cwq_cgrp_1
+ * or _3, so a cgroup work should still be able to allocate memory
+ * above that limit.
+ */
+ data.expected_cgroup = cwq_cgrp_1;
+ data.test_name = __func__;
+ INIT_CGROUP_WORK_ONSTACK(&data.cwork, cwq_verify_worker_obeys_memcg);
+
+ CWQ_BUG_ON(!queue_cgroup_work(cwq_cgroup_aware_wq, &data.cwork,
+ cwq_cgrp_1));
+ flush_work(&data.cwork.work);
+
+ /*
+ * And cgroup workqueues shouldn't choke on a cgroup that's disabled
+ * the memory controller, such as cwq_cgrp_3.
+ */
+ data.expected_cgroup = cwq_cgrp_3;
+ data.test_name = __func__;
+ INIT_CGROUP_WORK_ONSTACK(&data.cwork, cwq_verify_worker_obeys_memcg);
+
+ CWQ_BUG_ON(!queue_cgroup_work(cwq_cgroup_aware_wq, &data.cwork,
+ cwq_cgrp_3));
+ flush_work(&data.cwork.work);
+}
+
+static int cwq_init(void)
+{
+ s64 passed, run;
+
+ pr_warn("cgroup workqueue test module\n");
+
+ cwq_cgroup_aware_wq = alloc_workqueue("cwq_cgroup_aware_wq",
+ WQ_UNBOUND | WQ_CGROUP, 0);
+ if (!cwq_cgroup_aware_wq) {
+ pr_warn("cwq_cgroup_aware_wq allocation failed\n");
+ return -EAGAIN;
+ }
+
+ cwq_cgroup_unaware_wq = alloc_workqueue("cwq_cgroup_unaware_wq",
+ WQ_UNBOUND, 0);
+ if (!cwq_cgroup_unaware_wq) {
+ pr_warn("cwq_cgroup_unaware_wq allocation failed\n");
+ goto alloc_wq_fail;
+ }
+
+ cwq_cgrp_root = cgroup_get_from_path("/");
+ if (IS_ERR(cwq_cgrp_root)) {
+ pr_warn("can't get root cgroup\n");
+ goto cgroup_get_fail;
+ }
+
+ cwq_cgrp_1 = cgroup_get_from_path(cwq_cgrp_1_path);
+ if (IS_ERR(cwq_cgrp_1)) {
+ pr_warn("can't get child cgroup 1\n");
+ goto cgroup_get_fail;
+ }
+
+ cwq_cgrp_2 = cgroup_get_from_path(cwq_cgrp_2_path);
+ if (IS_ERR(cwq_cgrp_2)) {
+ pr_warn("can't get child cgroup 2\n");
+ goto cgroup_get_fail;
+ }
+
+ cwq_cgrp_3 = cgroup_get_from_path(cwq_cgrp_3_path);
+ if (IS_ERR(cwq_cgrp_3)) {
+ pr_warn("can't get child cgroup 3\n");
+ goto cgroup_get_fail;
+ }
+
+ cwq_test_reg_work_on_cgrp_unaware_wq();
+ cwq_test_cgrp_work_on_cgrp_aware_wq();
+ cwq_test_many_cgrp_works_on_cgrp_aware_wq();
+ cwq_test_reg_work_is_not_throttled_by_memcg();
+ cwq_test_cgrp_work_is_throttled_by_memcg();
+ cwq_test_cgrp_work_is_not_throttled_by_memcg();
+
+ passed = atomic_long_read(&cwq_tests_passed);
+ run = atomic_long_read(&cwq_tests_run);
+ pr_warn("cgroup workqueues: %lld of %lld tests passed\n", passed, run);
+ return (run == passed) ? 0 : -EINVAL;
+
+cgroup_get_fail:
+ destroy_workqueue(cwq_cgroup_unaware_wq);
+alloc_wq_fail:
+ destroy_workqueue(cwq_cgroup_aware_wq);
+ return -EAGAIN; /* better ideas? */
+}
+
+static void cwq_exit(void)
+{
+ destroy_workqueue(cwq_cgroup_aware_wq);
+ destroy_workqueue(cwq_cgroup_unaware_wq);
+}
+
+module_init(cwq_init);
+module_exit(cwq_exit);
+MODULE_AUTHOR("Daniel Jordan <[email protected]>");
+MODULE_LICENSE("GPL");
diff --git a/tools/testing/selftests/cgroup_workqueue/Makefile b/tools/testing/selftests/cgroup_workqueue/Makefile
new file mode 100644
index 000000000000..2ba1cd922670
--- /dev/null
+++ b/tools/testing/selftests/cgroup_workqueue/Makefile
@@ -0,0 +1,9 @@
+# SPDX-License-Identifier: GPL-2.0
+
+all:
+
+TEST_PROGS := test_cgroup_workqueue.sh
+
+include ../lib.mk
+
+clean:
diff --git a/tools/testing/selftests/cgroup_workqueue/config b/tools/testing/selftests/cgroup_workqueue/config
new file mode 100644
index 000000000000..ae38b8f3c3db
--- /dev/null
+++ b/tools/testing/selftests/cgroup_workqueue/config
@@ -0,0 +1 @@
+CONFIG_TEST_CGROUP_WQ=m
diff --git a/tools/testing/selftests/cgroup_workqueue/test_cgroup_workqueue.sh b/tools/testing/selftests/cgroup_workqueue/test_cgroup_workqueue.sh
new file mode 100755
index 000000000000..33251276d2cf
--- /dev/null
+++ b/tools/testing/selftests/cgroup_workqueue/test_cgroup_workqueue.sh
@@ -0,0 +1,104 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# Runs cgroup workqueue kernel module tests
+
+# hierarchy: root
+# / \
+# 1 2
+# /
+# 3
+CG_ROOT='/test_cgroup_workqueue'
+CG_1='/cwq_1'
+CG_2='/cwq_2'
+CG_3='/cwq_3'
+MEMCG_MAX="$((2**16))" # small but arbitrary amount
+
+# Kselftest framework requirement - SKIP code is 4.
+ksft_skip=4
+
+cg_mnt_created=''
+cg_mnt_mounted=''
+cg_1_created=''
+cg_2_created=''
+cg_3_created=''
+
+cleanup()
+{
+	[ -z "$cg_3_created" ] || rmdir "$CG_ROOT/$CG_1/$CG_3" || exit 1
+	[ -z "$cg_2_created" ] || rmdir "$CG_ROOT/$CG_2" || exit 1
+	[ -z "$cg_1_created" ] || rmdir "$CG_ROOT/$CG_1" || exit 1
+	[ -z "$cg_mnt_mounted" ] || umount "$CG_ROOT" || exit 1
+	[ -z "$cg_mnt_created" ] || rmdir "$CG_ROOT" || exit 1
+ exit "$1"
+}
+
+if ! /sbin/modprobe -q -n test_cgroup_workqueue; then
+ echo "cgroup_workqueue: module test_cgroup_workqueue not found [SKIP]"
+ exit $ksft_skip
+fi
+
+# Setup cgroup v2 hierarchy
+if mkdir "$CG_ROOT"; then
+ cg_mnt_created=1
+else
+ echo "cgroup_workqueue: can't create cgroup mountpoint at $CG_ROOT"
+ cleanup 1
+fi
+
+if mount -t cgroup2 none "$CG_ROOT"; then
+ cg_mnt_mounted=1
+else
+ echo "cgroup_workqueue: couldn't mount cgroup hierarchy at $CG_ROOT"
+ cleanup 1
+fi
+
+if grep -q memory "$CG_ROOT/cgroup.controllers"; then
+ /bin/echo +memory > "$CG_ROOT/cgroup.subtree_control"
+ cwq_memcg_max="$MEMCG_MAX"
+else
+ # Tell test module not to run memory.max tests.
+ cwq_memcg_max=0
+fi
+
+if mkdir "$CG_ROOT/$CG_1"; then
+ cg_1_created=1
+else
+ echo "cgroup_workqueue: can't mkdir $CG_ROOT/$CG_1"
+ cleanup 1
+fi
+
+if mkdir "$CG_ROOT/$CG_2"; then
+ cg_2_created=1
+else
+ echo "cgroup_workqueue: can't mkdir $CG_ROOT/$CG_2"
+ cleanup 1
+fi
+
+if mkdir "$CG_ROOT/$CG_1/$CG_3"; then
+ cg_3_created=1
+ # Ensure the memory controller is disabled as expected in $CG_3's
+ # parent, $CG_1, for testing.
+ if grep -q memory "$CG_ROOT/$CG_1/cgroup.subtree_control"; then
+ /bin/echo -memory > "$CG_ROOT/$CG_1/cgroup.subtree_control"
+ fi
+else
+ echo "cgroup_workqueue: can't mkdir $CG_ROOT/$CG_1/$CG_3"
+ cleanup 1
+fi
+
+if [ "$cwq_memcg_max" -ne 0 ]; then
+ /bin/echo "$MEMCG_MAX" > "$CG_ROOT/$CG_2/memory.max"
+fi
+
+if /sbin/modprobe -q test_cgroup_workqueue cwq_cgrp_root_path="$CG_ROOT" \
+ cwq_cgrp_1_path="$CG_1" \
+ cwq_cgrp_2_path="$CG_2" \
+ cwq_cgrp_3_path="$CG_1/$CG_3" \
+ cwq_memcg_max="$cwq_memcg_max"; then
+ echo "cgroup_workqueue: ok"
+ /sbin/modprobe -q -r test_cgroup_workqueue
+ cleanup 0
+else
+ echo "cgroup_workqueue: [FAIL]"
+ cleanup 1
+fi
--
2.21.0
Hello, Daniel.
On Wed, Jun 05, 2019 at 09:36:45AM -0400, Daniel Jordan wrote:
> My use case for this work is kernel multithreading, the series formerly known
> as ktask[2] that I'm now trying to combine with padata according to feedback
> from the last post. Helper threads in a multithreaded job may consume lots of
> resources that aren't properly accounted to the cgroup of the task that started
> the job.
Can you please go into more details on the use cases?
For memory and io, we're generally going for remote charging, where a
kthread explicitly says who the specific io or allocation is for,
combined with selective back-charging, where the resource is charged
and consumed unconditionally even if that would put the usage above
the current limits temporarily. From what I've been seeing recently,
combination of the two give us really good control quality without
being too invasive across the stack.
CPU doesn't have a backcharging mechanism yet and depending on the use
case, we *might* need to put kthreads in different cgroups. However,
such use cases might not be that abundant and there may be gotchas
which require them to be force-executed and back-charged (e.g. fs
compression from global reclaim).
Thanks.
--
tejun
Hi Tejun,
On Wed, Jun 05, 2019 at 06:53:19AM -0700, Tejun Heo wrote:
> On Wed, Jun 05, 2019 at 09:36:45AM -0400, Daniel Jordan wrote:
> > My use case for this work is kernel multithreading, the series formerly known
> > as ktask[2] that I'm now trying to combine with padata according to feedback
> > from the last post. Helper threads in a multithreaded job may consume lots of
> > resources that aren't properly accounted to the cgroup of the task that started
> > the job.
>
> Can you please go into more details on the use cases?
Sure, quoting from the last ktask post:
A single CPU can spend an excessive amount of time in the kernel operating
on large amounts of data. Often these situations arise during initialization-
and destruction-related tasks, where the data involved scales with system size.
These long-running jobs can slow startup and shutdown of applications and the
system itself while extra CPUs sit idle.
To ensure that applications and the kernel continue to perform well as core
counts and memory sizes increase, harness these idle CPUs to complete such jobs
more quickly.
ktask is a generic framework for parallelizing CPU-intensive work in the
kernel. The API is generic enough to add concurrency to many different kinds
of tasks--for example, zeroing a range of pages or evicting a list of
inodes--and aims to save its clients the trouble of splitting up the work,
choosing the number of threads to use, maintaining an efficient concurrency
level, starting these threads, and load balancing the work between them.
So far the users of the framework primarily consume CPU and memory.
> For memory and io, we're generally going for remote charging, where a
> kthread explicitly says who the specific io or allocation is for,
> combined with selective back-charging, where the resource is charged
> and consumed unconditionally even if that would put the usage above
> the current limits temporarily. From what I've been seeing recently,
> combination of the two give us really good control quality without
> being too invasive across the stack.
Yes, for memory I actually use remote charging. In patch 3 the worker's
current->active_memcg field is changed to match that of the cgroup associated
with the work.
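Concretely, the pattern around running a work item is roughly (just a sketch
of remote charging with the existing memalloc_use_memcg() /
memalloc_unuse_memcg() helpers, not the literal patch; memcg_from_cgroup() is
a hypothetical lookup helper):

    struct mem_cgroup *memcg = memcg_from_cgroup(cwork->cgroup);

    memalloc_use_memcg(memcg);      /* sets current->active_memcg */
    cwork->work.func(&cwork->work);
    memalloc_unuse_memcg();         /* clears it again */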
Cc Shakeel, since we're talking about it.
> CPU doesn't have a backcharging mechanism yet and depending on the use
> case, we *might* need to put kthreads in different cgroups. However,
> such use cases might not be that abundant and there may be gotchas
> which require them to be force-executed and back-charged (e.g. fs
> compression from global reclaim).
The CPU-intensiveness of these works is one of the reasons for actually putting
the workers through the migration path. I don't know of a way to get the
workers to respect the cpu controller (and even cpuset for that matter) without
doing that.
Thanks for the quick feedback.
Daniel
Hi Tejun,
On Wed, Jun 05, 2019 at 06:53:19AM -0700, Tejun Heo wrote:
> Hello, Daniel.
>
> On Wed, Jun 05, 2019 at 09:36:45AM -0400, Daniel Jordan wrote:
> > My use case for this work is kernel multithreading, the series formerly known
> > as ktask[2] that I'm now trying to combine with padata according to feedback
> > from the last post. Helper threads in a multithreaded job may consume lots of
> > resources that aren't properly accounted to the cgroup of the task that started
> > the job.
>
> Can you please go into more details on the use cases?
If I remember correctly, Bandan's original work was about using
workqueues instead of kthreads in vhost.
> For memory and io, we're generally going for remote charging, where a
> kthread explicitly says who the specific io or allocation is for,
> combined with selective back-charging, where the resource is charged
> and consumed unconditionally even if that would put the usage above
> the current limits temporarily. From what I've been seeing recently,
> combination of the two give us really good control quality without
> being too invasive across the stack.
>
> CPU doesn't have a backcharging mechanism yet and depending on the use
> case, we *might* need to put kthreads in different cgroups. However,
> such use cases might not be that abundant and there may be gotchas
> which require them to be force-executed and back-charged (e.g. fs
> compression from global reclaim).
>
> Thanks.
>
> --
> tejun
>
--
Sincerely yours,
Mike.
Hello,
On Thu, Jun 06, 2019 at 09:15:26AM +0300, Mike Rapoport wrote:
> > Can you please go into more details on the use cases?
>
> If I remember correctly, Bandan's original work was about using
> workqueues instead of kthreads in vhost.
For vhosts, I think it might be better to stick with kthread or
kthread_worker given that they can consume lots of cpu cycles over a
long period of time and we want to keep persistent track of scheduling
states.
Thanks.
--
tejun
Hello, Daniel.
On Wed, Jun 05, 2019 at 11:32:29AM -0400, Daniel Jordan wrote:
> Sure, quoting from the last ktask post:
>
> A single CPU can spend an excessive amount of time in the kernel operating
> on large amounts of data. Often these situations arise during initialization-
> and destruction-related tasks, where the data involved scales with system size.
> These long-running jobs can slow startup and shutdown of applications and the
> system itself while extra CPUs sit idle.
>
> To ensure that applications and the kernel continue to perform well as core
> counts and memory sizes increase, harness these idle CPUs to complete such jobs
> more quickly.
>
> ktask is a generic framework for parallelizing CPU-intensive work in the
> kernel. The API is generic enough to add concurrency to many different kinds
> of tasks--for example, zeroing a range of pages or evicting a list of
> inodes--and aims to save its clients the trouble of splitting up the work,
> choosing the number of threads to use, maintaining an efficient concurrency
> level, starting these threads, and load balancing the work between them.
Yeah, that rings a bell.
> > For memory and io, we're generally going for remote charging, where a
> > kthread explicitly says who the specific io or allocation is for,
> > combined with selective back-charging, where the resource is charged
> > and consumed unconditionally even if that would put the usage above
> > the current limits temporarily. From what I've been seeing recently,
> > combination of the two give us really good control quality without
> > being too invasive across the stack.
>
> Yes, for memory I actually use remote charging. In patch 3 the worker's
> current->active_memcg field is changed to match that of the cgroup associated
> with the work.
I see.
> > CPU doesn't have a backcharging mechanism yet and depending on the use
> > case, we *might* need to put kthreads in different cgroups. However,
> > such use cases might not be that abundant and there may be gotchas
> > which require them to be force-executed and back-charged (e.g. fs
> > compression from global reclaim).
>
> The CPU-intensiveness of these works is one of the reasons for actually putting
> the workers through the migration path. I don't know of a way to get the
> workers to respect the cpu controller (and even cpuset for that matter) without
> doing that.
So, I still think it'd likely be better to go the back-charging route than
actually putting kworkers in non-root cgroups. That's gonna be way
cheaper, simpler and makes avoiding inadvertent priority inversions
trivial.
Thanks.
--
tejun
On Tue, Jun 11, 2019 at 12:55:49PM -0700, Tejun Heo wrote:
> > > CPU doesn't have a backcharging mechanism yet and depending on the use
> > > case, we *might* need to put kthreads in different cgroups. However,
> > > such use cases might not be that abundant and there may be gotchas
> > > which require them to be force-executed and back-charged (e.g. fs
> > > compression from global reclaim).
> >
> > The CPU-intensiveness of these works is one of the reasons for actually putting
> > the workers through the migration path. I don't know of a way to get the
> > workers to respect the cpu controller (and even cpuset for that matter) without
> > doing that.
>
> So, I still think it'd likely be better to go the back-charging route than
> actually putting kworkers in non-root cgroups. That's gonna be way
> cheaper, simpler and makes avoiding inadvertent priority inversions
> trivial.
Ok, I'll experiment with backcharging in the cpu controller. The initial plan
is to smooth out resource usage by backcharging after each chunk of work that
each helper thread does rather than doing one giant backcharge after the
multithreaded job is over. It may turn out better performance-wise to do it
less often than this.
I'll also experiment with getting workqueue workers to respect cpuset without
migrating. It seems to make sense to use the intersection of an unbound
worker's cpumask and the cpuset's cpumask, and to make some compromises if the
result is empty.
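Roughly, something like this (sketch only; cpuset_effective_cpus() is a
placeholder for however the cpuset's effective mask ends up being exposed):

    cpumask_var_t mask;

    if (!alloc_cpumask_var(&mask, GFP_KERNEL))
            return;

    /* Intersect the pool's mask with the cgroup's effective cpuset. */
    cpumask_and(mask, pool->attrs->cpumask, cpuset_effective_cpus(cgrp));
    if (cpumask_empty(mask))
            cpumask_copy(mask, pool->attrs->cpumask);  /* the compromise */

    set_cpus_allowed_ptr(worker->task, mask);
    free_cpumask_var(mask);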
Daniel