2012-11-28 21:34:31

by Tejun Heo

Subject: [PATCHSET cgroup/for-3.8] cpuset: decouple cpuset locking from cgroup core

Hello, guys.

Depending on cgroup core locking - cgroup_mutex - is messy and makes
cgroup prone to locking dependency problems. The current code already
has a lock dependency loop - memcg nests get_online_cpus() inside
cgroup_mutex, while cpuset nests them the other way around.

Regardless of the locking details, whatever is protecting cgroup
inherently has to be something outer to most other locking constructs.
cgroup calls into a lot of major subsystems which in turn have to
perform subsystem-specific locking. Trying to nest cgroup
synchronization inside other locks isn't something which can work
well.

cgroup now has enough API to allow subsystems to implement their own
locking and cgroup_mutex is scheduled to be made private to cgroup
core. This patchset makes cpuset implement its own locking instead of
relying on cgroup_mutex.

cpuset is rather nasty in this respect. Some of it seems to have come
from the implementation history - cgroup core grew out of cpuset - but
a big part stems from cpuset's need to migrate tasks to an ancestor
cgroup when a hotunplug event makes a cpuset empty (w/o any cpu or
memory).

This patchset decouples cpuset locking from cgroup_mutex. After the
patchset, cpuset uses cpuset-specific cpuset_mutex instead of
cgroup_mutex. This also removes the lockdep warning triggered during
cpu offlining (see 0009).
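
In concrete terms, the conversion replaces cgroup_lock() /
cgroup_unlock() in cpuset's cgroup callbacks with a subsystem-private
mutex taken explicitly in each callback. A minimal sketch of the
resulting shape inside kernel/cpuset.c - illustrative only, and
update_the_flag() is a hypothetical stand-in; see 0013 for the actual
conversion:

static DEFINE_MUTEX(cpuset_mutex);

static int cpuset_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val)
{
        struct cpuset *cs = cgroup_cs(cgrp);
        int retval = -ENODEV;

        mutex_lock(&cpuset_mutex);
        if (!is_cpuset_online(cs))      /* replaces cgroup_lock_live_group() */
                goto out_unlock;
        retval = update_the_flag(cs, cft->private, val); /* hypothetical helper */
out_unlock:
        mutex_unlock(&cpuset_mutex);
        return retval;
}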

Note that this leaves memcg as the only external user of cgroup_mutex.
Michal, Kame, can you guys please convert memcg to use its own locking
too?

This patchset contains the following thirteen patches.

0001-cpuset-remove-unused-cpuset_unlock.patch
0002-cpuset-remove-fast-exit-path-from-remove_tasks_in_em.patch
0003-cpuset-introduce-css_on-offline.patch
0004-cpuset-introduce-CS_ONLINE.patch
0005-cpuset-introduce-cpuset_for_each_child.patch
0006-cpuset-cleanup-cpuset-_can-_attach.patch
0007-cpuset-drop-async_rebuild_sched_domains.patch
0008-cpuset-reorganize-CPU-memory-hotplug-handling.patch
0009-cpuset-don-t-nest-cgroup_mutex-inside-get_online_cpu.patch
0010-cpuset-make-CPU-memory-hotplug-propagation-asynchron.patch
0011-cpuset-pin-down-cpus-and-mems-while-a-task-is-being-.patch
0012-cpuset-schedule-hotplug-propagation-from-cpuset_atta.patch
0013-cpuset-replace-cgroup_mutex-locking-with-cpuset-inte.patch

0001-0006 are prep patches.

0007-0009 make cpuset nest get_online_cpus() inside cgroup_mutex, not
the other way around.

0010-0012 plug holes which would be exposed by switching to
cpuset-specific locking.

0013 replaces cgroup_mutex with cpuset_mutex.

This patchset is on top of cgroup/for-3.8 (fddfb02ad0) and also
available in the following git branch.

git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-cpuset-locking

diffstat follows.

kernel/cpuset.c | 750 +++++++++++++++++++++++++++++++-------------------------
1 file changed, 423 insertions(+), 327 deletions(-)

Thanks.

--
tejun


2012-11-28 21:34:33

by Tejun Heo

Subject: [PATCH 01/13] cpuset: remove unused cpuset_unlock()

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/cpuset.c | 11 -----------
1 file changed, 11 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index b017887..a423774 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -2412,17 +2412,6 @@ int __cpuset_node_allowed_hardwall(int node, gfp_t gfp_mask)
}

/**
- * cpuset_unlock - release lock on cpuset changes
- *
- * Undo the lock taken in a previous cpuset_lock() call.
- */
-
-void cpuset_unlock(void)
-{
- mutex_unlock(&callback_mutex);
-}
-
-/**
* cpuset_mem_spread_node() - On which node to begin search for a file page
* cpuset_slab_spread_node() - On which node to begin search for a slab page
*
--
1.7.11.7

2012-11-28 21:34:38

by Tejun Heo

Subject: [PATCH 02/13] cpuset: remove fast exit path from remove_tasks_in_empty_cpuset()

The function isn't that hot, the overhead of missing the fast exit is
low, the test itself depends heavily on cgroup internals, and it's
gonna be a hindrance when trying to decouple cpuset locking from
cgroup core. Remove the fast exit path.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/cpuset.c | 8 --------
1 file changed, 8 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index a423774..54b2b73 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1968,14 +1968,6 @@ static void remove_tasks_in_empty_cpuset(struct cpuset *cs)
struct cpuset *parent;

/*
- * The cgroup's css_sets list is in use if there are tasks
- * in the cpuset; the list is empty if there are none;
- * the cs->css.refcnt seems always 0.
- */
- if (list_empty(&cs->css.cgroup->css_sets))
- return;
-
- /*
* Find its next-highest non-empty parent, (top cpuset
* has online cpus, so can't be empty).
*/
--
1.7.11.7

2012-11-28 21:34:47

by Tejun Heo

Subject: [PATCH 06/13] cpuset: cleanup cpuset[_can]_attach()

cpuset_can_attach() prepares the global variables cpus_attach and
cpuset_attach_nodemask_{to|from}, which are used by cpuset_attach().
There is no reason to prepare them in cpuset_can_attach(); the same
information can be accessed from cpuset_attach().

Move the preparation logic from cpuset_can_attach() to cpuset_attach()
and make the global variables static variables local to
cpuset_attach().

While at it, convert cpus_attach to cpumask_t from cpumask_var_t.
There's no reason to mess with dynamic allocation on a static buffer.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/cpuset.c | 34 +++++++++++++---------------------
1 file changed, 13 insertions(+), 21 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 5a52ed6..8bdd983 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1395,15 +1395,6 @@ static int fmeter_getrate(struct fmeter *fmp)
return val;
}

-/*
- * Protected by cgroup_lock. The nodemasks must be stored globally because
- * dynamically allocating them is not allowed in can_attach, and they must
- * persist until attach.
- */
-static cpumask_var_t cpus_attach;
-static nodemask_t cpuset_attach_nodemask_from;
-static nodemask_t cpuset_attach_nodemask_to;
-
/* Called by cgroups to determine if a cpuset is usable; cgroup_mutex held */
static int cpuset_can_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
{
@@ -1430,19 +1421,15 @@ static int cpuset_can_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
return ret;
}

- /* prepare for attach */
- if (cs == &top_cpuset)
- cpumask_copy(cpus_attach, cpu_possible_mask);
- else
- guarantee_online_cpus(cs, cpus_attach);
-
- guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
-
return 0;
}

static void cpuset_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
{
+ /* static bufs protected by cgroup_mutex */
+ static cpumask_t cpus_attach;
+ static nodemask_t cpuset_attach_nodemask_from;
+ static nodemask_t cpuset_attach_nodemask_to;
struct mm_struct *mm;
struct task_struct *task;
struct task_struct *leader = cgroup_taskset_first(tset);
@@ -1450,12 +1437,20 @@ static void cpuset_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
struct cpuset *cs = cgroup_cs(cgrp);
struct cpuset *oldcs = cgroup_cs(oldcgrp);

+ /* prepare for attach */
+ if (cs == &top_cpuset)
+ cpumask_copy(&cpus_attach, cpu_possible_mask);
+ else
+ guarantee_online_cpus(cs, &cpus_attach);
+
+ guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
+
cgroup_taskset_for_each(task, cgrp, tset) {
/*
* can_attach beforehand should guarantee that this doesn't
* fail. TODO: have a better way to handle failure here
*/
- WARN_ON_ONCE(set_cpus_allowed_ptr(task, cpus_attach));
+ WARN_ON_ONCE(set_cpus_allowed_ptr(task, &cpus_attach));

cpuset_change_task_nodemask(task, &cpuset_attach_nodemask_to);
cpuset_update_task_spread_flag(cs, task);
@@ -1958,9 +1953,6 @@ int __init cpuset_init(void)
if (err < 0)
return err;

- if (!alloc_cpumask_var(&cpus_attach, GFP_KERNEL))
- BUG();
-
number_of_cpusets = 1;
return 0;
}
--
1.7.11.7

2012-11-28 21:34:55

by Tejun Heo

Subject: [PATCH 09/13] cpuset: don't nest cgroup_mutex inside get_online_cpus()

The CPU / memory hotplug path currently grabs cgroup_mutex from
hotplug event notifications. As other places nest the two the other
way around, we end up with the lockdep warning attached below.

We want to keep cgroup_mutex outer to most locks including hotplug
ones. Break the circular dependency by handling hotplug from a work
item. Convert cpuset_handle_hotplug() to cpuset_hotplug_workfn() and
schedule it from the hotplug notifications. As the function can
already handle multiple mixed events without any input, converting it
to a work function is trivial.

This decouples cpuset hotplug handling from the notification
callbacks, so there can be an arbitrary delay between the actual event
and the cpuset updates. The scheduler and mm can handle that fine, but
moving tasks out of an empty cpuset may race against writes to the
cpuset that restore its execution resources, which can lead to
confusing behavior. Flush the hotplug work item from
cpuset_write_resmask() to avoid such
confusions.
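
Condensed, the deferral pattern adopted here looks like the following
sketch (do_resmask_update() is a hypothetical stand-in for the rest of
the write path; the real hunks are further below):

static void cpuset_hotplug_workfn(struct work_struct *work);
static DECLARE_WORK(cpuset_hotplug_work, cpuset_hotplug_workfn);

/* notifier side: only schedule, never grab cgroup_mutex here */
void cpuset_update_active_cpus(bool cpu_online)
{
        schedule_work(&cpuset_hotplug_work);
}

/* config-write side: wait for pending hotplug processing first */
static int cpuset_write_resmask(struct cgroup *cgrp, struct cftype *cft,
                                const char *buf)
{
        flush_work(&cpuset_hotplug_work);
        return do_resmask_update(cgrp, cft, buf);
}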

======================================================
[ INFO: possible circular locking dependency detected ]
3.7.0-rc4-work+ #42 Not tainted
-------------------------------------------------------
bash/645 is trying to acquire lock:
(cgroup_mutex){+.+.+.}, at: [<ffffffff8110c5b7>] cgroup_lock+0x17/0x20

but task is already holding lock:
(cpu_hotplug.lock){+.+.+.}, at: [<ffffffff8109300f>] cpu_hotplug_begin+0x2f/0x60

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #1 (cpu_hotplug.lock){+.+.+.}:
[<ffffffff810f8357>] lock_acquire+0x97/0x1e0
[<ffffffff81be4701>] mutex_lock_nested+0x61/0x3b0
[<ffffffff810930fc>] get_online_cpus+0x3c/0x60
[<ffffffff811152fb>] rebuild_sched_domains_locked+0x1b/0x70
[<ffffffff81116718>] cpuset_write_resmask+0x298/0x2c0
[<ffffffff8110f70f>] cgroup_file_write+0x1ef/0x300
[<ffffffff811c3b78>] vfs_write+0xa8/0x160
[<ffffffff811c3e82>] sys_write+0x52/0xa0
[<ffffffff81be89c2>] system_call_fastpath+0x16/0x1b

-> #0 (cgroup_mutex){+.+.+.}:
[<ffffffff810f74de>] __lock_acquire+0x14ce/0x1d20
[<ffffffff810f8357>] lock_acquire+0x97/0x1e0
[<ffffffff81be4701>] mutex_lock_nested+0x61/0x3b0
[<ffffffff8110c5b7>] cgroup_lock+0x17/0x20
[<ffffffff81116deb>] cpuset_handle_hotplug+0x1b/0x560
[<ffffffff8111744e>] cpuset_update_active_cpus+0xe/0x10
[<ffffffff810d0587>] cpuset_cpu_inactive+0x47/0x50
[<ffffffff810c1476>] notifier_call_chain+0x66/0x150
[<ffffffff810c156e>] __raw_notifier_call_chain+0xe/0x10
[<ffffffff81092fa0>] __cpu_notify+0x20/0x40
[<ffffffff81b9827e>] _cpu_down+0x7e/0x2f0
[<ffffffff81b98526>] cpu_down+0x36/0x50
[<ffffffff81b9c12d>] store_online+0x5d/0xe0
[<ffffffff816b6ef8>] dev_attr_store+0x18/0x30
[<ffffffff8123bb50>] sysfs_write_file+0xe0/0x150
[<ffffffff811c3b78>] vfs_write+0xa8/0x160
[<ffffffff811c3e82>] sys_write+0x52/0xa0
[<ffffffff81be89c2>] system_call_fastpath+0x16/0x1b

other info that might help us debug this:

Possible unsafe locking scenario:

CPU0 CPU1
---- ----
lock(cpu_hotplug.lock);
lock(cgroup_mutex);
lock(cpu_hotplug.lock);
lock(cgroup_mutex);

*** DEADLOCK ***

5 locks held by bash/645:
#0: (&buffer->mutex){+.+.+.}, at: [<ffffffff8123bab8>] sysfs_write_file+0x48/0x150
#1: (s_active#42){.+.+.+}, at: [<ffffffff8123bb38>] sysfs_write_file+0xc8/0x150
#2: (x86_cpu_hotplug_driver_mutex){+.+...}, at: [<ffffffff81079277>] cpu_hotplug_driver_lock+0x17/0x20
#3: (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff81093157>] cpu_maps_update_begin+0x17/0x20
#4: (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff8109300f>] cpu_hotplug_begin+0x2f/0x60

stack backtrace:
Pid: 645, comm: bash Not tainted 3.7.0-rc4-work+ #42
Call Trace:
[<ffffffff81bdadfd>] print_circular_bug+0x28e/0x29f
[<ffffffff810f74de>] __lock_acquire+0x14ce/0x1d20
[<ffffffff810f8357>] lock_acquire+0x97/0x1e0
[<ffffffff81be4701>] mutex_lock_nested+0x61/0x3b0
[<ffffffff8110c5b7>] cgroup_lock+0x17/0x20
[<ffffffff81116deb>] cpuset_handle_hotplug+0x1b/0x560
[<ffffffff8111744e>] cpuset_update_active_cpus+0xe/0x10
[<ffffffff810d0587>] cpuset_cpu_inactive+0x47/0x50
[<ffffffff810c1476>] notifier_call_chain+0x66/0x150
[<ffffffff810c156e>] __raw_notifier_call_chain+0xe/0x10
[<ffffffff81092fa0>] __cpu_notify+0x20/0x40
[<ffffffff81b9827e>] _cpu_down+0x7e/0x2f0
[<ffffffff81b98526>] cpu_down+0x36/0x50
[<ffffffff81b9c12d>] store_online+0x5d/0xe0
[<ffffffff816b6ef8>] dev_attr_store+0x18/0x30
[<ffffffff8123bb50>] sysfs_write_file+0xe0/0x150
[<ffffffff811c3b78>] vfs_write+0xa8/0x160
[<ffffffff811c3e82>] sys_write+0x52/0xa0
[<ffffffff81be89c2>] system_call_fastpath+0x16/0x1b

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/cpuset.c | 28 ++++++++++++++++++++++++----
1 file changed, 24 insertions(+), 4 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 4ab3e4c..b530fba 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -252,6 +252,13 @@ static char cpuset_nodelist[CPUSET_NODELIST_LEN];
static DEFINE_SPINLOCK(cpuset_buffer_lock);

/*
+ * CPU / memory hotplug is handled asynchronously.
+ */
+static void cpuset_hotplug_workfn(struct work_struct *work);
+
+static DECLARE_WORK(cpuset_hotplug_work, cpuset_hotplug_workfn);
+
+/*
* This is ugly, but preserves the userspace API for existing cpuset
* users. If someone tries to mount the "cpuset" filesystem, we
* silently switch it to mount "cgroup" instead
@@ -1518,6 +1525,19 @@ static int cpuset_write_resmask(struct cgroup *cgrp, struct cftype *cft,
struct cpuset *cs = cgroup_cs(cgrp);
struct cpuset *trialcs;

+ /*
+ * CPU or memory hotunplug may leave @cs w/o any execution
+ * resources, in which case the hotplug code asynchronously updates
+ * configuration and transfers all tasks to the nearest ancestor
+ * which can execute.
+ *
+ * As writes to "cpus" or "mems" may restore @cs's execution
+ * resources, wait for the previously scheduled operations before
+ * proceeding, so that we don't end up keep removing tasks added
+ * after execution capability is restored.
+ */
+ flush_work(&cpuset_hotplug_work);
+
if (!cgroup_lock_live_group(cgrp))
return -ENODEV;

@@ -2045,7 +2065,7 @@ static void cpuset_propagate_hotplug(struct cpuset *cs)
}

/**
- * cpuset_handle_hotplug - handle CPU/memory hot[un]plug
+ * cpuset_hotplug_workfn - handle CPU/memory hotunplug for a cpuset
*
* This function is called after either CPU or memory configuration has
* changed and updates cpuset accordingly. The top_cpuset is always
@@ -2060,7 +2080,7 @@ static void cpuset_propagate_hotplug(struct cpuset *cs)
* Note that CPU offlining during suspend is ignored. We don't modify
* cpusets across suspend/resume cycles at all.
*/
-static void cpuset_handle_hotplug(void)
+static void cpuset_hotplug_workfn(struct work_struct *work)
{
static cpumask_t new_cpus, tmp_cpus;
static nodemask_t new_mems, tmp_mems;
@@ -2127,7 +2147,7 @@ static void cpuset_handle_hotplug(void)

void cpuset_update_active_cpus(bool cpu_online)
{
- cpuset_handle_hotplug();
+ schedule_work(&cpuset_hotplug_work);
}

#ifdef CONFIG_MEMORY_HOTPLUG
@@ -2139,7 +2159,7 @@ void cpuset_update_active_cpus(bool cpu_online)
static int cpuset_track_online_nodes(struct notifier_block *self,
unsigned long action, void *arg)
{
- cpuset_handle_hotplug();
+ schedule_work(&cpuset_hotplug_work);
return NOTIFY_OK;
}
#endif
--
1.7.11.7

2012-11-28 21:34:43

by Tejun Heo

Subject: [PATCH 05/13] cpuset: introduce cpuset_for_each_child()

Instead of iterating cgroup->children directly, introduce and use
cpuset_for_each_child(), which wraps cgroup_for_each_child() and
performs an online check. As it uses the generic iterator, it requires
RCU read locking too.

As cpuset is currently protected by cgroup_mutex, non-online cpusets
aren't visible to any of the iterations, so this patch doesn't make
any functional difference yet. This will be used to decouple cpuset
locking from cgroup core.
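
For illustration, a typical caller looks like the sketch below;
count_online_children() is a made-up helper, shown only to demonstrate
the RCU requirement:

static int count_online_children(struct cpuset *parent_cs)
{
        struct cpuset *child;
        struct cgroup *pos_cgrp;
        int n = 0;

        /* walks the cgroup children under RCU, skipping cpusets
         * which haven't come online yet */
        rcu_read_lock();
        cpuset_for_each_child(child, pos_cgrp, parent_cs)
                n++;
        rcu_read_unlock();

        return n;
}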

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/cpuset.c | 85 ++++++++++++++++++++++++++++++++++++---------------------
1 file changed, 54 insertions(+), 31 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 22b521d..5a52ed6 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -200,6 +200,19 @@ static struct cpuset top_cpuset = {
(1 << CS_MEM_EXCLUSIVE)),
};

+/**
+ * cpuset_for_each_child - traverse online children of a cpuset
+ * @child_cs: loop cursor pointing to the current child
+ * @pos_cgrp: used for iteration
+ * @parent_cs: target cpuset to walk children of
+ *
+ * Walk @child_cs through the online children of @parent_cs. Must be used
+ * with RCU read locked.
+ */
+#define cpuset_for_each_child(child_cs, pos_cgrp, parent_cs) \
+ cgroup_for_each_child((pos_cgrp), (parent_cs)->css.cgroup) \
+ if (is_cpuset_online(((child_cs) = cgroup_cs((pos_cgrp)))))
+
/*
* There are two global mutexes guarding cpuset structures. The first
* is the main control groups cgroup_mutex, accessed via
@@ -419,48 +432,55 @@ static int validate_change(const struct cpuset *cur, const struct cpuset *trial)
{
struct cgroup *cont;
struct cpuset *c, *par;
+ int ret;
+
+ rcu_read_lock();

/* Each of our child cpusets must be a subset of us */
- list_for_each_entry(cont, &cur->css.cgroup->children, sibling) {
- if (!is_cpuset_subset(cgroup_cs(cont), trial))
- return -EBUSY;
- }
+ ret = -EBUSY;
+ cpuset_for_each_child(c, cont, cur)
+ if (!is_cpuset_subset(c, trial))
+ goto out;

/* Remaining checks don't apply to root cpuset */
+ ret = 0;
if (cur == &top_cpuset)
- return 0;
+ goto out;

par = cur->parent;

/* We must be a subset of our parent cpuset */
+ ret = -EACCES;
if (!is_cpuset_subset(trial, par))
- return -EACCES;
+ goto out;

/*
* If either I or some sibling (!= me) is exclusive, we can't
* overlap
*/
- list_for_each_entry(cont, &par->css.cgroup->children, sibling) {
- c = cgroup_cs(cont);
+ ret = -EINVAL;
+ cpuset_for_each_child(c, cont, par) {
if ((is_cpu_exclusive(trial) || is_cpu_exclusive(c)) &&
c != cur &&
cpumask_intersects(trial->cpus_allowed, c->cpus_allowed))
- return -EINVAL;
+ goto out;
if ((is_mem_exclusive(trial) || is_mem_exclusive(c)) &&
c != cur &&
nodes_intersects(trial->mems_allowed, c->mems_allowed))
- return -EINVAL;
+ goto out;
}

/* Cpusets with tasks can't have empty cpus_allowed or mems_allowed */
- if (cgroup_task_count(cur->css.cgroup)) {
- if (cpumask_empty(trial->cpus_allowed) ||
- nodes_empty(trial->mems_allowed)) {
- return -ENOSPC;
- }
- }
+ ret = -ENOSPC;
+ if (cgroup_task_count(cur->css.cgroup) &&
+ (cpumask_empty(trial->cpus_allowed) ||
+ nodes_empty(trial->mems_allowed)))
+ goto out;

- return 0;
+ ret = 0;
+out:
+ rcu_read_unlock();
+ return ret;
}

#ifdef CONFIG_SMP
@@ -501,10 +521,10 @@ update_domain_attr_tree(struct sched_domain_attr *dattr, struct cpuset *c)
if (is_sched_load_balance(cp))
update_domain_attr(dattr, cp);

- list_for_each_entry(cont, &cp->css.cgroup->children, sibling) {
- child = cgroup_cs(cont);
+ rcu_read_lock();
+ cpuset_for_each_child(child, cont, cp)
list_add_tail(&child->stack_list, &q);
- }
+ rcu_read_unlock();
}
}

@@ -623,10 +643,10 @@ static int generate_sched_domains(cpumask_var_t **domains,
continue;
}

- list_for_each_entry(cont, &cp->css.cgroup->children, sibling) {
- child = cgroup_cs(cont);
+ rcu_read_lock();
+ cpuset_for_each_child(child, cont, cp)
list_add_tail(&child->stack_list, &q);
- }
+ rcu_read_unlock();
}

for (i = 0; i < csn; i++)
@@ -1824,7 +1844,8 @@ static int cpuset_css_online(struct cgroup *cgrp)
{
struct cpuset *cs = cgroup_cs(cgrp);
struct cpuset *parent = cs->parent;
- struct cgroup *tmp_cg;
+ struct cpuset *tmp_cs;
+ struct cgroup *pos_cg;

if (!parent)
return 0;
@@ -1853,12 +1874,14 @@ static int cpuset_css_online(struct cgroup *cgrp)
* changed to grant parent->cpus_allowed-sibling_cpus_exclusive
* (and likewise for mems) to the new cgroup.
*/
- list_for_each_entry(tmp_cg, &cgrp->parent->children, sibling) {
- struct cpuset *tmp_cs = cgroup_cs(tmp_cg);
-
- if (is_mem_exclusive(tmp_cs) || is_cpu_exclusive(tmp_cs))
+ rcu_read_lock();
+ cpuset_for_each_child(tmp_cs, pos_cg, parent) {
+ if (is_mem_exclusive(tmp_cs) || is_cpu_exclusive(tmp_cs)) {
+ rcu_read_unlock();
return 0;
+ }
}
+ rcu_read_unlock();

mutex_lock(&callback_mutex);
cs->mems_allowed = parent->mems_allowed;
@@ -2027,10 +2050,10 @@ static struct cpuset *cpuset_next(struct list_head *queue)

cp = list_first_entry(queue, struct cpuset, stack_list);
list_del(queue->next);
- list_for_each_entry(cont, &cp->css.cgroup->children, sibling) {
- child = cgroup_cs(cont);
+ rcu_read_lock();
+ cpuset_for_each_child(child, cont, cp)
list_add_tail(&child->stack_list, queue);
- }
+ rcu_read_unlock();

return cp;
}
--
1.7.11.7

2012-11-28 21:35:00

by Tejun Heo

Subject: [PATCH 11/13] cpuset: pin down cpus and mems while a task is being attached

cpuset is scheduled to be decoupled from cgroup_lock, which will make
configuration updates race with task migration. Any config update
will be allowed to happen between ->can_attach() and ->attach(). If
such a config update removes either all cpus or all mems, then by the
time ->attach() is called, the condition verified by ->can_attach() -
that the cpuset is capable of hosting the tasks - is no longer true.

This patch adds cpuset->attach_in_progress, which is incremented from
->can_attach() and decremented when the attach operation finishes,
whether successfully or not. validate_change() treats cpusets w/
non-zero ->attach_in_progress like cpusets w/ tasks and refuses to
remove all cpus or mems from them.

This currently doesn't make any functional difference as everything is
protected by cgroup_mutex but enables decoupling the locking.
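
In outline, the counting protocol looks like this (comments mark
which callback each piece lives in; see the diff below for the real
code):

        /* ->can_attach(), after all checks have passed */
        cs->attach_in_progress++;

        /* ->cancel_attach() and the tail of ->attach() */
        cs->attach_in_progress--;

        /* validate_change(): in-flight attaches count like resident tasks */
        ret = -ENOSPC;
        if ((cgroup_task_count(cur->css.cgroup) || cur->attach_in_progress) &&
            (cpumask_empty(trial->cpus_allowed) ||
             nodes_empty(trial->mems_allowed)))
                goto out;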

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/cpuset.c | 28 ++++++++++++++++++++++++++--
1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 3558250..68a0906 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -91,6 +91,12 @@ struct cpuset {

struct fmeter fmeter; /* memory_pressure filter */

+ /*
+ * Tasks are being attached to this cpuset. Used to prevent
+ * zeroing cpus/mems_allowed between ->can_attach() and ->attach().
+ */
+ int attach_in_progress;
+
/* partition number for rebuild_sched_domains() */
int pn;

@@ -468,9 +474,12 @@ static int validate_change(const struct cpuset *cur, const struct cpuset *trial)
goto out;
}

- /* Cpusets with tasks can't have empty cpus_allowed or mems_allowed */
+ /*
+ * Cpusets with tasks - existing or newly being attached - can't
+ * have empty cpus_allowed or mems_allowed.
+ */
ret = -ENOSPC;
- if (cgroup_task_count(cur->css.cgroup) &&
+ if ((cgroup_task_count(cur->css.cgroup) || cur->attach_in_progress) &&
(cpumask_empty(trial->cpus_allowed) ||
nodes_empty(trial->mems_allowed)))
goto out;
@@ -1386,9 +1395,21 @@ static int cpuset_can_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
return ret;
}

+ /*
+ * Mark attach is in progress. This makes validate_change() fail
+ * changes which zero cpus/mems_allowed.
+ */
+ cs->attach_in_progress++;
+
return 0;
}

+static void cpuset_cancel_attach(struct cgroup *cgrp,
+ struct cgroup_taskset *tset)
+{
+ cgroup_cs(cgrp)->attach_in_progress--;
+}
+
static void cpuset_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
{
/* static bufs protected by cgroup_mutex */
@@ -1435,6 +1456,8 @@ static void cpuset_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
&cpuset_attach_nodemask_to);
mmput(mm);
}
+
+ cs->attach_in_progress--;
}

/* The various types of files and directories in a cpuset file system */
@@ -1902,6 +1925,7 @@ struct cgroup_subsys cpuset_subsys = {
.css_offline = cpuset_css_offline,
.css_free = cpuset_css_free,
.can_attach = cpuset_can_attach,
+ .cancel_attach = cpuset_cancel_attach,
.attach = cpuset_attach,
.subsys_id = cpuset_subsys_id,
.base_cftypes = files,
--
1.7.11.7

2012-11-28 21:35:07

by Tejun Heo

Subject: [PATCH 13/13] cpuset: replace cgroup_mutex locking with cpuset internal locking

Supposedly for historical reasons, cpuset depends on cgroup core for
locking. It relies on cgroup_mutex being held in cgroup callbacks and
grabs cgroup_mutex from other places where it wants to be
synchronized. This is messy and highly prone to introducing circular
locking dependencies, especially because cgroup_mutex is supposed to
be one of the outermost locks.

As previous patches already plugged the possible races which
decoupling from cgroup_mutex would open up, replacing cgroup_mutex
with the cpuset-specific cpuset_mutex is mostly straightforward.
Introduce cpuset_mutex, replace all occurrences of cgroup_mutex with
it, and add cpuset_mutex locking to places which inherited
cgroup_mutex from cgroup core.

The only complication is cpuset wanting to initiate task migration
when a cpuset loses all cpus or memory nodes. Task migration may go
through the full cgroup and subsystem locking and should be initiated
without holding any cpuset-specific lock; however, a previous patch
already made hotplug handling asynchronous, and moving the task
migration part outside other locks is easy.
cpuset_propagate_hotplug_workfn() now invokes
remove_tasks_in_empty_cpuset() without holding any lock.
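
The resulting lock hierarchy, condensed from the diff below:
cpuset_mutex replaces cgroup_mutex as cpuset's outer lock,
callback_mutex still nests inside it, and internal helpers assert the
new mutex instead of cgroup_lock_is_held(). A sketch:

static DEFINE_MUTEX(cpuset_mutex);      /* outer - replaces cgroup_mutex uses */
static DEFINE_MUTEX(callback_mutex);    /* inner - guards mask updates */

static void rebuild_sched_domains_locked(void)
{
        lockdep_assert_held(&cpuset_mutex);
        get_online_cpus();
        /* generate domain masks/attrs and repartition sched domains */
        put_online_cpus();
}

void rebuild_sched_domains(void)
{
        mutex_lock(&cpuset_mutex);
        rebuild_sched_domains_locked();
        mutex_unlock(&cpuset_mutex);
}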

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/cpuset.c | 186 ++++++++++++++++++++++++++++++++------------------------
1 file changed, 106 insertions(+), 80 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 79be3f0..2ee0e03 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -208,23 +208,20 @@ static struct cpuset top_cpuset = {
if (is_cpuset_online(((child_cs) = cgroup_cs((pos_cgrp)))))

/*
- * There are two global mutexes guarding cpuset structures. The first
- * is the main control groups cgroup_mutex, accessed via
- * cgroup_lock()/cgroup_unlock(). The second is the cpuset-specific
- * callback_mutex, below. They can nest. It is ok to first take
- * cgroup_mutex, then nest callback_mutex. We also require taking
- * task_lock() when dereferencing a task's cpuset pointer. See "The
- * task_lock() exception", at the end of this comment.
- *
- * A task must hold both mutexes to modify cpusets. If a task
- * holds cgroup_mutex, then it blocks others wanting that mutex,
- * ensuring that it is the only task able to also acquire callback_mutex
- * and be able to modify cpusets. It can perform various checks on
- * the cpuset structure first, knowing nothing will change. It can
- * also allocate memory while just holding cgroup_mutex. While it is
- * performing these checks, various callback routines can briefly
- * acquire callback_mutex to query cpusets. Once it is ready to make
- * the changes, it takes callback_mutex, blocking everyone else.
+ * There are two global mutexes guarding cpuset structures - cpuset_mutex
+ * and callback_mutex. The latter may nest inside the former. We also
+ * require taking task_lock() when dereferencing a task's cpuset pointer.
+ * See "The task_lock() exception", at the end of this comment.
+ *
+ * A task must hold both mutexes to modify cpusets. If a task holds
+ * cpuset_mutex, then it blocks others wanting that mutex, ensuring that it
+ * is the only task able to also acquire callback_mutex and be able to
+ * modify cpusets. It can perform various checks on the cpuset structure
+ * first, knowing nothing will change. It can also allocate memory while
+ * just holding cpuset_mutex. While it is performing these checks, various
+ * callback routines can briefly acquire callback_mutex to query cpusets.
+ * Once it is ready to make the changes, it takes callback_mutex, blocking
+ * everyone else.
*
* Calls to the kernel memory allocator can not be made while holding
* callback_mutex, as that would risk double tripping on callback_mutex
@@ -246,6 +243,7 @@ static struct cpuset top_cpuset = {
* guidelines for accessing subsystem state in kernel/cgroup.c
*/

+static DEFINE_MUTEX(cpuset_mutex);
static DEFINE_MUTEX(callback_mutex);

/*
@@ -351,7 +349,7 @@ static void guarantee_online_mems(const struct cpuset *cs, nodemask_t *pmask)
/*
* update task's spread flag if cpuset's page/slab spread flag is set
*
- * Called with callback_mutex/cgroup_mutex held
+ * Called with callback_mutex/cpuset_mutex held
*/
static void cpuset_update_task_spread_flag(struct cpuset *cs,
struct task_struct *tsk)
@@ -371,7 +369,7 @@ static void cpuset_update_task_spread_flag(struct cpuset *cs,
*
* One cpuset is a subset of another if all its allowed CPUs and
* Memory Nodes are a subset of the other, and its exclusive flags
- * are only set if the other's are set. Call holding cgroup_mutex.
+ * are only set if the other's are set. Call holding cpuset_mutex.
*/

static int is_cpuset_subset(const struct cpuset *p, const struct cpuset *q)
@@ -420,7 +418,7 @@ static void free_trial_cpuset(struct cpuset *trial)
* If we replaced the flag and mask values of the current cpuset
* (cur) with those values in the trial cpuset (trial), would
* our various subset and exclusive rules still be valid? Presumes
- * cgroup_mutex held.
+ * cpuset_mutex held.
*
* 'cur' is the address of an actual, in-use cpuset. Operations
* such as list traversal that depend on the actual address of the
@@ -555,7 +553,7 @@ update_domain_attr_tree(struct sched_domain_attr *dattr, struct cpuset *c)
* domains when operating in the severe memory shortage situations
* that could cause allocation failures below.
*
- * Must be called with cgroup_lock held.
+ * Must be called with cpuset_mutex held.
*
* The three key local variables below are:
* q - a linked-list queue of cpuset pointers, used to implement a
@@ -766,7 +764,7 @@ done:
* 'cpus' is removed, then call this routine to rebuild the
* scheduler's dynamic sched domains.
*
- * Call with cgroup_mutex held. Takes get_online_cpus().
+ * Call with cpuset_mutex held. Takes get_online_cpus().
*/
static void rebuild_sched_domains_locked(void)
{
@@ -774,7 +772,7 @@ static void rebuild_sched_domains_locked(void)
cpumask_var_t *doms;
int ndoms;

- WARN_ON_ONCE(!cgroup_lock_is_held());
+ lockdep_assert_held(&cpuset_mutex);
get_online_cpus();

/* Generate domain masks and attrs */
@@ -800,9 +798,9 @@ static int generate_sched_domains(cpumask_var_t **domains,

void rebuild_sched_domains(void)
{
- cgroup_lock();
+ mutex_lock(&cpuset_mutex);
rebuild_sched_domains_locked();
- cgroup_unlock();
+ mutex_unlock(&cpuset_mutex);
}

/**
@@ -810,7 +808,7 @@ void rebuild_sched_domains(void)
* @tsk: task to test
* @scan: struct cgroup_scanner contained in its struct cpuset_hotplug_scanner
*
- * Call with cgroup_mutex held. May take callback_mutex during call.
+ * Call with cpuset_mutex held. May take callback_mutex during call.
* Called for each task in a cgroup by cgroup_scan_tasks().
* Return nonzero if this tasks's cpus_allowed mask should be changed (in other
* words, if its mask is not equal to its cpuset's mask).
@@ -831,7 +829,7 @@ static int cpuset_test_cpumask(struct task_struct *tsk,
* cpus_allowed mask needs to be changed.
*
* We don't need to re-check for the cgroup/cpuset membership, since we're
- * holding cgroup_lock() at this point.
+ * holding cpuset_mutex at this point.
*/
static void cpuset_change_cpumask(struct task_struct *tsk,
struct cgroup_scanner *scan)
@@ -844,7 +842,7 @@ static void cpuset_change_cpumask(struct task_struct *tsk,
* @cs: the cpuset in which each task's cpus_allowed mask needs to be changed
* @heap: if NULL, defer allocating heap memory to cgroup_scan_tasks()
*
- * Called with cgroup_mutex held
+ * Called with cpuset_mutex held
*
* The cgroup_scan_tasks() function will scan all the tasks in a cgroup,
* calling callback functions for each.
@@ -934,7 +932,7 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
* Temporarilly set tasks mems_allowed to target nodes of migration,
* so that the migration code can allocate pages on these nodes.
*
- * Call holding cgroup_mutex, so current's cpuset won't change
+ * Call holding cpuset_mutex, so current's cpuset won't change
* during this call, as manage_mutex holds off any cpuset_attach()
* calls. Therefore we don't need to take task_lock around the
* call to guarantee_online_mems(), as we know no one is changing
@@ -1009,7 +1007,7 @@ static void cpuset_change_task_nodemask(struct task_struct *tsk,
/*
* Update task's mems_allowed and rebind its mempolicy and vmas' mempolicy
* of it to cpuset's new mems_allowed, and migrate pages to new nodes if
- * memory_migrate flag is set. Called with cgroup_mutex held.
+ * memory_migrate flag is set. Called with cpuset_mutex held.
*/
static void cpuset_change_nodemask(struct task_struct *p,
struct cgroup_scanner *scan)
@@ -1018,7 +1016,7 @@ static void cpuset_change_nodemask(struct task_struct *p,
struct cpuset *cs;
int migrate;
const nodemask_t *oldmem = scan->data;
- static nodemask_t newmems; /* protected by cgroup_mutex */
+ static nodemask_t newmems; /* protected by cpuset_mutex */

cs = cgroup_cs(scan->cg);
guarantee_online_mems(cs, &newmems);
@@ -1045,7 +1043,7 @@ static void *cpuset_being_rebound;
* @oldmem: old mems_allowed of cpuset cs
* @heap: if NULL, defer allocating heap memory to cgroup_scan_tasks()
*
- * Called with cgroup_mutex held
+ * Called with cpuset_mutex held
* No return value. It's guaranteed that cgroup_scan_tasks() always returns 0
* if @heap != NULL.
*/
@@ -1067,7 +1065,7 @@ static void update_tasks_nodemask(struct cpuset *cs, const nodemask_t *oldmem,
* take while holding tasklist_lock. Forks can happen - the
* mpol_dup() cpuset_being_rebound check will catch such forks,
* and rebind their vma mempolicies too. Because we still hold
- * the global cgroup_mutex, we know that no other rebind effort
+ * the global cpuset_mutex, we know that no other rebind effort
* will be contending for the global variable cpuset_being_rebound.
* It's ok if we rebind the same mm twice; mpol_rebind_mm()
* is idempotent. Also migrate pages in each mm to new nodes.
@@ -1086,7 +1084,7 @@ static void update_tasks_nodemask(struct cpuset *cs, const nodemask_t *oldmem,
* mempolicies and if the cpuset is marked 'memory_migrate',
* migrate the tasks pages to the new memory.
*
- * Call with cgroup_mutex held. May take callback_mutex during call.
+ * Call with cpuset_mutex held. May take callback_mutex during call.
* Will take tasklist_lock, scan tasklist for tasks in cpuset cs,
* lock each such tasks mm->mmap_sem, scan its vma's and rebind
* their mempolicies to the cpusets new mems_allowed.
@@ -1184,7 +1182,7 @@ static int update_relax_domain_level(struct cpuset *cs, s64 val)
* Called by cgroup_scan_tasks() for each task in a cgroup.
*
* We don't need to re-check for the cgroup/cpuset membership, since we're
- * holding cgroup_lock() at this point.
+ * holding cpuset_mutex at this point.
*/
static void cpuset_change_flag(struct task_struct *tsk,
struct cgroup_scanner *scan)
@@ -1197,7 +1195,7 @@ static void cpuset_change_flag(struct task_struct *tsk,
* @cs: the cpuset in which each task's spread flags needs to be changed
* @heap: if NULL, defer allocating heap memory to cgroup_scan_tasks()
*
- * Called with cgroup_mutex held
+ * Called with cpuset_mutex held
*
* The cgroup_scan_tasks() function will scan all the tasks in a cgroup,
* calling callback functions for each.
@@ -1222,7 +1220,7 @@ static void update_tasks_flags(struct cpuset *cs, struct ptr_heap *heap)
* cs: the cpuset to update
* turning_on: whether the flag is being set or cleared
*
- * Call with cgroup_mutex held.
+ * Call with cpuset_mutex held.
*/

static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
@@ -1370,15 +1368,18 @@ static int fmeter_getrate(struct fmeter *fmp)
return val;
}

-/* Called by cgroups to determine if a cpuset is usable; cgroup_mutex held */
+/* Called by cgroups to determine if a cpuset is usable; cpuset_mutex held */
static int cpuset_can_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
{
struct cpuset *cs = cgroup_cs(cgrp);
struct task_struct *task;
int ret;

+ mutex_lock(&cpuset_mutex);
+
+ ret = -ENOSPC;
if (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
- return -ENOSPC;
+ goto out_unlock;

cgroup_taskset_for_each(task, cgrp, tset) {
/*
@@ -1390,10 +1391,12 @@ static int cpuset_can_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
* set_cpus_allowed_ptr() on all attached tasks before
* cpus_allowed may be changed.
*/
+ ret = -EINVAL;
if (task->flags & PF_THREAD_BOUND)
- return -EINVAL;
- if ((ret = security_task_setscheduler(task)))
- return ret;
+ goto out_unlock;
+ ret = security_task_setscheduler(task);
+ if (ret)
+ goto out_unlock;
}

/*
@@ -1401,19 +1404,23 @@ static int cpuset_can_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
* changes which zero cpus/mems_allowed.
*/
cs->attach_in_progress++;
-
- return 0;
+ ret = 0;
+out_unlock:
+ mutex_unlock(&cpuset_mutex);
+ return ret;
}

static void cpuset_cancel_attach(struct cgroup *cgrp,
struct cgroup_taskset *tset)
{
+ mutex_lock(&cpuset_mutex);
cgroup_cs(cgrp)->attach_in_progress--;
+ mutex_unlock(&cpuset_mutex);
}

static void cpuset_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
{
- /* static bufs protected by cgroup_mutex */
+ /* static bufs protected by cpuset_mutex */
static cpumask_t cpus_attach;
static nodemask_t cpuset_attach_nodemask_from;
static nodemask_t cpuset_attach_nodemask_to;
@@ -1424,6 +1431,8 @@ static void cpuset_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
struct cpuset *cs = cgroup_cs(cgrp);
struct cpuset *oldcs = cgroup_cs(oldcgrp);

+ mutex_lock(&cpuset_mutex);
+
/* prepare for attach */
if (cs == &top_cpuset)
cpumask_copy(&cpus_attach, cpu_possible_mask);
@@ -1467,6 +1476,8 @@ static void cpuset_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
*/
if (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
schedule_cpuset_propagate_hotplug(cs);
+
+ mutex_unlock(&cpuset_mutex);
}

/* The various types of files and directories in a cpuset file system */
@@ -1488,12 +1499,13 @@ typedef enum {

static int cpuset_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val)
{
- int retval = 0;
struct cpuset *cs = cgroup_cs(cgrp);
cpuset_filetype_t type = cft->private;
+ int retval = -ENODEV;

- if (!cgroup_lock_live_group(cgrp))
- return -ENODEV;
+ mutex_lock(&cpuset_mutex);
+ if (!is_cpuset_online(cs))
+ goto out_unlock;

switch (type) {
case FILE_CPU_EXCLUSIVE:
@@ -1527,18 +1539,20 @@ static int cpuset_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val)
retval = -EINVAL;
break;
}
- cgroup_unlock();
+out_unlock:
+ mutex_unlock(&cpuset_mutex);
return retval;
}

static int cpuset_write_s64(struct cgroup *cgrp, struct cftype *cft, s64 val)
{
- int retval = 0;
struct cpuset *cs = cgroup_cs(cgrp);
cpuset_filetype_t type = cft->private;
+ int retval = -ENODEV;

- if (!cgroup_lock_live_group(cgrp))
- return -ENODEV;
+ mutex_lock(&cpuset_mutex);
+ if (!is_cpuset_online(cs))
+ goto out_unlock;

switch (type) {
case FILE_SCHED_RELAX_DOMAIN_LEVEL:
@@ -1548,7 +1562,8 @@ static int cpuset_write_s64(struct cgroup *cgrp, struct cftype *cft, s64 val)
retval = -EINVAL;
break;
}
- cgroup_unlock();
+out_unlock:
+ mutex_unlock(&cpuset_mutex);
return retval;
}

@@ -1558,9 +1573,9 @@ static int cpuset_write_s64(struct cgroup *cgrp, struct cftype *cft, s64 val)
static int cpuset_write_resmask(struct cgroup *cgrp, struct cftype *cft,
const char *buf)
{
- int retval = 0;
struct cpuset *cs = cgroup_cs(cgrp);
struct cpuset *trialcs;
+ int retval = -ENODEV;

/*
* CPU or memory hotunplug may leave @cs w/o any execution
@@ -1580,13 +1595,14 @@ static int cpuset_write_resmask(struct cgroup *cgrp, struct cftype *cft,
flush_work(&cpuset_hotplug_work);
flush_workqueue(cpuset_propagate_hotplug_wq);

- if (!cgroup_lock_live_group(cgrp))
- return -ENODEV;
+ mutex_lock(&cpuset_mutex);
+ if (!is_cpuset_online(cs))
+ goto out_unlock;

trialcs = alloc_trial_cpuset(cs);
if (!trialcs) {
retval = -ENOMEM;
- goto out;
+ goto out_unlock;
}

switch (cft->private) {
@@ -1602,8 +1618,8 @@ static int cpuset_write_resmask(struct cgroup *cgrp, struct cftype *cft,
}

free_trial_cpuset(trialcs);
-out:
- cgroup_unlock();
+out_unlock:
+ mutex_unlock(&cpuset_mutex);
return retval;
}

@@ -1861,6 +1877,8 @@ static int cpuset_css_online(struct cgroup *cgrp)
if (!parent)
return 0;

+ mutex_lock(&cpuset_mutex);
+
set_bit(CS_ONLINE, &cs->flags);
if (is_spread_page(parent))
set_bit(CS_SPREAD_PAGE, &cs->flags);
@@ -1870,7 +1888,7 @@ static int cpuset_css_online(struct cgroup *cgrp)
number_of_cpusets++;

if (!test_bit(CGRP_CPUSET_CLONE_CHILDREN, &cgrp->flags))
- return 0;
+ goto out_unlock;

/*
* Clone @parent's configuration if CGRP_CPUSET_CLONE_CHILDREN is
@@ -1889,7 +1907,7 @@ static int cpuset_css_online(struct cgroup *cgrp)
cpuset_for_each_child(tmp_cs, pos_cg, parent) {
if (is_mem_exclusive(tmp_cs) || is_cpu_exclusive(tmp_cs)) {
rcu_read_unlock();
- return 0;
+ goto out_unlock;
}
}
rcu_read_unlock();
@@ -1898,7 +1916,8 @@ static int cpuset_css_online(struct cgroup *cgrp)
cs->mems_allowed = parent->mems_allowed;
cpumask_copy(cs->cpus_allowed, parent->cpus_allowed);
mutex_unlock(&callback_mutex);
-
+out_unlock:
+ mutex_unlock(&cpuset_mutex);
return 0;
}

@@ -1906,8 +1925,7 @@ static void cpuset_css_offline(struct cgroup *cgrp)
{
struct cpuset *cs = cgroup_cs(cgrp);

- /* css_offline is called w/o cgroup_mutex, grab it */
- cgroup_lock();
+ mutex_lock(&cpuset_mutex);

if (is_sched_load_balance(cs))
update_flag(CS_SCHED_LOAD_BALANCE, cs, 0);
@@ -1915,7 +1933,7 @@ static void cpuset_css_offline(struct cgroup *cgrp)
number_of_cpusets--;
clear_bit(CS_ONLINE, &cs->flags);

- cgroup_unlock();
+ mutex_unlock(&cpuset_mutex);
}

/*
@@ -1987,7 +2005,9 @@ static void cpuset_do_move_task(struct task_struct *tsk,
{
struct cgroup *new_cgroup = scan->data;

+ cgroup_lock();
cgroup_attach_task(new_cgroup, tsk);
+ cgroup_unlock();
}

/**
@@ -1995,7 +2015,7 @@ static void cpuset_do_move_task(struct task_struct *tsk,
* @from: cpuset in which the tasks currently reside
* @to: cpuset to which the tasks will be moved
*
- * Called with cgroup_mutex held
+ * Called with cpuset_mutex held
* callback_mutex must not be held, as cpuset_attach() will take it.
*
* The cgroup_scan_tasks() function will scan all the tasks in a cgroup,
@@ -2022,9 +2042,6 @@ static void move_member_tasks_to_cpuset(struct cpuset *from, struct cpuset *to)
* removing that CPU or node from all cpusets. If this removes the
* last CPU or node from a cpuset, then move the tasks in the empty
* cpuset to its next-highest non-empty parent.
- *
- * Called with cgroup_mutex held
- * callback_mutex must not be held, as cpuset_attach() will take it.
*/
static void remove_tasks_in_empty_cpuset(struct cpuset *cs)
{
@@ -2080,8 +2097,9 @@ static void cpuset_propagate_hotplug_workfn(struct work_struct *work)
static cpumask_t off_cpus;
static nodemask_t off_mems, tmp_mems;
struct cpuset *cs = container_of(work, struct cpuset, hotplug_work);
+ bool is_empty;

- cgroup_lock();
+ mutex_lock(&cpuset_mutex);

cpumask_andnot(&off_cpus, cs->cpus_allowed, top_cpuset.cpus_allowed);
nodes_andnot(off_mems, cs->mems_allowed, top_cpuset.mems_allowed);
@@ -2103,10 +2121,18 @@ static void cpuset_propagate_hotplug_workfn(struct work_struct *work)
update_tasks_nodemask(cs, &tmp_mems, NULL);
}

- if (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
- remove_tasks_in_empty_cpuset(cs);
+ is_empty = cpumask_empty(cs->cpus_allowed) ||
+ nodes_empty(cs->mems_allowed);

- cgroup_unlock();
+ mutex_unlock(&cpuset_mutex);
+
+ /*
+ * If @cs became empty, move tasks to the nearest ancestor with
+ * execution resources. This is full cgroup operation which will
+ * also call back into cpuset. Should be done outside any lock.
+ */
+ if (is_empty)
+ remove_tasks_in_empty_cpuset(cs);

/* the following may free @cs, should be the last operation */
css_put(&cs->css);
@@ -2160,7 +2186,7 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
bool cpus_updated, mems_updated;
bool cpus_offlined, mems_offlined;

- cgroup_lock();
+ mutex_lock(&cpuset_mutex);

/* fetch the available cpus/mems and find out which changed how */
cpumask_copy(&new_cpus, cpu_active_mask);
@@ -2202,7 +2228,7 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
schedule_cpuset_propagate_hotplug(cs);
}

- cgroup_unlock();
+ mutex_unlock(&cpuset_mutex);

/* wait for propagations to finish */
flush_workqueue(cpuset_propagate_hotplug_wq);
@@ -2213,9 +2239,9 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
cpumask_var_t *doms;
int ndoms;

- cgroup_lock();
+ mutex_lock(&cpuset_mutex);
ndoms = generate_sched_domains(&doms, &attr);
- cgroup_unlock();
+ mutex_unlock(&cpuset_mutex);

partition_sched_domains(ndoms, doms, attr);
}
@@ -2630,7 +2656,7 @@ void __cpuset_memory_pressure_bump(void)
* - Used for /proc/<pid>/cpuset.
* - No need to task_lock(tsk) on this tsk->cpuset reference, as it
* doesn't really matter if tsk->cpuset changes after we read it,
- * and we take cgroup_mutex, keeping cpuset_attach() from changing it
+ * and we take cpuset_mutex, keeping cpuset_attach() from changing it
* anyway.
*/
static int proc_cpuset_show(struct seq_file *m, void *unused_v)
@@ -2653,7 +2679,7 @@ static int proc_cpuset_show(struct seq_file *m, void *unused_v)
goto out_free;

retval = -EINVAL;
- cgroup_lock();
+ mutex_lock(&cpuset_mutex);
css = task_subsys_state(tsk, cpuset_subsys_id);
retval = cgroup_path(css->cgroup, buf, PAGE_SIZE);
if (retval < 0)
@@ -2661,7 +2687,7 @@ static int proc_cpuset_show(struct seq_file *m, void *unused_v)
seq_puts(m, buf);
seq_putc(m, '\n');
out_unlock:
- cgroup_unlock();
+ mutex_unlock(&cpuset_mutex);
put_task_struct(tsk);
out_free:
kfree(buf);
--
1.7.11.7

2012-11-28 21:35:45

by Tejun Heo

Subject: [PATCH 12/13] cpuset: schedule hotplug propagation from cpuset_attach() if the cpuset is empty

cpuset is scheduled to be decoupled from cgroup_lock which will make
hotplug handling race with task migration. cpus or mems will be
allowed to go offline between ->can_attach() and ->attach(). If
hotplug takes down all cpus or mems of a cpuset while attach is in
progress, ->attach() may end up putting tasks into an empty cpuset.

This patch makes ->attach() schedule hotplug propagation if the
cpuset is empty after attaching is complete. This will move the tasks
to the nearest ancestor which can execute, and the end result is as if
hotplug handling had happened after the tasks finished attaching.

cpuset_write_resmask() now also flushes cpuset_propagate_hotplug_wq to
wait for propagations scheduled directly by cpuset_attach().

This currently doesn't make any functional difference as everything is
protected by cgroup_mutex but enables decoupling the locking.
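
Condensed, the two hunks below amount to the following sketch:

/* tail of cpuset_attach(): @cs may have been emptied by racing hotunplug */
if (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
        schedule_cpuset_propagate_hotplug(cs);

/* cpuset_write_resmask(): wait for both kinds of pending propagation */
flush_work(&cpuset_hotplug_work);
flush_workqueue(cpuset_propagate_hotplug_wq);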

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/cpuset.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 68a0906..79be3f0 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -266,6 +266,7 @@ static struct workqueue_struct *cpuset_propagate_hotplug_wq;

static void cpuset_hotplug_workfn(struct work_struct *work);
static void cpuset_propagate_hotplug_workfn(struct work_struct *work);
+static void schedule_cpuset_propagate_hotplug(struct cpuset *cs);

static DECLARE_WORK(cpuset_hotplug_work, cpuset_hotplug_workfn);

@@ -1458,6 +1459,14 @@ static void cpuset_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
}

cs->attach_in_progress--;
+
+ /*
+ * We may have raced with CPU/memory hotunplug. Trigger hotplug
+ * propagation if @cs doesn't have any CPU or memory. It will move
+ * the newly added tasks to the nearest parent which can execute.
+ */
+ if (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
+ schedule_cpuset_propagate_hotplug(cs);
}

/* The various types of files and directories in a cpuset file system */
@@ -1563,8 +1572,13 @@ static int cpuset_write_resmask(struct cgroup *cgrp, struct cftype *cft,
* resources, wait for the previously scheduled operations before
* proceeding, so that we don't end up keep removing tasks added
* after execution capability is restored.
+ *
+ * Flushing cpuset_hotplug_work is enough to synchronize against
+ * hotplug hanlding; however, cpuset_attach() may schedule
+ * propagation work directly. Flush the workqueue too.
*/
flush_work(&cpuset_hotplug_work);
+ flush_workqueue(cpuset_propagate_hotplug_wq);

if (!cgroup_lock_live_group(cgrp))
return -ENODEV;
--
1.7.11.7

2012-11-28 21:36:09

by Tejun Heo

Subject: [PATCH 10/13] cpuset: make CPU / memory hotplug propagation asynchronous

cpuset_hotplug_workfn() has been invoking cpuset_propagate_hotplug()
directly to propagate hotplug updates to !root cpusets; however, this
has the following problems.

* cpuset locking is scheduled to be decoupled from cgroup_mutex,
cgroup_mutex will be unexported, and cgroup_attach_task() will do
cgroup locking internally, so propagation can't synchronously move
tasks to a parent cgroup while walking the hierarchy.

* We can't use the cgroup generic tree iterator because propagation
to each cpuset may sleep. With propagation done asynchronously, we
can lose the rather ugly cpuset-specific iteration.

Convert cpuset_propagate_hotplug() to
cpuset_propagate_hotplug_workfn() and execute it from newly added
cpuset->hotplug_work. The work items are run on an ordered workqueue,
so the propagation order is preserved. cpuset_hotplug_workfn()
schedules all propagations while holding cgroup_mutex and waits for
completion without cgroup_mutex. Each in-flight propagation holds a
reference to the cpuset->css.

This patch doesn't cause any functional difference.
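
The scheduling helper, condensed from the diff below - pin the css,
queue the work item on the ordered workqueue, and drop the reference
again if the item was already pending:

static struct workqueue_struct *cpuset_propagate_hotplug_wq;

static void schedule_cpuset_propagate_hotplug(struct cpuset *cs)
{
        /* pin @cs; the ref is dropped when the work item finishes */
        if (!css_tryget(&cs->css))
                return;

        /* ordered wq: propagations run in the order they were scheduled */
        if (!queue_work(cpuset_propagate_hotplug_wq, &cs->hotplug_work))
                css_put(&cs->css);      /* already pending, drop extra ref */
}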

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/cpuset.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 48 insertions(+), 6 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index b530fba..3558250 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -99,6 +99,8 @@ struct cpuset {

/* used for walking a cpuset hierarchy */
struct list_head stack_list;
+
+ struct work_struct hotplug_work;
};

/* Retrieve the cpuset for a cgroup */
@@ -254,7 +256,10 @@ static DEFINE_SPINLOCK(cpuset_buffer_lock);
/*
* CPU / memory hotplug is handled asynchronously.
*/
+static struct workqueue_struct *cpuset_propagate_hotplug_wq;
+
static void cpuset_hotplug_workfn(struct work_struct *work);
+static void cpuset_propagate_hotplug_workfn(struct work_struct *work);

static DECLARE_WORK(cpuset_hotplug_work, cpuset_hotplug_workfn);

@@ -1802,6 +1807,7 @@ static struct cgroup_subsys_state *cpuset_css_alloc(struct cgroup *cont)
cpumask_clear(cs->cpus_allowed);
nodes_clear(cs->mems_allowed);
fmeter_init(&cs->fmeter);
+ INIT_WORK(&cs->hotplug_work, cpuset_propagate_hotplug_workfn);
cs->relax_domain_level = -1;
cs->parent = cgroup_cs(cont->parent);

@@ -2024,21 +2030,20 @@ static struct cpuset *cpuset_next(struct list_head *queue)
}

/**
- * cpuset_propagate_hotplug - propagate CPU/memory hotplug to a cpuset
+ * cpuset_propagate_hotplug_workfn - propagate CPU/memory hotplug to a cpuset
* @cs: cpuset in interest
*
* Compare @cs's cpu and mem masks against top_cpuset and if some have gone
* offline, update @cs accordingly. If @cs ends up with no CPU or memory,
* all its tasks are moved to the nearest ancestor with both resources.
- *
- * Should be called with cgroup_mutex held.
*/
-static void cpuset_propagate_hotplug(struct cpuset *cs)
+static void cpuset_propagate_hotplug_workfn(struct work_struct *work)
{
static cpumask_t off_cpus;
static nodemask_t off_mems, tmp_mems;
+ struct cpuset *cs = container_of(work, struct cpuset, hotplug_work);

- WARN_ON_ONCE(!cgroup_lock_is_held());
+ cgroup_lock();

cpumask_andnot(&off_cpus, cs->cpus_allowed, top_cpuset.cpus_allowed);
nodes_andnot(off_mems, cs->mems_allowed, top_cpuset.mems_allowed);
@@ -2062,6 +2067,36 @@ static void cpuset_propagate_hotplug(struct cpuset *cs)

if (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
remove_tasks_in_empty_cpuset(cs);
+
+ cgroup_unlock();
+
+ /* the following may free @cs, should be the last operation */
+ css_put(&cs->css);
+}
+
+/**
+ * schedule_cpuset_propagate_hotplug - schedule hotplug propagation to a cpuset
+ * @cs: cpuset of interest
+ *
+ * Schedule cpuset_propagate_hotplug_workfn() which will update CPU and
+ * memory masks according to top_cpuset.
+ */
+static void schedule_cpuset_propagate_hotplug(struct cpuset *cs)
+{
+ /*
+ * Pin @cs. The refcnt will be released when the work item
+ * finishes executing.
+ */
+ if (!css_tryget(&cs->css))
+ return;
+
+ /*
+ * Queue @cs->empty_cpuset_work. If already pending, lose the css
+ * ref. cpuset_propagate_hotplug_wq is ordered and propagation
+ * will happen in the order this function is called.
+ */
+ if (!queue_work(cpuset_propagate_hotplug_wq, &cs->hotplug_work))
+ css_put(&cs->css);
}

/**
@@ -2126,11 +2161,14 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
list_add_tail(&top_cpuset.stack_list, &queue);
while ((cs = cpuset_next(&queue)))
if (cs != &top_cpuset)
- cpuset_propagate_hotplug(cs);
+ schedule_cpuset_propagate_hotplug(cs);
}

cgroup_unlock();

+ /* wait for propagations to finish */
+ flush_workqueue(cpuset_propagate_hotplug_wq);
+
/* rebuild sched domains if cpus_allowed has changed */
if (cpus_updated) {
struct sched_domain_attr *attr;
@@ -2176,6 +2214,10 @@ void __init cpuset_init_smp(void)
top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];

hotplug_memory_notifier(cpuset_track_online_nodes, 10);
+
+ cpuset_propagate_hotplug_wq =
+ alloc_ordered_workqueue("cpuset_hotplug", 0);
+ BUG_ON(!cpuset_propagate_hotplug_wq);
}

/**
--
1.7.11.7

2012-11-28 21:36:46

by Tejun Heo

Subject: [PATCH 08/13] cpuset: reorganize CPU / memory hotplug handling

Reorganize hotplug path to prepare for async hotplug handling.

* Both CPU and memory hotplug handlings are collected into a single
function - cpuset_handle_hotplug(). It doesn't take any argument
but compares the current settings of top_cpuset against what's
actually available to determine what happened. This function
directly updates top_cpuset. If there are CPUs or memory nodes
which are taken down, cpuset_propagate_hotplug() is invoked on all
!root cpusets.

* cpuset_propagate_hotplug() is responsible for updating the specified
cpuset so that it doesn't include any resource which isn't available
to top_cpuset. If no CPU or memory is left after update, all tasks
are moved to the nearest ancestor with both resources.

* update_tasks_cpumask() and update_tasks_nodemask() are now always
called after cpus or mems masks are updated even if the cpuset
doesn't have any task. This is for brevity and not expected to have
any measurable effect.

* cpu_active_mask and N_HIGH_MEMORY are read exactly once per
cpuset_handle_hotplug() invocation, all cpusets share the same view
of what resources are available, and cpuset_handle_hotplug() can
handle multiple resources going up and down. These properties will
allow async operation.

The reorganization, while drastic, is equivalent and shouldn't cause
any behavior difference. This will enable making hotplug handling
async and removing the get_online_cpus() -> cgroup_mutex nesting.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/cpuset.c | 223 ++++++++++++++++++++++++++------------------------------
1 file changed, 105 insertions(+), 118 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 2049504..4ab3e4c 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -140,12 +140,6 @@ typedef enum {
CS_SPREAD_SLAB,
} cpuset_flagbits_t;

-/* the type of hotplug event */
-enum hotplug_event {
- CPUSET_CPU_OFFLINE,
- CPUSET_MEM_OFFLINE,
-};
-
/* convenient tests for these bits */
static inline bool is_cpuset_online(const struct cpuset *cs)
{
@@ -2009,116 +2003,131 @@ static struct cpuset *cpuset_next(struct list_head *queue)
return cp;
}

-
-/*
- * Walk the specified cpuset subtree upon a hotplug operation (CPU/Memory
- * online/offline) and update the cpusets accordingly.
- * For regular CPU/Mem hotplug, look for empty cpusets; the tasks of such
- * cpuset must be moved to a parent cpuset.
- *
- * Called with cgroup_mutex held. We take callback_mutex to modify
- * cpus_allowed and mems_allowed.
+/**
+ * cpuset_propagate_hotplug - propagate CPU/memory hotplug to a cpuset
+ * @cs: cpuset in interest
*
- * This walk processes the tree from top to bottom, completing one layer
- * before dropping down to the next. It always processes a node before
- * any of its children.
+ * Compare @cs's cpu and mem masks against top_cpuset and if some have gone
+ * offline, update @cs accordingly. If @cs ends up with no CPU or memory,
+ * all its tasks are moved to the nearest ancestor with both resources.
*
- * In the case of memory hot-unplug, it will remove nodes from N_HIGH_MEMORY
- * if all present pages from a node are offlined.
+ * Should be called with cgroup_mutex held.
*/
-static void
-scan_cpusets_upon_hotplug(struct cpuset *root, enum hotplug_event event)
+static void cpuset_propagate_hotplug(struct cpuset *cs)
{
- LIST_HEAD(queue);
- struct cpuset *cp; /* scans cpusets being updated */
- static nodemask_t oldmems; /* protected by cgroup_mutex */
-
- list_add_tail((struct list_head *)&root->stack_list, &queue);
-
- switch (event) {
- case CPUSET_CPU_OFFLINE:
- while ((cp = cpuset_next(&queue)) != NULL) {
+ static cpumask_t off_cpus;
+ static nodemask_t off_mems, tmp_mems;

- /* Continue past cpusets with all cpus online */
- if (cpumask_subset(cp->cpus_allowed, cpu_active_mask))
- continue;
-
- /* Remove offline cpus from this cpuset. */
- mutex_lock(&callback_mutex);
- cpumask_and(cp->cpus_allowed, cp->cpus_allowed,
- cpu_active_mask);
- mutex_unlock(&callback_mutex);
-
- /* Move tasks from the empty cpuset to a parent */
- if (cpumask_empty(cp->cpus_allowed))
- remove_tasks_in_empty_cpuset(cp);
- else
- update_tasks_cpumask(cp, NULL);
- }
- break;
-
- case CPUSET_MEM_OFFLINE:
- while ((cp = cpuset_next(&queue)) != NULL) {
-
- /* Continue past cpusets with all mems online */
- if (nodes_subset(cp->mems_allowed,
- node_states[N_HIGH_MEMORY]))
- continue;
+ WARN_ON_ONCE(!cgroup_lock_is_held());

- oldmems = cp->mems_allowed;
+ cpumask_andnot(&off_cpus, cs->cpus_allowed, top_cpuset.cpus_allowed);
+ nodes_andnot(off_mems, cs->mems_allowed, top_cpuset.mems_allowed);

- /* Remove offline mems from this cpuset. */
- mutex_lock(&callback_mutex);
- nodes_and(cp->mems_allowed, cp->mems_allowed,
- node_states[N_HIGH_MEMORY]);
- mutex_unlock(&callback_mutex);
+ /* remove offline cpus from @cs */
+ if (!cpumask_empty(&off_cpus)) {
+ mutex_lock(&callback_mutex);
+ cpumask_andnot(cs->cpus_allowed, cs->cpus_allowed, &off_cpus);
+ mutex_unlock(&callback_mutex);
+ update_tasks_cpumask(cs, NULL);
+ }

- /* Move tasks from the empty cpuset to a parent */
- if (nodes_empty(cp->mems_allowed))
- remove_tasks_in_empty_cpuset(cp);
- else
- update_tasks_nodemask(cp, &oldmems, NULL);
- }
+ /* remove offline mems from @cs */
+ if (!nodes_empty(off_mems)) {
+ tmp_mems = cs->mems_allowed;
+ mutex_lock(&callback_mutex);
+ nodes_andnot(cs->mems_allowed, cs->mems_allowed, off_mems);
+ mutex_unlock(&callback_mutex);
+ update_tasks_nodemask(cs, &tmp_mems, NULL);
}
+
+ if (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
+ remove_tasks_in_empty_cpuset(cs);
}

-/*
- * The top_cpuset tracks what CPUs and Memory Nodes are online,
- * period. This is necessary in order to make cpusets transparent
- * (of no affect) on systems that are actively using CPU hotplug
- * but making no active use of cpusets.
- *
- * The only exception to this is suspend/resume, where we don't
- * modify cpusets at all.
+/**
+ * cpuset_handle_hotplug - handle CPU/memory hot[un]plug
*
- * This routine ensures that top_cpuset.cpus_allowed tracks
- * cpu_active_mask on each CPU hotplug (cpuhp) event.
+ * This function is called after either CPU or memory configuration has
+ * changed and updates cpuset accordingly. The top_cpuset is always
+ * synchronized to cpu_active_mask and N_HIGH_MEMORY, which is necessary in
+ * order to make cpusets transparent (of no affect) on systems that are
+ * actively using CPU hotplug but making no active use of cpusets.
*
- * Called within get_online_cpus(). Needs to call cgroup_lock()
- * before calling generate_sched_domains().
+ * Non-root cpusets are only affected by offlining. If any CPUs or memory
+ * nodes have been taken down, cpuset_propagate_hotplug() is invoked on all
+ * descendants.
*
- * @cpu_online: Indicates whether this is a CPU online event (true) or
- * a CPU offline event (false).
+ * Note that CPU offlining during suspend is ignored. We don't modify
+ * cpusets across suspend/resume cycles at all.
*/
-void cpuset_update_active_cpus(bool cpu_online)
+static void cpuset_handle_hotplug(void)
{
- struct sched_domain_attr *attr;
- cpumask_var_t *doms;
- int ndoms;
+ static cpumask_t new_cpus, tmp_cpus;
+ static nodemask_t new_mems, tmp_mems;
+ bool cpus_updated, mems_updated;
+ bool cpus_offlined, mems_offlined;

cgroup_lock();
- mutex_lock(&callback_mutex);
- cpumask_copy(top_cpuset.cpus_allowed, cpu_active_mask);
- mutex_unlock(&callback_mutex);

- if (!cpu_online)
- scan_cpusets_upon_hotplug(&top_cpuset, CPUSET_CPU_OFFLINE);
+ /* fetch the available cpus/mems and find out which changed how */
+ cpumask_copy(&new_cpus, cpu_active_mask);
+ new_mems = node_states[N_HIGH_MEMORY];
+
+ cpus_updated = !cpumask_equal(top_cpuset.cpus_allowed, &new_cpus);
+ cpus_offlined = cpumask_andnot(&tmp_cpus, top_cpuset.cpus_allowed,
+ &new_cpus);
+
+ mems_updated = !nodes_equal(top_cpuset.mems_allowed, new_mems);
+ nodes_andnot(tmp_mems, top_cpuset.mems_allowed, new_mems);
+ mems_offlined = !nodes_empty(tmp_mems);
+
+ /* synchronize cpus_allowed to cpu_active_mask */
+ if (cpus_updated) {
+ mutex_lock(&callback_mutex);
+ cpumask_copy(top_cpuset.cpus_allowed, &new_cpus);
+ mutex_unlock(&callback_mutex);
+ /* we don't mess with cpumasks of tasks in top_cpuset */
+ }
+
+ /* synchronize mems_allowed to N_HIGH_MEMORY */
+ if (mems_updated) {
+ tmp_mems = top_cpuset.mems_allowed;
+ mutex_lock(&callback_mutex);
+ top_cpuset.mems_allowed = new_mems;
+ mutex_unlock(&callback_mutex);
+ update_tasks_nodemask(&top_cpuset, &tmp_mems, NULL);
+ }
+
+ /* if cpus or mems went down, we need to propagate to descendants */
+ if (cpus_offlined || mems_offlined) {
+ struct cpuset *cs;
+ LIST_HEAD(queue);
+
+ list_add_tail(&top_cpuset.stack_list, &queue);
+ while ((cs = cpuset_next(&queue)))
+ if (cs != &top_cpuset)
+ cpuset_propagate_hotplug(cs);
+ }

- ndoms = generate_sched_domains(&doms, &attr);
cgroup_unlock();

- /* Have scheduler rebuild the domains */
- partition_sched_domains(ndoms, doms, attr);
+ /* rebuild sched domains if cpus_allowed has changed */
+ if (cpus_updated) {
+ struct sched_domain_attr *attr;
+ cpumask_var_t *doms;
+ int ndoms;
+
+ cgroup_lock();
+ ndoms = generate_sched_domains(&doms, &attr);
+ cgroup_unlock();
+
+ partition_sched_domains(ndoms, doms, attr);
+ }
+}
+
+void cpuset_update_active_cpus(bool cpu_online)
+{
+ cpuset_handle_hotplug();
}

#ifdef CONFIG_MEMORY_HOTPLUG
@@ -2128,31 +2137,9 @@ void cpuset_update_active_cpus(bool cpu_online)
* See cpuset_update_active_cpus() for CPU hotplug handling.
*/
static int cpuset_track_online_nodes(struct notifier_block *self,
- unsigned long action, void *arg)
+ unsigned long action, void *arg)
{
- static nodemask_t oldmems; /* protected by cgroup_mutex */
-
- cgroup_lock();
- switch (action) {
- case MEM_ONLINE:
- oldmems = top_cpuset.mems_allowed;
- mutex_lock(&callback_mutex);
- top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];
- mutex_unlock(&callback_mutex);
- update_tasks_nodemask(&top_cpuset, &oldmems, NULL);
- break;
- case MEM_OFFLINE:
- /*
- * needn't update top_cpuset.mems_allowed explicitly because
- * scan_cpusets_upon_hotplug() will update it.
- */
- scan_cpusets_upon_hotplug(&top_cpuset, CPUSET_MEM_OFFLINE);
- break;
- default:
- break;
- }
- cgroup_unlock();
-
+ cpuset_handle_hotplug();
return NOTIFY_OK;
}
#endif
--
1.7.11.7

2012-11-28 21:37:18

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 07/13] cpuset: drop async_rebuild_sched_domains()

get_online_cpus() is already nested inside cgroup_mutex from memcg
when it's registering cpu notifiers. Also, in general, we want to
make cgroup_mutex one of the outermost locks and be able to use
get_online_cpus() and friends from cgroup methods.

Currently, cpuset avoids nesting get_online_cpus() inside cgroup_mutex
by bouncing sched_domain rebuilding to a work item. As such nesting
is gonna be allowed, remove the workqueue bouncing code and always
rebuild sched_domains synchronously. This also nests
sched_domains_mutex inside cgroup_mutex, which should be okay.

Note that the CPU / memory hotplug path still grabs the two locks in
the reverse order and thus this is a deadlock hazard; however, the two
locks are already deadlock-prone and the hotplug path will be updated
by further patches.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/cpuset.c | 76 ++++++++++++---------------------------------------------
1 file changed, 16 insertions(+), 60 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 8bdd983..2049504 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -61,14 +61,6 @@
#include <linux/cgroup.h>

/*
- * Workqueue for cpuset related tasks.
- *
- * Using kevent workqueue may cause deadlock when memory_migrate
- * is set. So we create a separate workqueue thread for cpuset.
- */
-static struct workqueue_struct *cpuset_wq;
-
-/*
* Tracks how many cpusets are currently defined in system.
* When there is only one cpuset (the root cpuset) we can
* short circuit some hooks.
@@ -752,25 +744,25 @@ done:
/*
* Rebuild scheduler domains.
*
- * Call with neither cgroup_mutex held nor within get_online_cpus().
- * Takes both cgroup_mutex and get_online_cpus().
+ * If the flag 'sched_load_balance' of any cpuset with non-empty
+ * 'cpus' changes, or if the 'cpus' allowed changes in any cpuset
+ * which has that flag enabled, or if any cpuset with a non-empty
+ * 'cpus' is removed, then call this routine to rebuild the
+ * scheduler's dynamic sched domains.
*
- * Cannot be directly called from cpuset code handling changes
- * to the cpuset pseudo-filesystem, because it cannot be called
- * from code that already holds cgroup_mutex.
+ * Call with cgroup_mutex held. Takes get_online_cpus().
*/
-static void do_rebuild_sched_domains(struct work_struct *unused)
+static void rebuild_sched_domains_locked(void)
{
struct sched_domain_attr *attr;
cpumask_var_t *doms;
int ndoms;

+ WARN_ON_ONCE(!cgroup_lock_is_held());
get_online_cpus();

/* Generate domain masks and attrs */
- cgroup_lock();
ndoms = generate_sched_domains(&doms, &attr);
- cgroup_unlock();

/* Have scheduler rebuild the domains */
partition_sched_domains(ndoms, doms, attr);
@@ -778,7 +770,7 @@ static void do_rebuild_sched_domains(struct work_struct *unused)
put_online_cpus();
}
#else /* !CONFIG_SMP */
-static void do_rebuild_sched_domains(struct work_struct *unused)
+static void rebuild_sched_domains_locked(void)
{
}

@@ -790,44 +782,11 @@ static int generate_sched_domains(cpumask_var_t **domains,
}
#endif /* CONFIG_SMP */

-static DECLARE_WORK(rebuild_sched_domains_work, do_rebuild_sched_domains);
-
-/*
- * Rebuild scheduler domains, asynchronously via workqueue.
- *
- * If the flag 'sched_load_balance' of any cpuset with non-empty
- * 'cpus' changes, or if the 'cpus' allowed changes in any cpuset
- * which has that flag enabled, or if any cpuset with a non-empty
- * 'cpus' is removed, then call this routine to rebuild the
- * scheduler's dynamic sched domains.
- *
- * The rebuild_sched_domains() and partition_sched_domains()
- * routines must nest cgroup_lock() inside get_online_cpus(),
- * but such cpuset changes as these must nest that locking the
- * other way, holding cgroup_lock() for much of the code.
- *
- * So in order to avoid an ABBA deadlock, the cpuset code handling
- * these user changes delegates the actual sched domain rebuilding
- * to a separate workqueue thread, which ends up processing the
- * above do_rebuild_sched_domains() function.
- */
-static void async_rebuild_sched_domains(void)
-{
- queue_work(cpuset_wq, &rebuild_sched_domains_work);
-}
-
-/*
- * Accomplishes the same scheduler domain rebuild as the above
- * async_rebuild_sched_domains(), however it directly calls the
- * rebuild routine synchronously rather than calling it via an
- * asynchronous work thread.
- *
- * This can only be called from code that is not holding
- * cgroup_mutex (not nested in a cgroup_lock() call.)
- */
void rebuild_sched_domains(void)
{
- do_rebuild_sched_domains(NULL);
+ cgroup_lock();
+ rebuild_sched_domains_locked();
+ cgroup_unlock();
}

/**
@@ -947,7 +906,7 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
heap_free(&heap);

if (is_load_balanced)
- async_rebuild_sched_domains();
+ rebuild_sched_domains_locked();
return 0;
}

@@ -1195,7 +1154,7 @@ static int update_relax_domain_level(struct cpuset *cs, s64 val)
cs->relax_domain_level = val;
if (!cpumask_empty(cs->cpus_allowed) &&
is_sched_load_balance(cs))
- async_rebuild_sched_domains();
+ rebuild_sched_domains_locked();
}

return 0;
@@ -1287,7 +1246,7 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
mutex_unlock(&callback_mutex);

if (!cpumask_empty(trialcs->cpus_allowed) && balance_flag_changed)
- async_rebuild_sched_domains();
+ rebuild_sched_domains_locked();

if (spread_flag_changed)
update_tasks_flags(cs, &heap);
@@ -1905,7 +1864,7 @@ static void cpuset_css_offline(struct cgroup *cgrp)
/*
* If the cpuset being removed has its flag 'sched_load_balance'
* enabled, then simulate turning sched_load_balance off, which
- * will call async_rebuild_sched_domains().
+ * will call rebuild_sched_domains_locked().
*/

static void cpuset_css_free(struct cgroup *cont)
@@ -2210,9 +2169,6 @@ void __init cpuset_init_smp(void)
top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];

hotplug_memory_notifier(cpuset_track_online_nodes, 10);
-
- cpuset_wq = create_singlethread_workqueue("cpuset");
- BUG_ON(!cpuset_wq);
}

/**
--
1.7.11.7

2012-11-28 21:34:41

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 04/13] cpuset: introduce CS_ONLINE

Add CS_ONLINE which is set from css_online() and cleared from
css_offline(). This will enable using the generic cgroup iterator while
decoupling cpuset from cgroup internal locking.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/cpuset.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 70197ba..22b521d 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -138,6 +138,7 @@ static inline bool task_has_mempolicy(struct task_struct *task)

/* bits in struct cpuset flags field */
typedef enum {
+ CS_ONLINE,
CS_CPU_EXCLUSIVE,
CS_MEM_EXCLUSIVE,
CS_MEM_HARDWALL,
@@ -154,6 +155,11 @@ enum hotplug_event {
};

/* convenient tests for these bits */
+static inline bool is_cpuset_online(const struct cpuset *cs)
+{
+ return test_bit(CS_ONLINE, &cs->flags);
+}
+
static inline int is_cpu_exclusive(const struct cpuset *cs)
{
return test_bit(CS_CPU_EXCLUSIVE, &cs->flags);
@@ -190,7 +196,8 @@ static inline int is_spread_slab(const struct cpuset *cs)
}

static struct cpuset top_cpuset = {
- .flags = ((1 << CS_CPU_EXCLUSIVE) | (1 << CS_MEM_EXCLUSIVE)),
+ .flags = ((1 << CS_ONLINE) | (1 << CS_CPU_EXCLUSIVE) |
+ (1 << CS_MEM_EXCLUSIVE)),
};

/*
@@ -1822,6 +1829,7 @@ static int cpuset_css_online(struct cgroup *cgrp)
if (!parent)
return 0;

+ set_bit(CS_ONLINE, &cs->flags);
if (is_spread_page(parent))
set_bit(CS_SPREAD_PAGE, &cs->flags);
if (is_spread_slab(parent))
@@ -1871,6 +1879,7 @@ static void cpuset_css_offline(struct cgroup *cgrp)
update_flag(CS_SCHED_LOAD_BALANCE, cs, 0);

number_of_cpusets--;
+ clear_bit(CS_ONLINE, &cs->flags);

cgroup_unlock();
}
--
1.7.11.7

2012-11-28 21:38:17

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 03/13] cpuset: introduce ->css_on/offline()

Add cpuset_css_on/offline() and rearrange css init/exit such that,

* Allocation and clearing to the default values happen in css_alloc().
Allocation now uses kzalloc().

* Config inheritance and registration happen in css_online().

* css_offline() undoes what css_online() did.

* css_free() frees.

This doesn't introduce any visible behavior changes. This will help
clean up locking.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/cpuset.c | 66 ++++++++++++++++++++++++++++++++++++++-------------------
1 file changed, 44 insertions(+), 22 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 54b2b73..70197ba 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1790,15 +1790,12 @@ static struct cftype files[] = {

static struct cgroup_subsys_state *cpuset_css_alloc(struct cgroup *cont)
{
- struct cgroup *parent_cg = cont->parent;
- struct cgroup *tmp_cg;
- struct cpuset *parent, *cs;
+ struct cpuset *cs;

- if (!parent_cg)
+ if (!cont->parent)
return &top_cpuset.css;
- parent = cgroup_cs(parent_cg);

- cs = kmalloc(sizeof(*cs), GFP_KERNEL);
+ cs = kzalloc(sizeof(*cs), GFP_KERNEL);
if (!cs)
return ERR_PTR(-ENOMEM);
if (!alloc_cpumask_var(&cs->cpus_allowed, GFP_KERNEL)) {
@@ -1806,22 +1803,34 @@ static struct cgroup_subsys_state *cpuset_css_alloc(struct cgroup *cont)
return ERR_PTR(-ENOMEM);
}

- cs->flags = 0;
- if (is_spread_page(parent))
- set_bit(CS_SPREAD_PAGE, &cs->flags);
- if (is_spread_slab(parent))
- set_bit(CS_SPREAD_SLAB, &cs->flags);
set_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
cpumask_clear(cs->cpus_allowed);
nodes_clear(cs->mems_allowed);
fmeter_init(&cs->fmeter);
cs->relax_domain_level = -1;
+ cs->parent = cgroup_cs(cont->parent);
+
+ return &cs->css;
+}
+
+static int cpuset_css_online(struct cgroup *cgrp)
+{
+ struct cpuset *cs = cgroup_cs(cgrp);
+ struct cpuset *parent = cs->parent;
+ struct cgroup *tmp_cg;
+
+ if (!parent)
+ return 0;
+
+ if (is_spread_page(parent))
+ set_bit(CS_SPREAD_PAGE, &cs->flags);
+ if (is_spread_slab(parent))
+ set_bit(CS_SPREAD_SLAB, &cs->flags);

- cs->parent = parent;
number_of_cpusets++;

- if (!test_bit(CGRP_CPUSET_CLONE_CHILDREN, &cont->flags))
- goto skip_clone;
+ if (!test_bit(CGRP_CPUSET_CLONE_CHILDREN, &cgrp->flags))
+ return 0;

/*
* Clone @parent's configuration if CGRP_CPUSET_CLONE_CHILDREN is
@@ -1836,19 +1845,34 @@ static struct cgroup_subsys_state *cpuset_css_alloc(struct cgroup *cont)
* changed to grant parent->cpus_allowed-sibling_cpus_exclusive
* (and likewise for mems) to the new cgroup.
*/
- list_for_each_entry(tmp_cg, &parent_cg->children, sibling) {
+ list_for_each_entry(tmp_cg, &cgrp->parent->children, sibling) {
struct cpuset *tmp_cs = cgroup_cs(tmp_cg);

if (is_mem_exclusive(tmp_cs) || is_cpu_exclusive(tmp_cs))
- goto skip_clone;
+ return 0;
}

mutex_lock(&callback_mutex);
cs->mems_allowed = parent->mems_allowed;
cpumask_copy(cs->cpus_allowed, parent->cpus_allowed);
mutex_unlock(&callback_mutex);
-skip_clone:
- return &cs->css;
+
+ return 0;
+}
+
+static void cpuset_css_offline(struct cgroup *cgrp)
+{
+ struct cpuset *cs = cgroup_cs(cgrp);
+
+ /* css_offline is called w/o cgroup_mutex, grab it */
+ cgroup_lock();
+
+ if (is_sched_load_balance(cs))
+ update_flag(CS_SCHED_LOAD_BALANCE, cs, 0);
+
+ number_of_cpusets--;
+
+ cgroup_unlock();
}

/*
@@ -1861,10 +1885,6 @@ static void cpuset_css_free(struct cgroup *cont)
{
struct cpuset *cs = cgroup_cs(cont);

- if (is_sched_load_balance(cs))
- update_flag(CS_SCHED_LOAD_BALANCE, cs, 0);
-
- number_of_cpusets--;
free_cpumask_var(cs->cpus_allowed);
kfree(cs);
}
@@ -1872,6 +1892,8 @@ static void cpuset_css_free(struct cgroup *cont)
struct cgroup_subsys cpuset_subsys = {
.name = "cpuset",
.css_alloc = cpuset_css_alloc,
+ .css_online = cpuset_css_online,
+ .css_offline = cpuset_css_offline,
.css_free = cpuset_css_free,
.can_attach = cpuset_can_attach,
.attach = cpuset_attach,
--
1.7.11.7

2012-11-29 11:15:10

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCHSET cgroup/for-3.8] cpuset: decouple cpuset locking from cgroup core

On 11/29/2012 01:34 AM, Tejun Heo wrote:
> This patchset decouples cpuset locking from cgroup_mutex. After the
> patchset, cpuset uses cpuset-specific cpuset_mutex instead of
> cgroup_mutex. This also removes the lockdep warning triggered during
> cpu offlining (see 0009).
>
> Note that this leaves memcg as the only external user of cgroup_mutex.
> Michal, Kame, can you guys please convert memcg to use its own locking
> too?

Not totally. There is still one mention of cgroup_lock():

static void cpuset_do_move_task(struct task_struct *tsk,
struct cgroup_scanner *scan)
{
struct cgroup *new_cgroup = scan->data;

cgroup_lock();
cgroup_attach_task(new_cgroup, tsk);
cgroup_unlock();
}

And a similar problem to this is the one we have in memcg: we need to
somehow guarantee that no tasks will join the cgroup for some time -
this is why we hold the lock in memcg.

Just calling a function that internally calls the cgroup lock won't do
much, since it won't solve any dependencies - where it is called matters
little.

What I'll try to do is to come up with another specialized lock in cgroup
just for this case. So after taking the cgroup lock, we would also take
an extra lock if we are adding another entry - be it task or children -
to the cgroup.

cpuset and memcg could then take that lock as well, explicitly or
implicitly.

How does it sound?

2012-11-29 14:26:56

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET cgroup/for-3.8] cpuset: decouple cpuset locking from cgroup core

Hello, Glauber.

On Thu, Nov 29, 2012 at 03:14:41PM +0400, Glauber Costa wrote:
> On 11/29/2012 01:34 AM, Tejun Heo wrote:
> > This patchset decouples cpuset locking from cgroup_mutex. After the
> > patchset, cpuset uses cpuset-specific cpuset_mutex instead of
> > cgroup_mutex. This also removes the lockdep warning triggered during
> > cpu offlining (see 0009).
> >
> > Note that this leaves memcg as the only external user of cgroup_mutex.
> > Michal, Kame, can you guys please convert memcg to use its own locking
> > too?
>
> Not totally. There is still one mention to the cgroup_lock():
>
> static void cpuset_do_move_task(struct task_struct *tsk,
> struct cgroup_scanner *scan)
> {
> struct cgroup *new_cgroup = scan->data;
>
> cgroup_lock();
> cgroup_attach_task(new_cgroup, tsk);
> cgroup_unlock();
> }

Which is outside all other locks and scheduled to be moved inside
cgroup_attach_task().

> And similar problem to this, is the one we have in memcg: We need to
> somehow guarantee that no tasks will join the cgroup for some time -
> this is why we hold the lock in memcg.
>
> Just calling a function that internally calls the cgroup lock won't do
> much, since it won't solve any dependencies - where it is called matters
> little.

The dependency is already solved in cpuset.

> What I'll try to do, is to come with another specialized lock in cgroup
> just for this case. So after taking the cgroup lock, we would also take
> an extra lock if we are adding another entry - be it task or children -
> to the cgroup.

No, please don't do that. Just don't invoke cgroup operation inside
any subsystem lock.

Thanks.

--
tejun

2012-11-29 14:36:58

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET cgroup/for-3.8] cpuset: decouple cpuset locking from cgroup core

On Thu, Nov 29, 2012 at 06:26:46AM -0800, Tejun Heo wrote:
> > What I'll try to do, is to come with another specialized lock in cgroup
> > just for this case. So after taking the cgroup lock, we would also take
> > an extra lock if we are adding another entry - be it task or children -
> > to the cgroup.
>
> No, please don't do that. Just don't invoke cgroup operation inside
> any subsystem lock.

To add a bit, you won't be solving any problem by adding more locks
here. cpuset wants to initiate task cgroup migration. It doesn't
matter how many locks cgroup uses internally. You'll have to grab
them all anyway to do that. It's not a problem caused by granularity
of cgroup_lock at all, so there just isn't any logic in dividing locks
for this. So, again, please don't go that direction. What we need to
do is isolate subsystem locking and implementation from cgroup
internals, not complicate cgroup internals even more, and now we
have a good enough API to achieve such isolation.

Thanks.

--
tejun

2012-11-30 03:22:36

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCHSET cgroup/for-3.8] cpuset: decouple cpuset locking from cgroup core

(2012/11/29 6:34), Tejun Heo wrote:
> [...]
>
> Note that this leaves memcg as the only external user of cgroup_mutex.
> Michal, Kame, can you guys please convert memcg to use its own locking
> too?
>

Hmm, let me see.... At a quick glance, cgroup_lock() is used at:
hierarchy policy change
kmem_limit
migration policy change
swappiness change
oom control

Because all of the above take care of changes in the hierarchy,
having a new memcg mutex in ->create() may be a way.

Ah, hm, Costa is mentioning task-attach. Is the task-attach problem in memcg?

Thanks,
-Kame

2012-11-30 08:33:59

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCHSET cgroup/for-3.8] cpuset: decouple cpuset locking from cgroup core

On Fri 30-11-12 12:21:32, KAMEZAWA Hiroyuki wrote:
> (2012/11/29 6:34), Tejun Heo wrote:
> > [...]
> >
> > Note that this leaves memcg as the only external user of cgroup_mutex.
> > Michal, Kame, can you guys please convert memcg to use its own locking
> > too?
> >
>
> Hmm. let me see....at quick glance cgroup_lock() is used at
> hierarchy policy change
> kmem_limit
> migration policy change
> swapiness change
> oom control
>
> Because all aboves takes care of changes in hierarchy,
> Having a new memcg's mutex in ->create() may be a way.
>
> Ah, hm, Costa is mentioning task-attach. is the task-attach problem in
> memcg ?

Yes, because we do not want to leak charges if we race with one of the
above hierarchy operations. Swappiness and oom control are not a big
deal. Same applies to migration policy change.
Those could be solved by using the same memcg lock in the attach hook.
Hierarchy policy change would be a bigger issue because the task is
already linked to the group when the callback is called. Same applies to
kmem_limit. Sorry I didn't have time to look into this deeper so I
cannot offer any solution right now.
--
Michal Hocko
SUSE Labs

2012-11-30 09:01:03

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCHSET cgroup/for-3.8] cpuset: decouple cpuset locking from cgroup core

On 11/30/2012 07:21 AM, Kamezawa Hiroyuki wrote:
> (2012/11/29 6:34), Tejun Heo wrote:
>> [...]
>>
>> Note that this leaves memcg as the only external user of cgroup_mutex.
>> Michal, Kame, can you guys please convert memcg to use its own locking
>> too?
>>
>
> Hmm. let me see....at quick glance cgroup_lock() is used at
> hierarchy policy change
> kmem_limit
> migration policy change
> swapiness change
> oom control
>
> Because all aboves takes care of changes in hierarchy,
> Having a new memcg's mutex in ->create() may be a way.
>
> Ah, hm, Costa is mentioning task-attach. is the task-attach problem in memcg ?
>

We disallow the kmem limit to be set if a task already exists in the
cgroup. So we can't allow a new task to attach if we are setting the limit.

2012-11-30 09:24:41

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCHSET cgroup/for-3.8] cpuset: decouple cpuset locking from cgroup core

On Fri 30-11-12 13:00:36, Glauber Costa wrote:
> On 11/30/2012 07:21 AM, Kamezawa Hiroyuki wrote:
> > (2012/11/29 6:34), Tejun Heo wrote:
> >> [...]
> >>
> >> Note that this leaves memcg as the only external user of cgroup_mutex.
> >> Michal, Kame, can you guys please convert memcg to use its own locking
> >> too?
> >>
> >
> > Hmm. let me see....at quick glance cgroup_lock() is used at
> > hierarchy policy change
> > kmem_limit
> > migration policy change
> > swapiness change
> > oom control
> >
> > Because all aboves takes care of changes in hierarchy,
> > Having a new memcg's mutex in ->create() may be a way.
> >
> > Ah, hm, Costa is mentioning task-attach. is the task-attach problem in memcg ?
> >
>
> We disallow the kmem limit to be set if a task already exists in the
> cgroup. So we can't allow a new task to attach if we are setting the limit.

This is racy without additional locking, isn't it?

--
Michal Hocko
SUSE Labs

2012-11-30 09:33:52

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCHSET cgroup/for-3.8] cpuset: decouple cpuset locking from cgroup core

On 11/30/2012 01:24 PM, Michal Hocko wrote:
> On Fri 30-11-12 13:00:36, Glauber Costa wrote:
>> On 11/30/2012 07:21 AM, Kamezawa Hiroyuki wrote:
>>> (2012/11/29 6:34), Tejun Heo wrote:
>>>> [...]
>>>>
>>>> Note that this leaves memcg as the only external user of cgroup_mutex.
>>>> Michal, Kame, can you guys please convert memcg to use its own locking
>>>> too?
>>>>
>>>
>>> Hmm. let me see....at quick glance cgroup_lock() is used at
>>> hierarchy policy change
>>> kmem_limit
>>> migration policy change
>>> swapiness change
>>> oom control
>>>
>>> Because all aboves takes care of changes in hierarchy,
>>> Having a new memcg's mutex in ->create() may be a way.
>>>
>>> Ah, hm, Costa is mentioning task-attach. is the task-attach problem in memcg ?
>>>
>>
>> We disallow the kmem limit to be set if a task already exists in the
>> cgroup. So we can't allow a new task to attach if we are setting the limit.
>
> This is racy without additional locking, isn't it?
>
Apparently, the way Tejun fixed this for cpuset was by using the
"attach_in_progress" indicator, that IIUC, is flipped up in
->can_attach, and down in ->attach.

A similar scheme would work for us.

And we should also be using a similar scheme for cgroup creation:
the css is not really connected to the parent until after
memcg_alloc_css. So if we use the memcg iterator to figure out if
children exist, we may get a race where we believe no children exist,
but one appears right after.
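
For illustration, a rough sketch of what such a guard could look like on
the memcg side -- memcg_mutex, the attach_in_progress field and the
function names here are assumptions made for the sake of the example,
not existing memcg code:

static DEFINE_MUTEX(memcg_mutex);

static int memcg_can_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
{
        struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);

        mutex_lock(&memcg_mutex);
        /* pin the cgroup: refuse limit changes until ->attach() finishes */
        memcg->attach_in_progress++;
        mutex_unlock(&memcg_mutex);
        return 0;
}

static void memcg_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
{
        struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);

        mutex_lock(&memcg_mutex);
        memcg->attach_in_progress--;
        mutex_unlock(&memcg_mutex);
}

static int memcg_set_kmem_limit(struct mem_cgroup *memcg, u64 val)
{
        int ret = -EBUSY;

        mutex_lock(&memcg_mutex);
        /* refuse while tasks are in the group or on their way in */
        if (!cgroup_task_count(memcg->css.cgroup) &&
            !memcg->attach_in_progress)
                ret = 0;        /* actually update the limit here */
        mutex_unlock(&memcg_mutex);
        return ret;
}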

2012-11-30 09:42:37

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCHSET cgroup/for-3.8] cpuset: decouple cpuset locking from cgroup core

On 11/30/2012 01:24 PM, Michal Hocko wrote:
> On Fri 30-11-12 13:00:36, Glauber Costa wrote:
>> On 11/30/2012 07:21 AM, Kamezawa Hiroyuki wrote:
>>> (2012/11/29 6:34), Tejun Heo wrote:
>>>> [...]
>>>>
>>>> Note that this leaves memcg as the only external user of cgroup_mutex.
>>>> Michal, Kame, can you guys please convert memcg to use its own locking
>>>> too?
>>>>
>>>
>>> Hmm. let me see....at quick glance cgroup_lock() is used at
>>> hierarchy policy change
>>> kmem_limit
>>> migration policy change
>>> swapiness change
>>> oom control
>>>
>>> Because all aboves takes care of changes in hierarchy,
>>> Having a new memcg's mutex in ->create() may be a way.
>>>
>>> Ah, hm, Costa is mentioning task-attach. is the task-attach problem in memcg ?
>>>
>>
>> We disallow the kmem limit to be set if a task already exists in the
>> cgroup. So we can't allow a new task to attach if we are setting the limit.
>
> This is racy without additional locking, isn't it?
>
Speaking of it: Tejun's tree still lacks the kmem bits. How hard would
it be for you to merge his branch into a temporary branch of your tree?

2012-11-30 09:50:09

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCHSET cgroup/for-3.8] cpuset: decouple cpuset locking from cgroup core

On Fri 30-11-12 13:42:28, Glauber Costa wrote:
[...]
> Speaking of it: Tejun's tree still lacks the kmem bits. How hard would
> it be for you to merge his branch into a temporary branch of your tree?

review-cpuset-locking is based on post-merge-window merges so I cannot
merge it as is. I could cherry-pick the series after it is settled. I
have no idea how many conflicts this would bring, though.
--
Michal Hocko
SUSE Labs

2012-11-30 10:00:29

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCHSET cgroup/for-3.8] cpuset: decouple cpuset locking from cgroup core

On 11/30/2012 01:49 PM, Michal Hocko wrote:
> On Fri 30-11-12 13:42:28, Glauber Costa wrote:
> [...]
>> Speaking of it: Tejun's tree still lacks the kmem bits. How hard would
>> it be for you to merge his branch into a temporary branch of your tree?
>
> review-cpuset-locking is based on a post merge window merges so I cannot
> merge it as is. I could cherry-pick the series after it is settled. I
> have no idea how much conflicts this would bring, though.
>
Ok.

I believe the task problem only exists for us for kmem. So I could come
up with a patchset that only deals with child cgroup creation, and
ignores attach for now. So long as we have a mechanism that will work for
it, and don't get lost and forget to patch it when the trees are merged.

Now, what I am actually seeing with cgroup creation is that the
children will copy a lot of the values from the parent, like swappiness,
hierarchy, etc. Once the child copies it, we should no longer be able to
change those values in the parent: otherwise we'll get funny things like
parent.use_hierarchy = 1, child.use_hierarchy = 0.

One option is to take a global lock in memcg_alloc_css(), and keep it
locked until we have done all the cgroup bookkeeping, and then unlock it in
css_online. But I am guessing Tejun won't like it very much.

What do you think about a children counter? If we are going to do things
similar to the attach_in_progress of cpuset, we might very well turn it
into a direct counter so we don't have to iterate at all.

The code would look like: (simplified example for use_hierarchy)

memcg_lock();
if (memcg->nr_children != 0) {
        memcg_unlock();
        return -EINVAL;
}
memcg->use_hierarchy = val;
memcg_unlock();

2012-11-30 14:59:33

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET cgroup/for-3.8] cpuset: decouple cpuset locking from cgroup core

Hello,

On Fri, Nov 30, 2012 at 02:00:21PM +0400, Glauber Costa wrote:
> Now, what I am actually seeing with cgroup creation, is that the
> children will copy a lot of the values from the parent, like swappiness,
> hierarchy, etc. Once the child copies it, we should no longer be able to
> change those values in the parent: otherwise we'll get funny things like
> parent.use_hierarchy = 1, child.use_hierarchy = 0.

So, the best way to do this is from ->css_online(). If memcg
synchronizes and inherits from ->css_online(), it can guarantee that
the new cgroup will be visible in any following iterations. Just have
an online flag which is turned on and off from ->css_on/offline() and
ignore any cgroups w/o online set.
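
As a rough sketch of that pattern (memcg_mutex and the ->online field
are assumptions for illustration, not existing memcg code):

static DEFINE_MUTEX(memcg_mutex);

static int memcg_css_online(struct cgroup *cgrp)
{
        struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
        struct mem_cgroup *parent =
                cgrp->parent ? mem_cgroup_from_cont(cgrp->parent) : NULL;

        mutex_lock(&memcg_mutex);
        if (parent) {
                /* inherit under the lock so the parent can't change underneath */
                memcg->use_hierarchy = parent->use_hierarchy;
        }
        /* from this point on iterations may see and act on this group */
        memcg->online = true;
        mutex_unlock(&memcg_mutex);
        return 0;
}

static void memcg_css_offline(struct cgroup *cgrp)
{
        struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);

        mutex_lock(&memcg_mutex);
        /* iterations and config changes skip !online groups */
        memcg->online = false;
        mutex_unlock(&memcg_mutex);
}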

> One option is to take a global lock in memcg_alloc_css(), and keep it
> locked until we did all the cgroup bookkeeping, and then unlock it in
> css_online. But I am guessing Tejun won't like it very much.

No, please *NEVER* *EVER* do that. You'll be creating a bunch of
locking dependencies as cgroup walks through different controllers.

memcg should be able to synchronize fully both css on/offlining and
task attachments in memcg proper. Let's please be boring about
locking.

Thanks.

--
tejun

2012-11-30 15:10:10

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCHSET cgroup/for-3.8] cpuset: decouple cpuset locking from cgroup core

On 11/30/2012 06:59 PM, Tejun Heo wrote:
> Hello,
>
> On Fri, Nov 30, 2012 at 02:00:21PM +0400, Glauber Costa wrote:
>> Now, what I am actually seeing with cgroup creation, is that the
>> children will copy a lot of the values from the parent, like swappiness,
>> hierarchy, etc. Once the child copies it, we should no longer be able to
>> change those values in the parent: otherwise we'll get funny things like
>> parent.use_hierarchy = 1, child.use_hierarchy = 0.
>
> So, the best way to do this is from ->css_online(). If memcg
> synchronizes and inherits from ->css_online(), it can guarantee that
> the new cgroup will be visible in any following iterations. Just have
> an online flag which is turned on and off from ->css_on/offline() and
> ignore any cgroups w/o online set.
>
>> One option is to take a global lock in memcg_alloc_css(), and keep it
>> locked until we did all the cgroup bookkeeping, and then unlock it in
>> css_online. But I am guessing Tejun won't like it very much.
>
> No, please *NEVER* *EVER* do that. You'll be creating a bunch of
> locking dependencies as cgroup walks through different controllers.
>
> memcg should be able to synchornize fully both css on/offlining and
> task attachments in memcg proper. Let's please be boring about
> locking.
>

Of course, that was a purely rhetorical statement, as indicated by
"Tejun won't like it very much" =p

Take a look at the final result, which I just posted a couple of hours ago.
Let me know if there is still something extremely funny, and I'll look
into fixing it.