2008-08-01 22:59:24

by Max Krasnyansky

Subject: [PATCH] cpuset: Rework sched domains and CPU hotplug handling (2.6.27-rc1)

This is an updated version of my previous cpuset patch:
"Make rebuild_sched_domains() usable from any context (take 2)"
It's been merged with the latest cpuset changes in mainline.

rebuild_sched_domains() is the only way to rebuild sched domains
correctly based on the current cpuset settings. What this means
is that we need to be able to call it from different contexts,
such as cpu hotplug.
Also, the latest scheduler code in -tip now calls rebuild_sched_domains()
directly from functions like arch_reinit_sched_domains().

In order to support that properly we need to rework cpuset locking
rules to avoid circular dependencies, which is what this patch does.
New lock nesting rules are explained in the comments.
We can now safely call rebuild_sched_domains() from virtually any
context. The only requirement is that it needs to be called under
get_online_cpus(). This allows cpu hotplug handlers and the scheduler
to call rebuild_sched_domains() directly.
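For example, a caller outside the cpuset code now simply does
(a condensed illustration; the real code is in the hunks below):

	get_online_cpus();
	rebuild_sched_domains();
	put_online_cpus();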

The rest of the cpuset code now offloads sched domain rebuilds to
a single-threaded workqueue, as suggested by both Paul J. and Paul M.
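
Condensed from the patch below, the offload pattern is:

	/* Simply calls rebuild_sched_domains() with correct locking rules */
	static void rebuild_domains_callback(struct work_struct *work)
	{
		get_online_cpus();
		rebuild_sched_domains();
		put_online_cpus();
	}
	static DECLARE_WORK(rebuild_domains_work, rebuild_domains_callback);

	static void async_rebuild_sched_domains(void)
	{
		queue_work(cpuset_wq, &rebuild_domains_work);
	}

cpuset-internal paths use async_rebuild_sched_domains() because they
are already holding cgroup_lock() and must not take get_online_cpus()
under it.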

This version of the patch addresses comments from the previous review.
I fixed all mis-formatted comments and trailing spaces.
__rebuild_sched_domains() has been renamed to async_rebuild_sched_domains().

I also factored out the code that builds domain masks and split up CPU and
memory hotplug handling. This was needed to simplify locking, to avoid unsafe
access to the cpu_online_map from the mem hotplug handler, and in general to make
things cleaner.

The patch passes moderate testing (building a kernel with -j 16, creating &
removing domains, and bringing cpus off/online at the same time) on a
quad-core2 based machine.
It passes lockdep checks, even with preemptable RCU enabled.
This time I also tested the suspend/resume path, and everything works
as expected.

Signed-off-by: Max Krasnyansky <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
kernel/cpuset.c | 297 +++++++++++++++++++++++++++++++------------------------
1 files changed, 168 insertions(+), 129 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index d5ab79c..e655cea 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -14,6 +14,8 @@
* 2003-10-22 Updates by Stephen Hemminger.
* 2004 May-July Rework by Paul Jackson.
* 2006 Rework by Paul Menage to use generic cgroups
+ * 2008 Rework of the scheduler domains and CPU hotplug handling
+ * by Max Krasnyansky
*
* This file is subject to the terms and conditions of the GNU General Public
* License. See the file COPYING in the main directory of the Linux
@@ -58,6 +60,11 @@
#include <linux/cgroup.h>

/*
+ * Workqueue for cpuset related tasks.
+ */
+static struct workqueue_struct *cpuset_wq;
+
+/*
* Tracks how many cpusets are currently defined in system.
* When there is only one cpuset (the root cpuset) we can
* short circuit some hooks.
@@ -236,9 +243,11 @@ static struct cpuset top_cpuset = {

static DEFINE_MUTEX(callback_mutex);

-/* This is ugly, but preserves the userspace API for existing cpuset
+/*
+ * This is ugly, but preserves the userspace API for existing cpuset
* users. If someone tries to mount the "cpuset" filesystem, we
- * silently switch it to mount "cgroup" instead */
+ * silently switch it to mount "cgroup" instead
+ */
static int cpuset_get_sb(struct file_system_type *fs_type,
int flags, const char *unused_dev_name,
void *data, struct vfsmount *mnt)
@@ -473,10 +482,9 @@ static int validate_change(const struct cpuset *cur, const struct cpuset *trial)
}

/*
- * Helper routine for rebuild_sched_domains().
+ * Helper routine for generate_sched_domains().
* Do cpusets a, b have overlapping cpus_allowed masks?
*/
-
static int cpusets_overlap(struct cpuset *a, struct cpuset *b)
{
return cpus_intersects(a->cpus_allowed, b->cpus_allowed);
@@ -518,26 +526,15 @@ update_domain_attr_tree(struct sched_domain_attr *dattr, struct cpuset *c)
}

/*
- * rebuild_sched_domains()
- *
- * This routine will be called to rebuild the scheduler's dynamic
- * sched domains:
- * - if the flag 'sched_load_balance' of any cpuset with non-empty
- * 'cpus' changes,
- * - or if the 'cpus' allowed changes in any cpuset which has that
- * flag enabled,
- * - or if the 'sched_relax_domain_level' of any cpuset which has
- * that flag enabled and with non-empty 'cpus' changes,
- * - or if any cpuset with non-empty 'cpus' is removed,
- * - or if a cpu gets offlined.
- *
- * This routine builds a partial partition of the systems CPUs
- * (the set of non-overlappping cpumask_t's in the array 'part'
- * below), and passes that partial partition to the kernel/sched.c
- * partition_sched_domains() routine, which will rebuild the
- * schedulers load balancing domains (sched domains) as specified
- * by that partial partition. A 'partial partition' is a set of
- * non-overlapping subsets whose union is a subset of that set.
+ * generate_sched_domains()
+ *
+ * This function builds a partial partition of the systems CPUs
+ * A 'partial partition' is a set of non-overlapping subsets whose
+ * union is a subset of that set.
+ * The output of this function needs to be passed to kernel/sched.c
+ * partition_sched_domains() routine, which will rebuild the scheduler's
+ * load balancing domains (sched domains) as specified by that partial
+ * partition.
*
* See "What is sched_load_balance" in Documentation/cpusets.txt
* for a background explanation of this.
@@ -547,13 +544,7 @@ update_domain_attr_tree(struct sched_domain_attr *dattr, struct cpuset *c)
* domains when operating in the severe memory shortage situations
* that could cause allocation failures below.
*
- * Call with cgroup_mutex held. May take callback_mutex during
- * call due to the kfifo_alloc() and kmalloc() calls. May nest
- * a call to the get_online_cpus()/put_online_cpus() pair.
- * Must not be called holding callback_mutex, because we must not
- * call get_online_cpus() while holding callback_mutex. Elsewhere
- * the kernel nests callback_mutex inside get_online_cpus() calls.
- * So the reverse nesting would risk an ABBA deadlock.
+ * Must be called with cgroup_lock held.
*
* The three key local variables below are:
* q - a linked-list queue of cpuset pointers, used to implement a
@@ -588,8 +579,8 @@ update_domain_attr_tree(struct sched_domain_attr *dattr, struct cpuset *c)
* element of the partition (one sched domain) to be passed to
* partition_sched_domains().
*/
-
-void rebuild_sched_domains(void)
+static int generate_sched_domains(cpumask_t **domains,
+ struct sched_domain_attr **attributes)
{
LIST_HEAD(q); /* queue of cpusets to be scanned*/
struct cpuset *cp; /* scans q */
@@ -601,23 +592,26 @@ void rebuild_sched_domains(void)
int ndoms; /* number of sched domains in result */
int nslot; /* next empty doms[] cpumask_t slot */

- csa = NULL;
- doms = NULL;
+ ndoms = 0;
+ doms = NULL;
dattr = NULL;
+ csa = NULL;

/* Special case for the 99% of systems with one, full, sched domain */
if (is_sched_load_balance(&top_cpuset)) {
- ndoms = 1;
doms = kmalloc(sizeof(cpumask_t), GFP_KERNEL);
if (!doms)
- goto rebuild;
+ goto done;
+
dattr = kmalloc(sizeof(struct sched_domain_attr), GFP_KERNEL);
if (dattr) {
*dattr = SD_ATTR_INIT;
update_domain_attr_tree(dattr, &top_cpuset);
}
*doms = top_cpuset.cpus_allowed;
- goto rebuild;
+
+ ndoms = 1;
+ goto done;
}

csa = kmalloc(number_of_cpusets * sizeof(cp), GFP_KERNEL);
@@ -680,61 +674,123 @@ restart:
}
}

- /* Convert <csn, csa> to <ndoms, doms> */
+ /*
+ * Now we know how many domains to create.
+ * Convert <csn, csa> to <ndoms, doms> and populate cpu masks.
+ */
doms = kmalloc(ndoms * sizeof(cpumask_t), GFP_KERNEL);
- if (!doms)
- goto rebuild;
+ if (!doms) {
+ ndoms = 0;
+ goto done;
+ }
+
dattr = kmalloc(ndoms * sizeof(struct sched_domain_attr), GFP_KERNEL);

for (nslot = 0, i = 0; i < csn; i++) {
struct cpuset *a = csa[i];
+ cpumask_t *dp;
int apn = a->pn;

- if (apn >= 0) {
- cpumask_t *dp = doms + nslot;
-
- if (nslot == ndoms) {
- static int warnings = 10;
- if (warnings) {
- printk(KERN_WARNING
- "rebuild_sched_domains confused:"
- " nslot %d, ndoms %d, csn %d, i %d,"
- " apn %d\n",
- nslot, ndoms, csn, i, apn);
- warnings--;
- }
- continue;
+ if (apn < 0) {
+ /* Skip completed partitions */
+ continue;
+ }
+
+ dp = doms + nslot;
+
+ if (nslot == ndoms) {
+ static int warnings = 10;
+ if (warnings) {
+ printk(KERN_WARNING
+ "rebuild_sched_domains confused:"
+ " nslot %d, ndoms %d, csn %d, i %d,"
+ " apn %d\n",
+ nslot, ndoms, csn, i, apn);
+ warnings--;
}
+ continue;
+ }

- cpus_clear(*dp);
- if (dattr)
- *(dattr + nslot) = SD_ATTR_INIT;
- for (j = i; j < csn; j++) {
- struct cpuset *b = csa[j];
-
- if (apn == b->pn) {
- cpus_or(*dp, *dp, b->cpus_allowed);
- b->pn = -1;
- if (dattr)
- update_domain_attr_tree(dattr
- + nslot, b);
- }
+ cpus_clear(*dp);
+ if (dattr)
+ *(dattr + nslot) = SD_ATTR_INIT;
+ for (j = i; j < csn; j++) {
+ struct cpuset *b = csa[j];
+
+ if (apn == b->pn) {
+ cpus_or(*dp, *dp, b->cpus_allowed);
+ if (dattr)
+ update_domain_attr_tree(dattr + nslot, b);
+
+ /* Done with this partition */
+ b->pn = -1;
}
- nslot++;
}
+ nslot++;
}
BUG_ON(nslot != ndoms);

-rebuild:
- /* Have scheduler rebuild sched domains */
+done:
+ kfree(csa);
+
+ *domains = doms;
+ *attributes = dattr;
+ return ndoms;
+}
+
+/*
+ * Rebuilds scheduler domains. See generate_sched_domains() description
+ * for details.
+ *
+ * Must be called under get_online_cpus(). This is an external interface
+ * and must not be used inside the cpuset code to avoid circular locking
+ * dependency. cpuset code should use async_rebuild_sched_domains() instead.
+ */
+void rebuild_sched_domains(void)
+{
+ struct sched_domain_attr *attr;
+ cpumask_t *doms;
+ int ndoms;
+
+ /*
+ * We have to iterate the cgroup hierarchy, which requires the cgroup lock.
+ */
+ cgroup_lock();
+ ndoms = generate_sched_domains(&doms, &attr);
+ cgroup_unlock();
+
+ /* Have scheduler rebuild the domains */
+ partition_sched_domains(ndoms, doms, attr);
+}
+
+/*
+ * Simply calls rebuild_sched_domains() with correct locking rules.
+ */
+static void rebuild_domains_callback(struct work_struct *work)
+{
get_online_cpus();
- partition_sched_domains(ndoms, doms, dattr);
+ rebuild_sched_domains();
put_online_cpus();
+}
+static DECLARE_WORK(rebuild_domains_work, rebuild_domains_callback);

-done:
- kfree(csa);
- /* Don't kfree(doms) -- partition_sched_domains() does that. */
- /* Don't kfree(dattr) -- partition_sched_domains() does that. */
+/*
+ * Internal helper for rebuilding sched domains when something changes.
+ *
+ * If the flag 'sched_load_balance' of any cpuset with non-empty
+ * 'cpus' changes, or if the 'cpus' allowed changes in any cpuset
+ * which has that flag enabled, or if any cpuset with a non-empty
+ * 'cpus' is removed, then call this routine to rebuild the
+ * scheduler's dynamic sched domains.
+ *
+ * rebuild_sched_domains() must be called under get_online_cpus() and
+ * it needs to take cgroup_lock(). But most of the cpuset code is already
+ * holding cgroup_lock(). In order to avoid incorrect lock nesting we
+ * delegate the actual domain rebuilding to the workqueue.
+ */
+static void async_rebuild_sched_domains(void)
+{
+ queue_work(cpuset_wq, &rebuild_domains_work);
}

/**
@@ -863,7 +919,7 @@ static int update_cpumask(struct cpuset *cs, const char *buf)
return retval;

if (is_load_balanced)
- rebuild_sched_domains();
+ async_rebuild_sched_domains();
return 0;
}

@@ -1090,7 +1146,7 @@ static int update_relax_domain_level(struct cpuset *cs, s64 val)
if (val != cs->relax_domain_level) {
cs->relax_domain_level = val;
if (!cpus_empty(cs->cpus_allowed) && is_sched_load_balance(cs))
- rebuild_sched_domains();
+ async_rebuild_sched_domains();
}

return 0;
@@ -1131,7 +1187,7 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
mutex_unlock(&callback_mutex);

if (cpus_nonempty && balance_flag_changed)
- rebuild_sched_domains();
+ async_rebuild_sched_domains();

return 0;
}
@@ -1492,6 +1548,9 @@ static u64 cpuset_read_u64(struct cgroup *cont, struct cftype *cft)
default:
BUG();
}
+
+ /* Unreachable but makes gcc happy */
+ return 0;
}

static s64 cpuset_read_s64(struct cgroup *cont, struct cftype *cft)
@@ -1504,6 +1563,9 @@ static s64 cpuset_read_s64(struct cgroup *cont, struct cftype *cft)
default:
BUG();
}
+
+ /* Unreachable but makes gcc happy */
+ return 0;
}


@@ -1692,15 +1754,9 @@ static struct cgroup_subsys_state *cpuset_create(
}

/*
- * Locking note on the strange update_flag() call below:
- *
* If the cpuset being removed has its flag 'sched_load_balance'
* enabled, then simulate turning sched_load_balance off, which
- * will call rebuild_sched_domains(). The get_online_cpus()
- * call in rebuild_sched_domains() must not be made while holding
- * callback_mutex. Elsewhere the kernel nests callback_mutex inside
- * get_online_cpus() calls. So the reverse nesting would risk an
- * ABBA deadlock.
+ * will call rebuild_sched_domains().
*/

static void cpuset_destroy(struct cgroup_subsys *ss, struct cgroup *cont)
@@ -1811,7 +1867,7 @@ static void move_member_tasks_to_cpuset(struct cpuset *from, struct cpuset *to)
}

/*
- * If common_cpu_mem_hotplug_unplug(), below, unplugs any CPUs
+ * If CPU and/or memory hotplug handlers, below, unplug any CPUs
* or memory nodes, we need to walk over the cpuset hierarchy,
* removing that CPU or node from all cpusets. If this removes the
* last CPU or node from a cpuset, then move the tasks in the empty
@@ -1903,35 +1959,6 @@ static void scan_for_empty_cpusets(const struct cpuset *root)
}

/*
- * The cpus_allowed and mems_allowed nodemasks in the top_cpuset track
- * cpu_online_map and node_states[N_HIGH_MEMORY]. Force the top cpuset to
- * track what's online after any CPU or memory node hotplug or unplug event.
- *
- * Since there are two callers of this routine, one for CPU hotplug
- * events and one for memory node hotplug events, we could have coded
- * two separate routines here. We code it as a single common routine
- * in order to minimize text size.
- */
-
-static void common_cpu_mem_hotplug_unplug(int rebuild_sd)
-{
- cgroup_lock();
-
- top_cpuset.cpus_allowed = cpu_online_map;
- top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];
- scan_for_empty_cpusets(&top_cpuset);
-
- /*
- * Scheduler destroys domains on hotplug events.
- * Rebuild them based on the current settings.
- */
- if (rebuild_sd)
- rebuild_sched_domains();
-
- cgroup_unlock();
-}
-
-/*
* The top_cpuset tracks what CPUs and Memory Nodes are online,
* period. This is necessary in order to make cpusets transparent
* (of no affect) on systems that are actively using CPU hotplug
@@ -1940,39 +1967,48 @@ static void common_cpu_mem_hotplug_unplug(int rebuild_sd)
* This routine ensures that top_cpuset.cpus_allowed tracks
* cpu_online_map on each CPU hotplug (cpuhp) event.
*/
-
-static int cpuset_handle_cpuhp(struct notifier_block *unused_nb,
+static int cpuset_track_online_cpus(struct notifier_block *unused_nb,
unsigned long phase, void *unused_cpu)
{
+ struct sched_domain_attr *attr;
+ cpumask_t *doms;
+ int ndoms;
+
switch (phase) {
- case CPU_UP_CANCELED:
- case CPU_UP_CANCELED_FROZEN:
- case CPU_DOWN_FAILED:
- case CPU_DOWN_FAILED_FROZEN:
case CPU_ONLINE:
case CPU_ONLINE_FROZEN:
case CPU_DEAD:
case CPU_DEAD_FROZEN:
- common_cpu_mem_hotplug_unplug(1);
break;
+
default:
return NOTIFY_DONE;
}

+ cgroup_lock();
+ top_cpuset.cpus_allowed = cpu_online_map;
+ scan_for_empty_cpusets(&top_cpuset);
+ ndoms = generate_sched_domains(&doms, &attr);
+ cgroup_unlock();
+
+ /* Have scheduler rebuild the domains */
+ partition_sched_domains(ndoms, doms, attr);
+
return NOTIFY_OK;
}

#ifdef CONFIG_MEMORY_HOTPLUG
/*
* Keep top_cpuset.mems_allowed tracking node_states[N_HIGH_MEMORY].
- * Call this routine anytime after you change
- * node_states[N_HIGH_MEMORY].
- * See also the previous routine cpuset_handle_cpuhp().
+ * Call this routine anytime after node_states[N_HIGH_MEMORY] changes.
+ * See also the previous routine cpuset_track_online_cpus().
*/
-
void cpuset_track_online_nodes(void)
{
- common_cpu_mem_hotplug_unplug(0);
+ cgroup_lock();
+ top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];
+ scan_for_empty_cpusets(&top_cpuset);
+ cgroup_unlock();
}
#endif

@@ -1987,7 +2023,10 @@ void __init cpuset_init_smp(void)
top_cpuset.cpus_allowed = cpu_online_map;
top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];

- hotcpu_notifier(cpuset_handle_cpuhp, 0);
+ hotcpu_notifier(cpuset_track_online_cpus, 0);
+
+ cpuset_wq = create_singlethread_workqueue("cpuset");
+ BUG_ON(!cpuset_wq);
}

/**
--
1.5.5.1


2008-08-02 11:39:17

by Paul Jackson

Subject: Re: [PATCH] cpuset: Rework sched domains and CPU hotplug handling (2.6.27-rc1)

The Subject line suggests that this latest version of your patch should
apply on top of 2.6.27-rc1. That does not seem to be quite accurate.

As stated in your previous message, it seems you're working on top of
Linus's latest git version, which, at least for the source file
kernel/cpuset.c, is (I suppose) more or less the same as 2.6.27-rc1
plus the 'origin.patch' that is at the top of Andrew's broken out quilt
patch series 2.6.27-rc1-mm1.

That origin.patch includes Li Zefan's patch:

cpuset: clean up cpuset hierarchy traversal code

which Andrew recently sent along to Linus, and which patch is presumed
by this latest version of your patch.

I'll take a further look at your patch now.

We have lots of trees, including various from Ingo, linux-next, Andrew,
and Linus, each in many versions. It helps others if you can state
exactly what tree/version a patch applies to, in each patch version
that you submit.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214

2008-08-02 16:32:50

by Max Krasnyansky

Subject: Re: [PATCH] cpuset: Rework sched domains and CPU hotplug handling (2.6.27-rc1)



Paul Jackson wrote:
> The Subject line suggests that this latest version of your patch should
> apply on top of 2.6.27-rc1. That does not seem to be quite accurate.
>
> As stated in your previous message, it seems you're working on top of
> Linus's latest git version, which, at least for the source file
> kernel/cpuset.c, is (I suppose) more or less the same as 2.6.27-rc1
> plus the 'origin.patch' that is at the top of Andrew's broken out quilt
> patch series 2.6.27-rc1-mm1.
>
> That origin.patch includes Li Zefan's patch:
>
> cpuset: clean up cpuset hierarchy traversal code
>
> which Andrew recently sent along to Linus, and which patch is presumed
> by this latest version of your patch.
>
> I'll take a further look at your patch now.
>
> We have lots of trees, including various from Ingo, linux-next, Andrew,
> and Linus, each in many versions. It helps others if you can state
> exactly what tree/version a patch applies to, in each patch version
> that you submit.

It's on top of the latest mainline. I should not have put -rc1 in the
subject, sorry.

Max

2008-08-03 03:51:39

by Paul Jackson

Subject: Re: [PATCH] cpuset: Rework sched domains and CPU hotplug handling (2.6.27-rc1)

Max wrote:
> It's on top of the latest mainline

"the latest mainline" ... I guess that means the same
as what I termed "Linus's latest git version" ?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214

2008-08-03 18:07:30

by Max Krasnyansky

Subject: Re: [PATCH] cpuset: Rework sched domains and CPU hotplug handling (2.6.27-rc1)



Paul Jackson wrote:
> Max wrote:
>> It's on top of the latest mainline
>
> "the latest mainline" ... I guess that means the same
> as what I termed "Linus's latest git version" ?
>

Yes

Max

2008-08-04 06:00:47

by Paul Jackson

Subject: Re: [PATCH] cpuset: Rework sched domains and CPU hotplug handling (2.6.27-rc1)

So far as I can tell, the logic and design are ok.

I found some of the comments, function names and code factoring
to be confusing or incomplete. And I suspect that the rebuilding
of sched domains in the case that sched_power_savings_store()
is called in kernel/sched.c, on systems using cpusets, is not
needed or desirable (I could easily be wrong here - beware!).

I'm attaching a patch that has the changes that I would like
to suggest for your consideration. I have only recompiled this
patch, for one configuration - x86_64 with CPUSETS. I am hoping,
Max, that you can complete the testing.

The patch below applies to 2.6.27-rc1, plus the first patch,
"origin.patch" in Andrew's 2.6.27-rc1-mm1 quilt patch stack,
plus your (Max's) latest:
[PATCH] cpuset: Rework sched domains and CPU hotplug handling (2.6.27-rc1)

Here's a description of most of what I noticed:

1) Unrelated spacing tweak, changing:
LIST_HEAD(q); /* queue of cpusets to be scanned*/
to:
LIST_HEAD(q); /* queue of cpusets to be scanned */

2) The phrasing:
Must be called with cgroup_lock held.
seems imprecise to me; "cgroup_lock" is the name of a wrapper
function, not of a lock. The underlying lock is cgroup_mutex,
which is still mentioned by name in various kernel/cpuset.c
comments, even though cgroup_mutex is static in kernel/cgroup.c
So I fiddled with the wording of this phrasing. (A short sketch of
that wrapper follows this list.)

3) You removed the paragraph:
* ... May take callback_mutex during
* call due to the kfifo_alloc() and kmalloc() calls. May nest
* a call to the get_online_cpus()/put_online_cpus() pair.
* Must not be called holding callback_mutex, because we must not
* call get_online_cpus() while holding callback_mutex. Elsewhere
* the kernel nests callback_mutex inside get_online_cpus() calls.
* So the reverse nesting would risk an ABBA deadlock.

But you didn't replace it with an updated locking description.
I expanded and tweaked various locking comments.

4) The alignment of the right side of consecutive assignment statements,
as in:
ndoms = 0;
doms = NULL;
dattr = NULL;
csa = NULL;
or
*domains = doms;
*attributes = dattr;
is not usually done in kernel code. I suggest obeying convention,
and not aligning these here either.

5) The rebuilding of sched domains in the sched_power_savings_store()
routine in kernel/sched.c struck me as inappropriate for systems
that were managing sched domains using cpusets. So I made that
sched.c rebuild only apply if CONFIG_CPUSETS was not configured,
which in turn enabled me to remove rebuild_sched_domains() from
being externally visible outside kernel/cpuset.c

I don't know if this change is correct, however.

6) The renaming of rebuild_sched_domains() to generate_sched_domains()
was only partial, and along with the added similarly named routines
seemed confusing to me. Also, that rename of rebuild_sched_domains()
was only partially accomplished, not being done in some comments
and in one printk kernel warning.

I reverted that rename.

I also reduced by one the number of functions needed to handle
the asynchronous invocation of this rebuild, essentially collapsing
your routines rebuild_sched_domains() and rebuild_domains_callback()
into a single routine, named rebuild_sched_domains_thread().

Thanks to the above change (5), the naming and structure of these
routines was no longer visible outside kernel/cpuset.c, making
this collapsing of two functions into one easier.
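
As a side note on point 2 above: cgroup_lock() really is just a thin
wrapper. From memory (so treat this as a sketch rather than a quote of
kernel/cgroup.c), it amounts to:

	void cgroup_lock(void)
	{
		mutex_lock(&cgroup_mutex);
	}

	void cgroup_unlock(void)
	{
		mutex_unlock(&cgroup_mutex);
	}

which is why "call with cgroup_lock held" names the wrapper rather
than the underlying cgroup_mutex.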

---
include/linux/cpuset.h | 7 ----
kernel/cpuset.c | 73 ++++++++++++++++++++++++-------------------------
kernel/sched.c | 18 ++++--------
3 files changed, 43 insertions(+), 55 deletions(-)

--- 2.6.27-rc1-mm1.orig/include/linux/cpuset.h 2008-08-02 03:51:59.000000000 -0700
+++ 2.6.27-rc1-mm1/include/linux/cpuset.h 2008-08-03 19:24:40.306964316 -0700
@@ -78,8 +78,6 @@ extern void cpuset_track_online_nodes(vo

extern int current_cpuset_is_being_rebound(void);

-extern void rebuild_sched_domains(void);
-
#else /* !CONFIG_CPUSETS */

static inline int cpuset_init_early(void) { return 0; }
@@ -158,11 +156,6 @@ static inline int current_cpuset_is_bein
return 0;
}

-static inline void rebuild_sched_domains(void)
-{
- partition_sched_domains(0, NULL, NULL);
-}
-
#endif /* !CONFIG_CPUSETS */

#endif /* _LINUX_CPUSET_H */
--- 2.6.27-rc1-mm1.orig/kernel/cpuset.c 2008-08-02 09:42:56.000000000 -0700
+++ 2.6.27-rc1-mm1/kernel/cpuset.c 2008-08-03 21:55:31.983690126 -0700
@@ -482,7 +482,7 @@ static int validate_change(const struct
}

/*
- * Helper routine for generate_sched_domains().
+ * Helper routine for rebuild_sched_domains().
* Do cpusets a, b have overlapping cpus_allowed masks?
*/
static int cpusets_overlap(struct cpuset *a, struct cpuset *b)
@@ -526,11 +526,12 @@ update_domain_attr_tree(struct sched_dom
}

/*
- * generate_sched_domains()
+ * rebuild_sched_domains()
*
* This function builds a partial partition of the systems CPUs
* A 'partial partition' is a set of non-overlapping subsets whose
* union is a subset of that set.
+ *
* The output of this function needs to be passed to kernel/sched.c
* partition_sched_domains() routine, which will rebuild the scheduler's
* load balancing domains (sched domains) as specified by that partial
@@ -579,10 +580,10 @@ update_domain_attr_tree(struct sched_dom
* element of the partition (one sched domain) to be passed to
* partition_sched_domains().
*/
-static int generate_sched_domains(cpumask_t **domains,
+static int rebuild_sched_domains(cpumask_t **domains,
struct sched_domain_attr **attributes)
{
- LIST_HEAD(q); /* queue of cpusets to be scanned*/
+ LIST_HEAD(q); /* queue of cpusets to be scanned */
struct cpuset *cp; /* scans q */
struct cpuset **csa; /* array of all cpuset ptrs */
int csn; /* how many cpuset ptrs in csa so far */
@@ -593,9 +594,9 @@ static int generate_sched_domains(cpumas
int nslot; /* next empty doms[] cpumask_t slot */

ndoms = 0;
- doms = NULL;
+ doms = NULL;
dattr = NULL;
- csa = NULL;
+ csa = NULL;

/* Special case for the 99% of systems with one, full, sched domain */
if (is_sched_load_balance(&top_cpuset)) {
@@ -733,49 +734,41 @@ restart:
done:
kfree(csa);

- *domains = doms;
+ *domains = doms;
*attributes = dattr;
return ndoms;
}

/*
- * Rebuilds scheduler domains. See generate_sched_domains() description
- * for details.
+ * Rebuild scheduler domains, called from rebuild_sched_domains_work
+ * work queue.
+ *
+ * Call with neither cgroup_mutex held nor within get_online_cpus().
+ * Takes both cgroup_mutex and get_online_cpus().
*
- * Must be called under get_online_cpus(). This is an external interface
- * and must not be used inside the cpuset code to avoid circular locking
- * dependency. cpuset code should use async_rebuild_sched_domains() instead.
+ * Cannot be directly called from cpuset code handling changes
+ * to the cpuset pseudo-filesystem, because it cannot be called
+ * from code that already holds cgroup_mutex.
*/
-void rebuild_sched_domains(void)
+static void rebuild_sched_domains_thread(struct work_struct *unused)
{
struct sched_domain_attr *attr;
cpumask_t *doms;
int ndoms;

- /*
- * We have to iterate the cgroup hierarchy, which requires the cgroup lock.
- */
+ get_online_cpus();
cgroup_lock();
- ndoms = generate_sched_domains(&doms, &attr);
+ ndoms = rebuild_sched_domains(&doms, &attr);
cgroup_unlock();

/* Have scheduler rebuild the domains */
partition_sched_domains(ndoms, doms, attr);
-}
-
-/*
- * Simply calls rebuild_sched_domains() with correct locking rules.
- */
-static void rebuild_domains_callback(struct work_struct *work)
-{
- get_online_cpus();
- rebuild_sched_domains();
put_online_cpus();
}
-static DECLARE_WORK(rebuild_domains_work, rebuild_domains_callback);
+static DECLARE_WORK(rebuild_sched_domains_work, rebuild_sched_domains_thread);

/*
- * Internal helper for rebuilding sched domains when something changes.
+ * Rebuild scheduler domains, asynchronously in a separate thread.
*
* If the flag 'sched_load_balance' of any cpuset with non-empty
* 'cpus' changes, or if the 'cpus' allowed changes in any cpuset
@@ -783,14 +776,19 @@ static DECLARE_WORK(rebuild_domains_work
* 'cpus' is removed, then call this routine to rebuild the
* scheduler's dynamic sched domains.
*
- * rebuild_sched_domains() must be called under get_online_cpus() and
- * it needs to take cgroup_lock(). But most of the cpuset code is already
- * holding cgroup_lock(). In order to avoid incorrect lock nesting we
- * delegate the actual domain rebuilding to the workqueue.
+ * The rebuild_sched_domains() and partition_sched_domains()
+ * routines must nest cgroup_lock() inside get_online_cpus(),
+ * but such cpuset changes as these must nest that locking the
+ * other way, holding cgroup_lock() for much of the code.
+ *
+ * So in order to avoid an ABBA deadlock, the cpuset code handling
+ * these user changes delegates the actual sched domain rebuilding
+ * to a separate workqueue thread, which ends up processing the
+ * above rebuild_sched_domains_thread() function.
*/
static void async_rebuild_sched_domains(void)
{
- queue_work(cpuset_wq, &rebuild_domains_work);
+ queue_work(cpuset_wq, &rebuild_sched_domains_work);
}

/**
@@ -1756,7 +1754,7 @@ static struct cgroup_subsys_state *cpuse
/*
* If the cpuset being removed has its flag 'sched_load_balance'
* enabled, then simulate turning sched_load_balance off, which
- * will call rebuild_sched_domains().
+ * will call async_rebuild_sched_domains().
*/

static void cpuset_destroy(struct cgroup_subsys *ss, struct cgroup *cont)
@@ -1775,7 +1773,7 @@ static void cpuset_destroy(struct cgroup
struct cgroup_subsys cpuset_subsys = {
.name = "cpuset",
.create = cpuset_create,
- .destroy = cpuset_destroy,
+ .destroy = cpuset_destroy,
.can_attach = cpuset_can_attach,
.attach = cpuset_attach,
.populate = cpuset_populate,
@@ -1966,6 +1964,9 @@ static void scan_for_empty_cpusets(const
*
* This routine ensures that top_cpuset.cpus_allowed tracks
* cpu_online_map on each CPU hotplug (cpuhp) event.
+ *
+ * Called within get_online_cpus(). Needs to call cgroup_lock()
+ * before calling rebuild_sched_domains().
*/
static int cpuset_track_online_cpus(struct notifier_block *unused_nb,
unsigned long phase, void *unused_cpu)
@@ -1988,7 +1989,7 @@ static int cpuset_track_online_cpus(stru
cgroup_lock();
top_cpuset.cpus_allowed = cpu_online_map;
scan_for_empty_cpusets(&top_cpuset);
- ndoms = generate_sched_domains(&doms, &attr);
+ ndoms = rebuild_sched_domains(&doms, &attr);
cgroup_unlock();

/* Have scheduler rebuild the domains */
--- 2.6.27-rc1-mm1.orig/kernel/sched.c 2008-08-02 04:12:13.000000000 -0700
+++ 2.6.27-rc1-mm1/kernel/sched.c 2008-08-03 19:55:40.128044004 -0700
@@ -7645,18 +7645,8 @@ match2:
}

#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
-int arch_reinit_sched_domains(void)
-{
- get_online_cpus();
- rebuild_sched_domains();
- put_online_cpus();
- return 0;
-}
-
static ssize_t sched_power_savings_store(const char *buf, size_t count, int smt)
{
- int ret;
-
if (buf[0] != '0' && buf[0] != '1')
return -EINVAL;

@@ -7665,9 +7655,13 @@ static ssize_t sched_power_savings_store
else
sched_mc_power_savings = (buf[0] == '1');

- ret = arch_reinit_sched_domains();
+#ifndef CONFIG_CPUSETS
+ get_online_cpus();
+ partition_sched_domains(0, NULL, NULL);
+ put_online_cpus();
+#endif

- return ret ? ret : count;
+ return count;
}

#ifdef CONFIG_SCHED_MC


--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214

2008-08-04 22:11:29

by Max Krasnyansky

Subject: Re: [PATCH] cpuset: Rework sched domains and CPU hotplug handling (2.6.27-rc1)

Paul Jackson wrote:
> So far as I can tell, the logic and design are ok.
>
> I found some of the comments, function names and code factoring
> to be confusing or incomplete. And I suspect that the rebuilding
> of sched domains in the case that sched_power_savings_store()
> is called in kernel/sched.c, on systems using cpusets, is not
> needed or desirable (I could easily be wrong here - beware!).
Actually we do need it. See below.

> I'm attaching a patch that has the changes that I would like
> to suggest for your consideration. I have only recompiled this
> patch, for one configuration - x86_64 with CPUSETS. I am hoping,
> Max, that you can complete the testing.
>
> The patch below applies to 2.6.27-rc1, plus the first patch,
> "origin.patch" in Andrew's 2.6.27-rc1-mm1 quilt patch stack,
> plus your (Max's) latest:
> [PATCH] cpuset: Rework sched domains and CPU hotplug handling (2.6.27-rc1)
>
> Here's a description of most of what I noticed:
>
> 1) Unrelated spacing tweak, changing:
> LIST_HEAD(q); /* queue of cpusets to be scanned*/
> to:
> LIST_HEAD(q); /* queue of cpusets to be scanned */
Check.

> 2) The phrasing:
> Must be called with cgroup_lock held.
> seems imprecise to me; "cgroup_lock" is the name of a wrapper
> function, not of a lock. The underlying lock is cgroup_mutex,
> which is still mentioned by name in various kernel/cpuset.c
> comments, even though cgroup_mutex is static in kernel/cgroup.c
> So I fiddled with the wording of this phrasing.
Check.

> 3) You removed the paragraph:
> * ... May take callback_mutex during
> * call due to the kfifo_alloc() and kmalloc() calls. May nest
> * a call to the get_online_cpus()/put_online_cpus() pair.
> * Must not be called holding callback_mutex, because we must not
> * call get_online_cpus() while holding callback_mutex. Elsewhere
> * the kernel nests callback_mutex inside get_online_cpus() calls.
> * So the reverse nesting would risk an ABBA deadlock.
>
> But you didn't replace it with an updated locking description.
> I expanded and tweaked various locking comments.
I think it's covered by the top-level comment that starts with
/*
* There are two global mutexes guarding cpuset structures. The first
* is the main control groups cgroup_mutex, accessed via
* cgroup_lock()/cgroup_unlock().
....

But I'm ok with your suggested version.

> 4) The alignment of the right side of consecutive assignment statements,
> as in:
> ndoms = 0;
> doms = NULL;
> dattr = NULL;
> csa = NULL;
> or
> *domains = doms;
> *attributes = dattr;
> is not usually done in kernel code. I suggest obeying convention,
> and not aligning these here either.
Check.

> 5) The rebuilding of sched domains in the sched_power_savings_store()
> routine in kernel/sched.c struck me as inappropriate for systems
> that were managing sched domains using cpusets. So I made that
> sched.c rebuild only apply if CONFIG_CPUSETS was not configured,
> which in turn enabled me to remove rebuild_sched_domains() from
> being externally visible outside kernel/cpuset.c
>
> I don't know if this change is correct, however.
Actually it is appropriate, and there is one more user of the
arch_reinit_sched_domains() which is S390 topology updates.
Those things (mc_power and topology updates) have to update domain flags based
on the mc/smt power and current topology settings.
This is done in
__rebuild_sched_domains()
...
SD_INIT(sd, ALLNODES);
...
SD_INIT(sd, MC);
...

SD_INIT(sd, X) uses one of the SD initializers defined in include/linux/topology.h.
For example, SD_CPU_INIT() includes BALANCE_FOR_PKG_POWER, which expands to

#define BALANCE_FOR_PKG_POWER \
((sched_mc_power_savings || sched_smt_power_savings) ? \
SD_POWERSAVINGS_BALANCE : 0)

Yes, it's kind of convoluted :). Anyway, the point is that we need to rebuild
the domains when those settings change. We could probably write a simpler
version that just iterates existing domains and updates the flags. Maybe some
other day :)
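
Very roughly, such a simpler version might look like the sketch below.
To be clear, this is hypothetical and untested: update_powersavings_flags()
is a name I just made up, and for_each_domain() is local to kernel/sched.c,
so something like this would have to live there:

	/* Hypothetical sketch, not in any tree: refresh only the
	 * power-savings flag on the existing domains instead of
	 * rebuilding them from scratch. Assumes the caller holds
	 * get_online_cpus() so the domains cannot go away. */
	static void update_powersavings_flags(void)
	{
		int cpu;
		struct sched_domain *sd;

		for_each_online_cpu(cpu) {
			for_each_domain(cpu, sd) {
				if (sched_mc_power_savings ||
				    sched_smt_power_savings)
					sd->flags |= SD_POWERSAVINGS_BALANCE;
				else
					sd->flags &= ~SD_POWERSAVINGS_BALANCE;
			}
		}
	}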

> 6) The renaming of rebuild_sched_domains() to generate_sched_domains()
> was only partial, and along with the added similarly named routines
> seemed confusing to me. Also, that rename of rebuild_sched_domains()
> was only partially accomplished, not being done in some comments
> and in one printk kernel warning.
>
> I reverted that rename.
Actually it was not much of a rename. The only rename was
rebuild_sched_domains() -> async_rebuild_sched_domains(), and yes, it looks
like I missed one or two comments.
The other stuff was basically factoring out the function that generates the
domain masks/attrs from the function that actually tells the scheduler to
rebuild them.
I'd be ok with some other name for generate_sched_domains() if you think it's
confusing.

BTW, with your current proposal we have
rebuild_sched_domains() - just generates domain masks/attrs
async_rebuild_sched_domains() - generates domain masks/attrs and actually
rebuilds the domains

That, I think, is more confusing. So I guess my preference would be to have the
generation part separate. And, as I explained above, we do need to be able to
call rebuild_sched_domains() from the scheduler code (at least at this point).

> I also reduced by one the number of functions needed to handle
> the asynchronous invocation of this rebuild, essentially collapsing
> your routines rebuild_sched_domains() and rebuild_domains_callback()
> into a single routine, named rebuild_sched_domains_thread().
>
> Thanks to the above change (5), the naming and structure of these
> routines was no longer visible outside kernel/cpuset.c, making
> this collapsing of two functions into one easier.
Yes if we did not have to call rebuild_sched_domains() from
arch_reinit_sched_domains() I would've done something similar.

Sounds like you're ok with the general approach, and as I mentioned we do need
an externally usable rebuild_sched_domains(). So how about this: I'll update
the style/comments based on your suggestions but keep the generate_sched_domains()
and partition_sched_domains() split. Or if you have a better name for
generate_sched_domains() we'll use that instead.

Max

2008-08-05 03:56:51

by Paul Jackson

Subject: Re: [PATCH] cpuset: Rework sched domains and CPU hotplug handling (2.6.27-rc1)

Max wrote:
> Actually it is appropriate, and there is one more user of the
> arch_reinit_sched_domains() which is S390 topology updates.
> Those things (mc_power and topology updates) have to update domain flags based
> on the mc/smt power and current topology settings.

Hmmm ... ok I suppose.

Could we have the kernel/sched.c code, in this case, call the
kernel/cpuset.c routine async_rebuild_sched_domains(), rather
than the synchronous rebuild_sched_domains() call (in your naming)
which required details of the get_online_cpus() and put_online_cpus()
wrapping to leak into kernel/sched.c:arch_reinit_sched_domains()
routine?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214

2008-08-05 20:31:38

by Max Krasnyansky

Subject: Re: [PATCH] cpuset: Rework sched domains and CPU hotplug handling (2.6.27-rc1)

Paul Jackson wrote:
> Max wrote:
>> Actually it is appropriate, and there is one more user of the
>> arch_reinit_sched_domains() which is S390 topology updates.
>> Those things (mc_power and topology updates) have to update domain flags based
>> on the mc/smt power and current topology settings.
>
> Hmmm ... ok I suppose.
>
> Could we have the kernel/sched.c code, in this case, call the
> kernel/cpuset.c routine async_rebuild_sched_domains(), rather
> than the synchronous rebuild_sched_domains() call (in your naming)
> which required details of the get_online_cpus() and put_online_cpus()
> wrapping to leak into kernel/sched.c:arch_reinit_sched_domains()
> routine?

It could, I guess. But the question is why?
I mean, the only reason we've introduced the workqueue is because lock
nesting got too complicated. Otherwise in all those paths we're already
in a process context and there is no need to schedule a workqueue.

I see your point about the get_online_cpus() thing. But it is similar to
partition_sched_domains() which is an external (from the sched pov)
interface and must be called within get_online_cpus() ... put_online_cpus().

Max

2008-08-05 23:07:16

by Paul Jackson

Subject: Re: [PATCH] cpuset: Rework sched domains and CPU hotplug handling (2.6.27-rc1)

Max wrote:
> It could, I guess. But the question is why?
> I mean, the only reason we've introduced the workqueue is because lock
> nesting got too complicated. Otherwise in all those paths we're already
> in a process context and there is no need to schedule a workqueue.

My priorities are different than yours.

You look at a code path that, in some cases, is more involved than
necessary, and recommend providing an alternative code path, for
the cases that can get by executing (significantly) fewer CPU cycles.

I look at the number of code paths, lines of code, duplicated code
and number of tests and conditions in the source code, and ask how
to reduce them. I want the least amount of source code, the fewest
alternative paths through that code, the smallest number of logical
tests and code variants, the least amount of code duplication.

The key in this case is that I prefer having just one code path by
which all rebuilds of sched domains are done. Since that rebuild must
be asynchronous in some cases (to avoid ABBA lock nesting problems),
let all sched domain rebuilds be done by that async path.

The fewer the number of code path variants, the easier it is to
understand the code, and the harder it is to break the code.

Except for frequently executed code paths, which this surely is not,
minimizing software maintenance costs is far more important to me than
minimizing CPU cycles.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214

2008-08-06 03:24:49

by Max Krasnyansky

Subject: Re: [PATCH] cpuset: Rework sched domains and CPU hotplug handling (2.6.27-rc1)

Paul Jackson wrote:
> Max wrote:
>> It could, I guess. But the question is why?
>> I mean, the only reason we've introduced the workqueue is because lock
>> nesting got too complicated. Otherwise in all those paths we're already
>> in a process context and there is no need to schedule a workqueue.
>
> My priorities are different than yours.
>
> You look at a code path that, in some cases, is more involved than
> necessary, and recommend providing an alternative code path, for
> the cases that can get by executing (significantly) fewer CPU cycles.
>
> I look at the number of code paths, lines of code, duplicated code
> and number of tests and conditions in the source code, and ask how
> to reduce them. I want the least amount of source code, the fewest
> alternative paths through that code, the smallest number of logical
> tests and code variants, the least amount of code duplication.
>
> The key in this case is that I prefer having just one code path by
> which all rebuilds of sched domains are done. Since that rebuild must
> be asynchronous in some cases (to avoid ABBA lock nesting problems),
> let all sched domain rebuilds be done by that async path.
>
> The fewer the number of code path variants, the easier it is to
> understand the code, and the harder it is to break the code.
>
> Except for frequently executed code paths, which this surely is not,
> minimizing software maintenance costs is far more important to me than
> minimizing CPU cycles.

Actually I was not really trying to minimize cpu cycles or anything. But it
does seem like overkill to schedule a workqueue just because we do not
want to expose the fact that rebuild_sched_domains() depends on
get_online_cpus().

More importantly though if you think about it I'm actually not introducing
any new alternative paths (besides async_rebuild_sched_domains() itself).
The code paths in question have been synchronous since day one, and have been
tested that way. Now you're saying let's make them asynchronous. Peter already
had a concern about async rebuilds from within the cpuset itself. I think those
are fine but I definitely do not want to make domain rebuilds in the cpu hotplug
path asynchronous.

So I'd argue that async_rebuild_sched_domains() is the new path which we have to
make async due to lock nesting issues. The other paths (cpu hotplug, topology
updates, SMT and MC power settings updates) are already synchronous and have
no lock nesting issues and therefore do not need to change (async vs sync).
Also IMO it makes sense to keep both CONFIG_CPUSETS and !CONFIG_CPUSETS
handling as similar as possible to avoid surprises. With your proposal, when
cpusets are disabled arch_reinit_sched_domains() will be synchronous; when
they are enabled it will be asynchronous.

If you feel strongly about this we can discuss this some more and try it out.
But I really do not see those maintenance savings in this case. We'd actually
be changing more things.

Max

2008-08-06 03:29:37

by Paul Jackson

Subject: Re: [PATCH] cpuset: Rework sched domains and CPU hotplug handling (2.6.27-rc1)

Max wrote:
> But it does seem like overkill to schedule a workqueue ...

"overkill" -- by what metric?

If something is overkill, it means something is excessive.
Excess (and deficiency) occur along some scale, some metric.

If not by CPU cycles, then by what metric is it overkill?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214

2008-08-06 03:53:33

by Max Krasnyansky

Subject: Re: [PATCH] cpuset: Rework sched domains and CPU hotplug handling (2.6.27-rc1)

Paul Jackson wrote:
> Max wrote:
>> But it does seem like overkill to schedule a workqueue ...
>
> "overkill" -- by what metric?
>
> If something is overkill, it means something is excessive.
> Excess (and deficiency) occur along some scale, some metric.
>
> If not by CPU cycles, then by what metric is it overkill?
>

Paul, I think you're focusing on the wrong part of my reply ;-).
You are, of course, correct about the overkill metric. I was saying
that saving cycles in that path was not my goal when I wrote
the patch. I just added the necessary functions to make the existing paths
work the way they currently do, and only changed one path because
it had to change due to lock nesting issues.
Along the way I simply pointed out that scheduling work unnecessarily
does seem less efficient. That was not the important part though :)

Max

2008-08-06 04:29:19

by Paul Jackson

Subject: Re: [PATCH] cpuset: Rework sched domains and CPU hotplug handling (2.6.27-rc1)

Max wrote:
> ... the wrong part of my reply ...

Was the right part where you wrote:
> We'd actually be changing more things.

I also don't care that much how much code is changed;
if it must be changed, then change it to what is best,
which may not necessarily be the minimum change.

If an asynchronous sched domain rebuild is needed in
some places, then consider just using it everywhere,
rather than providing two code paths to rebuild, one
sync and one async.

Ask not how we got here; ask where we should be.

In particular, and in this specific case, having the
dual code paths really did make it a little more difficult
for me to understand the code, as evidenced by the back
and forth discussion on how to name the confusingly
similar functions. Such naming controversies are usually
a sign of code duplication or improper factoring.

That additional difficulty in understanding this code
was a key factor in delaying my review of your code.
I'd look at it, my mind would glaze over, and I'd put
it aside for a few days. This happened repeatedly.

Fine code is like fine art ... spare, elegant, compelling
in its expression.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214

2008-08-06 05:04:05

by Max Krasnyansky

Subject: Re: [PATCH] cpuset: Rework sched domains and CPU hotplug handling (2.6.27-rc1)

Paul Jackson wrote:
> Max wrote:
>> ... the wrong part of my reply ...
>
> Was the right part where you wrote:
>> We'd actually be changing more things.
>
> I also don't care that much how much code is changed;
> if it must be changed, then change it to what is best,
> which may not necessarily be the minimum change.
Sure. What I'm saying is that I do not think it's best
to change all the paths to be async.

> If an asynchronous sched domain rebuild is needed in
> some places, then consider just using it everywhere,
> rather than providing two code paths to rebuild, one
> sync and one async.
I still do not see a good reason why. IMO exceptions are acceptable.
Only domain rebuilds triggered by cpuset fs writes need to be async.
I do not see a good technical reason why the rest needs to be converted
and retested.

> Ask not how we got here; ask where we should be.
>
> In particular, and in this specific case, having the
> dual code paths really did make it a little more difficult
> for me to understand the code, as evidenced by the back
> and forth discussion on how to name the confusingly
> similar functions. Such naming controversies are usually
> a sign of code duplication or improper factoring.
I'm not sure what you're referring to. There was no back and forth.
You suggested reverting the rename, and I pointed out that it was not
a rename; I simply factored out the part that generates the
masks. That was it.

> That additional difficulty in understanding this code
> was a key factor in delaying my review of your code.
> I'd look at it, my mind was glaze over, and I'd put
> it aside for a few days. This happened repeatedly.
>
> Fine code is like fine art ... spare, elegant, compelling
> in its expression.
To be fair the fact that you had trouble understanding my code does
not automatically mean that it was not artistic ;-).

Max

2008-08-06 05:46:42

by Paul Jackson

Subject: Re: [PATCH] cpuset: Rework sched domains and CPU hotplug handling (2.6.27-rc1)

Max, replying to pj:
> > ... Such naming controversies are usually
> > a sign of code duplication or improper factoring.
> I'm not sure what you're referring to. There was no back and forth.

I was referring to an earlier discussion, which resulted in a patch
which includes the line:

__rebuild_sched_domains() has been renamed to async_rebuild_sched_domains().

> What I'm saying is that I do not think it's best
> to change all the paths to be async.

But you seemed to be saying that the reason it was not
best to do so was that it was best not to change more
than necessary.

Perhaps I still don't understand your metric. Apparently
it is not CPU cycles, nor "least change", but something else?

> I still do not see a good reason why. IMO exceptions are acceptable.
> Only domain rebuilds triggered by cpuset fs writes need to be async.
> I do not see a good technical reason why the rest needs to be converted
> and retested.

Well, your metric seems clear enough there - minimize effort of
code conversion and testing.


How about this ... two routines quite identical and parallel,
even in their names, except that one is async and the other not:

==================================================================

/*
* Rebuild scheduler domains, asynchronously in a separate thread.
*
* If the flag 'sched_load_balance' of any cpuset with non-empty
* 'cpus' changes, or if the 'cpus' allowed changes in any cpuset
* which has that flag enabled, or if any cpuset with a non-empty
* 'cpus' is removed, then call this routine to rebuild the
* scheduler's dynamic sched domains.
*
* The rebuild_sched_domains() and partition_sched_domains()
* routines must nest cgroup_lock() inside get_online_cpus(),
* but such cpuset changes as these must nest that locking the
* other way, holding cgroup_lock() for much of the code.
*
* So in order to avoid an ABBA deadlock, the cpuset code handling
* these user changes delegates the actual sched domain rebuilding
* to a separate workqueue thread, which ends up processing the
* above rebuild_sched_domains_thread() function.
*/
static void async_rebuild_sched_domains(void)
{
queue_work(cpuset_wq, &rebuild_sched_domains_work);
}

/*
* Accomplishes the same scheduler domain rebuild as the above
* async_rebuild_sched_domains(), however it directly calls the
* rebuild routine inline, rather than calling it via a separate
* asynchronous work thread.
*
* This can only be called from code that is not holding
* cgroup_mutex (not nested in a cgroup_lock() call.)
*/
void inline_rebuild_sched_domains(void)
{
rebuild_sched_domains_thread(NULL);
}

==================================================================

> To be fair the fact that you had trouble understanding my code does
> not automatically mean that it was not artistic ;-).

Quite so ... my mental capacities are modest and easily distracted ;).

Likely this explains in part why I fuss so much over keeping code
straightforward, with minimal twists and turns and minimal duplication
with non-essential variations in detail.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214

2008-08-06 20:22:07

by Max Krasnyansky

Subject: Re: [PATCH] cpuset: Rework sched domains and CPU hotplug handling (2.6.27-rc1)

Paul Jackson wrote:
> How about this ... two routines quite identical and parallel,
> even in their names, except that one is async and the other not:
>
> ==================================================================
>
> /*
> * Rebuild scheduler domains, asynchronously in a separate thread.
> *
> * If the flag 'sched_load_balance' of any cpuset with non-empty
> * 'cpus' changes, or if the 'cpus' allowed changes in any cpuset
> * which has that flag enabled, or if any cpuset with a non-empty
> * 'cpus' is removed, then call this routine to rebuild the
> * scheduler's dynamic sched domains.
> *
> * The rebuild_sched_domains() and partition_sched_domains()
> * routines must nest cgroup_lock() inside get_online_cpus(),
> * but such cpuset changes as these must nest that locking the
> * other way, holding cgroup_lock() for much of the code.
> *
> * So in order to avoid an ABBA deadlock, the cpuset code handling
> * these user changes delegates the actual sched domain rebuilding
> * to a separate workqueue thread, which ends up processing the
> * above rebuild_sched_domains_thread() function.
> */
> static void async_rebuild_sched_domains(void)
> {
> queue_work(cpuset_wq, &rebuild_sched_domains_work);
> }
>
> /*
> * Accomplishes the same scheduler domain rebuild as the above
> * async_rebuild_sched_domains(), however it directly calls the
> * rebuild routine inline, rather than calling it via a separate
> * asynchronous work thread.
> *
> * This can only be called from code that is not holding
> * cgroup_mutex (not nested in a cgroup_lock() call.)
> */
> void inline_rebuild_sched_domains(void)
> {
> rebuild_sched_domains_thread(NULL);
> }
>
> ==================================================================

Sure. That looks fine. Although inline_ will probably be a bit confusing
since one may think that it has something to do with the C 'inline'.
I'd suggest either sync_rebuild_sched_domains() or simply
rebuild_sched_domains(). The latter has the advantage that the patch will
not have to touch the scheduler code.

Let me know your preference and I'll respin the patch.

Max

2008-08-06 20:29:36

by Paul Jackson

Subject: Re: [PATCH] cpuset: Rework sched domains and CPU hotplug handling (2.6.27-rc1)

Max K wrote:
> Sure. That looks fine. Although inline_ will probably be a bit confusing
> since one may think that it has something to do with the C 'inline'.
> I'd suggest either sync_rebuild_sched_domains() or simply
> rebuild_sched_domains().

Good point about "inline" confusions.

I am willing to agree to "sync_rebuild_sched_domains()".

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214

2008-08-06 20:34:00

by Paul Menage

Subject: Re: [PATCH] cpuset: Rework sched domains and CPU hotplug handling (2.6.27-rc1)

On Wed, Aug 6, 2008 at 1:29 PM, Paul Jackson <[email protected]> wrote:
> Max K wrote:
>> Sure. That looks fine. Although inline_ will probably be a bit confusing
>> since one may think that it has something to do with the C 'inline'.
>> I'd suggest either sync_rebuild_sched_domains() or simply
>> rebuild_sched_domains().
>
> Good point about "inline" confusions.
>
> I am willing to agree to "sync_rebuild_sched_domains()".

I'd vote in favour of Max's suggestion of just
rebuild_sched_domains(), to localize the changes better.

Paul

2008-08-06 20:36:47

by Max Krasnyansky

Subject: Re: [PATCH] cpuset: Rework sched domains and CPU hotplug handling (2.6.27-rc1)

Paul Jackson wrote:
> Max K wrote:
>> Sure. That looks fine. Although inline_ will probably be a bit confusing
>> since one may think that it has something to do with the C 'inline'.
>> I'd suggest either sync_rebuild_sched_domains() or simply
>> rebuild_sched_domains().
>
> Good point about "inline" confusions.
>
> I am willing to agree to "sync_rebuild_sched_domains()".

Thank you. I'll go ahead and respin the patch.

Max

2008-08-06 20:57:43

by Paul Jackson

Subject: Re: [PATCH] cpuset: Rework sched domains and CPU hotplug handling (2.6.27-rc1)

Paul M wrote:
> I'd vote in favour of Max's suggestion of just
> rebuild_sched_domains(), to localize the changes better.

I'm outvoted, two to one.

Go for it, Max.

Thanks, Paul M.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214