2013-04-02 07:16:45

by Li Zhong

Subject: [PATCH cpuset] Use rebuild_sched_domains() in cpuset_hotplug_workfn()

In cpuset_hotplug_workfn(), partition_sched_domains() is called without
hotplug lock held, which is actually needed (stated in the function
header of partition_sched_domains()).

This patch tries to use rebuild_sched_domains() to solve the above
issue, and makes the code look a little simpler.

Signed-off-by: Li Zhong <[email protected]>
---
kernel/cpuset.c | 13 ++-----------
1 files changed, 2 insertions(+), 11 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 4f9dfe4..515a713 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -2222,17 +2222,8 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
flush_workqueue(cpuset_propagate_hotplug_wq);

/* rebuild sched domains if cpus_allowed has changed */
- if (cpus_updated) {
- struct sched_domain_attr *attr;
- cpumask_var_t *doms;
- int ndoms;
-
- mutex_lock(&cpuset_mutex);
- ndoms = generate_sched_domains(&doms, &attr);
- mutex_unlock(&cpuset_mutex);
-
- partition_sched_domains(ndoms, doms, attr);
- }
+ if (cpus_updated)
+ rebuild_sched_domains();
}

void cpuset_update_active_cpus(bool cpu_online)


2013-04-03 09:32:49

by Zefan Li

Subject: Re: [PATCH cpuset] Use rebuild_sched_domains() in cpuset_hotplug_workfn()

On 2013/4/2 15:16, Li Zhong wrote:
> In cpuset_hotplug_workfn(), partition_sched_domains() is called without
> hotplug lock held, which is actually needed (stated in the function
> header of partition_sched_domains()).
>
> This patch tries to use rebuild_sched_domains() to solve the above
> issue, and makes the code looks a little simpler.
>

I guess you found this just by code inspection, right?

The change looks fine and safe at first glance, but we don't have
time to review the patch right now.

However, I don't know why partition_sched_domains() has to be called
with the hotplug lock held.

After a quick scan, it seems the only place in partition_sched_domains()
that might need the hotplug lock is arch_update_cpu_topology().

It would be better if you had CCed the scheduler guys... They may
know the answer.

> Signed-off-by: Li Zhong <[email protected]>
> ---
> kernel/cpuset.c | 13 ++-----------
> 1 files changed, 2 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> index 4f9dfe4..515a713 100644
> --- a/kernel/cpuset.c
> +++ b/kernel/cpuset.c
> @@ -2222,17 +2222,8 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
> flush_workqueue(cpuset_propagate_hotplug_wq);
>
> /* rebuild sched domains if cpus_allowed has changed */
> - if (cpus_updated) {
> - struct sched_domain_attr *attr;
> - cpumask_var_t *doms;
> - int ndoms;
> -
> - mutex_lock(&cpuset_mutex);
> - ndoms = generate_sched_domains(&doms, &attr);
> - mutex_unlock(&cpuset_mutex);
> -
> - partition_sched_domains(ndoms, doms, attr);
> - }
> + if (cpus_updated)
> + rebuild_sched_domains();
> }
>

2013-04-08 10:17:06

by Li Zhong

Subject: Re: [PATCH cpuset] Use rebuild_sched_domains() in cpuset_hotplug_workfn()

On Wed, 2013-04-03 at 17:32 +0800, Li Zefan wrote:
> On 2013/4/2 15:16, Li Zhong wrote:
> > In cpuset_hotplug_workfn(), partition_sched_domains() is called without
> > hotplug lock held, which is actually needed (stated in the function
> > header of partition_sched_domains()).
> >
> > This patch tries to use rebuild_sched_domains() to solve the above
> > issue, and makes the code looks a little simpler.
> >
Hi Zefan,

Thanks for your quick review.

> I guess you found this just by code inspection, right?

Ah, yes :)

> The change looks fine and safe at a first glance, but we don't have
> time to review the patch for now.
>
> However I don't know why partition_sched_domains() has to be called
> with hotplug lock held.
>
> After a quick scan, seems the only place in partition_sched_domains()
> that might need hotplug lock is arch_update_cpu_topology().

and cpu_attach_domain(), which is called by partition_sched_domains(),
also states that it needs the hotplug lock to be held.

It seems the code was moved out of the hotplug critical section when it
was converted to a work function.

Thanks, Zhong

>
> It would be better if you had CCed the scheduler guys...They may
> know the answer.
>
> > Signed-off-by: Li Zhong <[email protected]>
> > ---
> > kernel/cpuset.c | 13 ++-----------
> > 1 files changed, 2 insertions(+), 11 deletions(-)
> >
> > diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> > index 4f9dfe4..515a713 100644
> > --- a/kernel/cpuset.c
> > +++ b/kernel/cpuset.c
> > @@ -2222,17 +2222,8 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
> > flush_workqueue(cpuset_propagate_hotplug_wq);
> >
> > /* rebuild sched domains if cpus_allowed has changed */
> > - if (cpus_updated) {
> > - struct sched_domain_attr *attr;
> > - cpumask_var_t *doms;
> > - int ndoms;
> > -
> > - mutex_lock(&cpuset_mutex);
> > - ndoms = generate_sched_domains(&doms, &attr);
> > - mutex_unlock(&cpuset_mutex);
> > -
> > - partition_sched_domains(ndoms, doms, attr);
> > - }
> > + if (cpus_updated)
> > + rebuild_sched_domains();
> > }
> >
>
>

2013-04-09 09:59:45

by Li Zhong

Subject: [RFC PATCH v2 cpuset] Don't pass offlined cpus to partition_sched_domains()

Hi Zefan,

I did some tests today, enabling cpuset and onlining/offlining the cpus.
It caused a NULL pointer dereference in get_group(). After adding some
debugging code, it seems to be caused by passing some offlined cpus
down to partition_sched_domains(). More details below:

===================
following is printed in build_sched_domain():
@@ -6476,6 +6486,14 @@ struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl,
return child;

cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu));
+ if (cpumask_empty(sched_domain_span(sd)))
+ printk("lz empty domain %p, mask %pS\n", sd, tl->mask);
+ char cpulist[128];
+ cpumask_scnprintf(cpulist, sizeof(cpulist), cpu_map);
+ printk("lz cpu_map %s\n", cpulist);
+ cpumask_scnprintf(cpulist, sizeof(cpulist), tl->mask(cpu));
+ printk("lz tl->mask(%d) %pS %s\n",cpu, tl->mask, cpulist);
+
if (child) {
sd->level = child->level + 1;
sched_domain_level_max = max(sched_domain_level_max,
sd->level);

[ 232.714502] lz empty domain c0000001f73c0000, mask cpu_smt_mask+0x0/0x10
[ 232.714515] lz cpu_map 1-2,7,13-15
[ 232.714521] lz tl->mask(1) cpu_smt_mask+0x0/0x10
[ 232.714529] lz cpu_map 1-2,7,13-15
[ 232.714536] lz tl->mask(1) cpu_cpu_mask+0x0/0x10 2,7,13-15
[ 232.714544] lz cpu_map 1-2,7,13-15
[ 232.714551] lz tl->mask(1) sd_numa_mask+0x0/0x10 2,7,13-15
[ 232.714558] lz cpu_map 1-2,7,13-15
[ 232.714565] lz tl->mask(2) cpu_smt_mask+0x0/0x10 2
===================

It seems that one cpu (#1 above) could be taken offline in
cpuset_hotplug_workfn(), after top_cpuset's allowed cpus changed and the
propagated work function finished. But generate_sched_domains(), which
uses the cpuset information, still considers this cpu (#1) valid, and
passes it down in the cpumask to partition_sched_domains(). Then we get
an empty sd, as shown above.

With the empty domain, in get_group(),
if (child)
cpu = cpumask_first(sched_domain_span(child));

this might assign cpu a value >= nr_cpu_ids, which then causes a bad
dereference in the later code.

The following code tries to fix the issue by ANDing cpu_active_mask into
each partition at the end of generate_sched_domains(). This might leave
the partition (set of domains) not the best one, but we know
another rebuild (caused by the cpu offline in the middle of
cpuset_hotplug_workfn()) is on the way.

Also, it seems that generate_sched_domains() needs to be called together
with partition_sched_domains(), under the hotplug lock. Otherwise,
similarly, one cpu might be taken offline while generate_sched_domains()
is running, causing the same issue.

Or maybe we could put the whole of cpuset_hotplug_workfn() under the
hotplug lock?

Thanks, Zhong

---
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 4f9dfe4..aed8436 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -754,6 +754,12 @@ done:
*/
if (doms == NULL)
ndoms = 1;
+ else {
+ for (nslot = 0; nslot < ndoms; nslot++) {
+ struct cpumask *dp = doms[nslot];
+ cpumask_and(dp, dp, cpu_active_mask);
+ }
+ }

*domains = doms;
*attributes = dattr;
@@ -2222,17 +2228,8 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
flush_workqueue(cpuset_propagate_hotplug_wq);

/* rebuild sched domains if cpus_allowed has changed */
- if (cpus_updated) {
- struct sched_domain_attr *attr;
- cpumask_var_t *doms;
- int ndoms;
-
- mutex_lock(&cpuset_mutex);
- ndoms = generate_sched_domains(&doms, &attr);
- mutex_unlock(&cpuset_mutex);
-
- partition_sched_domains(ndoms, doms, attr);
- }
+ if (cpus_updated)
+ rebuild_sched_domains();
}

void cpuset_update_active_cpus(bool cpu_online)

2013-04-11 08:57:26

by Zefan Li

Subject: Re: [RFC PATCH v2 cpuset] Don't pass offlined cpus to partition_sched_domains()

On 2013/4/9 17:59, Li Zhong wrote:
> Hi Zefan,
>
> I did some test today, enabling cpuset and online/offline the cpus. It
> caused NULL address dereferencing in get_group(). After adding some
> debugging code, it seems that it is caused by using some offlined cpus
> as the parameter to partition_sched_domains(). More details below:
>
> ===================
> following is printed in build_sched_domain():
> @@ -6476,6 +6486,14 @@ struct sched_domain *build_sched_domain(struct
> sched_domain_topology_level *tl,
> return child;
>
> cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu));
> + if (cpumask_empty(sched_domain_span(sd)))
> + printk("lz empty domain %p, mask %pS\n", sd, tl->mask);
> + char cpulist[128];
> + cpumask_scnprintf(cpulist, sizeof(cpulist), cpu_map);
> + printk("lz cpu_map %s\n", cpulist);
> + cpumask_scnprintf(cpulist, sizeof(cpulist), tl->mask(cpu));
> + printk("lz tl->mask(%d) %pS %s\n",cpu, tl->mask, cpulist);
> +
> if (child) {
> sd->level = child->level + 1;
> sched_domain_level_max = max(sched_domain_level_max,
> sd->level);
>
> [ 232.714502] lz empty domain c0000001f73c0000, mask cpu_smt_mask
> +0x0/0x10
> [ 232.714515] lz cpu_map 1-2,7,13-15
> [ 232.714521] lz tl->mask(1) cpu_smt_mask+0x0/0x10
> [ 232.714529] lz cpu_map 1-2,7,13-15
> [ 232.714536] lz tl->mask(1) cpu_cpu_mask+0x0/0x10 2,7,13-15
> [ 232.714544] lz cpu_map 1-2,7,13-15
> [ 232.714551] lz tl->mask(1) sd_numa_mask+0x0/0x10 2,7,13-15
> [ 232.714558] lz cpu_map 1-2,7,13-15
> [ 232.714565] lz tl->mask(2) cpu_smt_mask+0x0/0x10 2
> ===================
>
> It seems that one cpu (#1 of the above) could be taken offline in
> cpuset_hotplug_workfn(), after top_cpuset's allowed cpus changed and the
> propagated work function finished. But generate_sched_domains(), which
> uses the cpuset information, still considers this cpu (#1) valid, and
> passes it down in the cpumask to partition_sched_domains(). Then we have
> an empty sd as the above.
>

Thanks for the test and analysis!

So the bug was triggered when the async cpuset hotplug handling hadn't
finished yet, and at that moment the user triggered another hotplug
operation, right?

Then I think we should wait for cpuset to finish its work before
starting a new hotplug operation.

This should fix the bug you observed, and it should also fix another
bug: the 2nd schedule_work(&cpuset_hotplug_work) returns true, and so
the root cpuset.cpus won't be updated.

Could you test this patch?

--
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index ccd1de8..ece226c 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -118,6 +118,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
task_unlock(current);
}

+extern void cpuset_flush_hotplug_work(void);
+
#else /* !CONFIG_CPUSETS */

static inline int cpuset_init(void) { return 0; }
@@ -232,6 +234,10 @@ static inline bool put_mems_allowed(unsigned int seq)
return true;
}

+static inline void cpuset_flush_hotplug_work(void)
+{
+}
+
#endif /* !CONFIG_CPUSETS */

#endif /* _LINUX_CPUSET_H */
diff --git a/kernel/cpu.c b/kernel/cpu.c
index b5e4ab2..283a993 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -19,6 +19,7 @@
#include <linux/mutex.h>
#include <linux/gfp.h>
#include <linux/suspend.h>
+#include <linux/cpuset.h>

#include "smpboot.h"

@@ -278,6 +279,13 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen)
if (!cpu_online(cpu))
return -EINVAL;

+ /*
+ * cpuset handles cpu/memory hotplug asynchronously, so it's possible
+ * that cpuset hasn't finished its hotplug work for the previous
+ * hotplug operation.
+ */
+ cpuset_flush_hotplug_work();
+
cpu_hotplug_begin();

err = __cpu_notify(CPU_DOWN_PREPARE | mod, hcpu, -1, &nr_calls);
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index c55763b..9bb5018 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -2141,6 +2141,14 @@ static void schedule_cpuset_propagate_hotplug(struct cpuset *cs)
}

/**
* cpuset_flush_hotplug_work() - wait for cpuset hotplug work to finish
+ */
+void cpuset_flush_hotplug_work(void)
+{
+ flush_work(&cpuset_hotplug_work);
+}
+
+/**
* cpuset_hotplug_workfn - handle CPU/memory hotunplug for a cpuset
*
* This function is called after either CPU or memory configuration has
@@ -2226,6 +2234,8 @@ static void cpuset_hotplug_workfn(struct work_struct *work)

void cpuset_update_active_cpus(bool cpu_online)
{
+ bool queued;
+
/*
* We're inside cpu hotplug critical region which usually nests
* inside cgroup synchronization. Bounce actual hotplug processing
@@ -2237,7 +2247,8 @@ void cpuset_update_active_cpus(bool cpu_online)
* cpuset_hotplug_workfn() will rebuild it as necessary.
*/
partition_sched_domains(1, NULL, NULL);
- schedule_work(&cpuset_hotplug_work);
+ queued = schedule_work(&cpuset_hotplug_work);
+ WARN_ON(queued);
}

#ifdef CONFIG_MEMORY_HOTPLUG


> With the empty domain, in get_group(),
> if (child)
> cpu = cpumask_first(sched_domain_span(child));
>
> this might cause the cpu to be returned a value >= nr_cpu_ids, then
> cause bad dereference with the wrong cpu value in the later code.
>
> This following code tries to fix the issue by anding the cpu_active_mask
> at the end of generate_sched_domains() for each partition. This might
> result the partiton (set of domains) not the best one, but we know
> another rebuild (caused by the cpu offline in the middle of
> cpuset_hotplug_workfn()) is on the way.
>
> Also it seems that generate_sched_domains() needs be called together
> with partition_sched_domains(), under the hotplug lock. Or similarly,
> one cpu might be taken offline while doing generate_sched_domains(), and
> cause the above issue.
>
> Or maybe we could put the whole cpuset_hotplug_workfn() into the hotplug
> lock?

2013-04-12 10:11:37

by Li Zhong

Subject: Re: [RFC PATCH v2 cpuset] Don't pass offlined cpus to partition_sched_domains()

On Thu, 2013-04-11 at 16:57 +0800, Li Zefan wrote:
> On 2013/4/9 17:59, Li Zhong wrote:
> > Hi Zefan,
> >
> > I did some test today, enabling cpuset and online/offline the cpus. It
> > caused NULL address dereferencing in get_group(). After adding some
> > debugging code, it seems that it is caused by using some offlined cpus
> > as the parameter to partition_sched_domains(). More details below:
> >
> > ===================
> > following is printed in build_sched_domain():
> > @@ -6476,6 +6486,14 @@ struct sched_domain *build_sched_domain(struct
> > sched_domain_topology_level *tl,
> > return child;
> >
> > cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu));
> > + if (cpumask_empty(sched_domain_span(sd)))
> > + printk("lz empty domain %p, mask %pS\n", sd, tl->mask);
> > + char cpulist[128];
> > + cpumask_scnprintf(cpulist, sizeof(cpulist), cpu_map);
> > + printk("lz cpu_map %s\n", cpulist);
> > + cpumask_scnprintf(cpulist, sizeof(cpulist), tl->mask(cpu));
> > + printk("lz tl->mask(%d) %pS %s\n",cpu, tl->mask, cpulist);
> > +
> > if (child) {
> > sd->level = child->level + 1;
> > sched_domain_level_max = max(sched_domain_level_max,
> > sd->level);
> >
> > [ 232.714502] lz empty domain c0000001f73c0000, mask cpu_smt_mask
> > +0x0/0x10
> > [ 232.714515] lz cpu_map 1-2,7,13-15
> > [ 232.714521] lz tl->mask(1) cpu_smt_mask+0x0/0x10
> > [ 232.714529] lz cpu_map 1-2,7,13-15
> > [ 232.714536] lz tl->mask(1) cpu_cpu_mask+0x0/0x10 2,7,13-15
> > [ 232.714544] lz cpu_map 1-2,7,13-15
> > [ 232.714551] lz tl->mask(1) sd_numa_mask+0x0/0x10 2,7,13-15
> > [ 232.714558] lz cpu_map 1-2,7,13-15
> > [ 232.714565] lz tl->mask(2) cpu_smt_mask+0x0/0x10 2
> > ===================
> >
> > It seems that one cpu (#1 of the above) could be taken offline in
> > cpuset_hotplug_workfn(), after top_cpuset's allowed cpus changed and the
> > propagated work function finished. But generate_sched_domains(), which
> > uses the cpuset information, still considers this cpu (#1) valid, and
> > passes it down in the cpumask to partition_sched_domains(). Then we have
> > an empty sd as the above.
> >
>
> Thanks for the test and analysis!
>
> So the bug was caused when the cpuset async cpuset handling hasn't been
> finished, and at this time user triggered another hotplug operation, right?

Seems so, the summary is very clear :)

> Then I think we should wait cpuset to finish its work before starting
> a new hotplug operation.

Yeah.

> This should fix the bug you observed, and it should also fix another bug,
> that the 2nd schedule_work(&cpuset_hotplug_work) returns true, and so
> the root cpuset.cpus won't be updated.

OK, I just realized that the work can only have one instance...

Since schedule_work() returns false if the work is already on the queue,
I guess we should check against false instead.

And it seems to me that we also need the same logic in the cpu_up path,
for similar reasons; otherwise the WARN_ON might be triggered, and the
cpuset cpu info won't get updated.

> Could you test this patch?

Seems it works, with the above two changes.

For your convenience, the final code I used for testing is attached
below.

Thanks, Zhong

---
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 8c8a60d..d3ba31d 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -119,6 +119,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
task_unlock(current);
}

+extern void cpuset_flush_hotplug_work(void);
+
#else /* !CONFIG_CPUSETS */

static inline int cpuset_init(void) { return 0; }
@@ -233,6 +235,10 @@ static inline bool put_mems_allowed(unsigned int seq)
return true;
}

+static inline void cpuset_flush_hotplug_work(void)
+{
+}
+
#endif /* !CONFIG_CPUSETS */

#endif /* _LINUX_CPUSET_H */
diff --git a/kernel/cpu.c b/kernel/cpu.c
index b5e4ab2..6eb914b 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -19,6 +19,7 @@
#include <linux/mutex.h>
#include <linux/gfp.h>
#include <linux/suspend.h>
+#include <linux/cpuset.h>

#include "smpboot.h"

@@ -278,6 +279,13 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen)
if (!cpu_online(cpu))
return -EINVAL;

+ /*
+ * cpuset handles cpu/memory hotplug asynchronously, so it's possible
+ * that cpuset hasn't finished its hotplug work for the previous
+ * hotplug operation.
+ */
+ cpuset_flush_hotplug_work();
+
cpu_hotplug_begin();

err = __cpu_notify(CPU_DOWN_PREPARE | mod, hcpu, -1, &nr_calls);
@@ -352,6 +360,13 @@ static int __cpuinit _cpu_up(unsigned int cpu, int tasks_frozen)
unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
struct task_struct *idle;

+ /*
+ * cpuset handles cpu/memory hotplug asynchronously, so it's possible
+ * that cpuset hasn't finished its hotplug work for the previous
+ * hotplug operation.
+ */
+ cpuset_flush_hotplug_work();
+
cpu_hotplug_begin();

if (cpu_online(cpu) || !cpu_present(cpu)) {
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 4f9dfe4..6222c2c 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -2152,6 +2152,14 @@ static void schedule_cpuset_propagate_hotplug(struct cpuset *cs)
}

/**
* cpuset_flush_hotplug_work() - wait for cpuset hotplug work to finish
+ */
+void cpuset_flush_hotplug_work(void)
+{
+ flush_work(&cpuset_hotplug_work);
+}
+
+/**
* cpuset_hotplug_workfn - handle CPU/memory hotunplug for a cpuset
*
* This function is called after either CPU or memory configuration has
@@ -2237,6 +2245,8 @@ static void cpuset_hotplug_workfn(struct work_struct *work)

void cpuset_update_active_cpus(bool cpu_online)
{
+ bool queued;
+
/*
* We're inside cpu hotplug critical region which usually nests
* inside cgroup synchronization. Bounce actual hotplug processing
@@ -2248,7 +2258,8 @@ void cpuset_update_active_cpus(bool cpu_online)
* cpuset_hotplug_workfn() will rebuild it as necessary.
*/
partition_sched_domains(1, NULL, NULL);
- schedule_work(&cpuset_hotplug_work);
+ queued = schedule_work(&cpuset_hotplug_work);
+ WARN_ON(!queued);
}

#ifdef CONFIG_MEMORY_HOTPLUG