Subject: [RFC PATCH 0/5] Refcount based Cpu Hotplug. V2

Hello everyone,

This is version 2 of the refcount based cpu-hotplug "locking"
implementation.

It incorporates the review comments on the first posting, which
can be found here: http://lkml.org/lkml/2007/10/16/118.

Changes since v1:
- The !CONFIG_HOTPLUG_CPU case is now handled correctly, thanks
to the patch from Paul Jackson.

- cpu_hotplug_begin() now uses a waitqueue instead of a completion struct,
on which a writer can wait while there are active readers in the system.

- Provided new APIs, cpu_maps_update_begin() and cpu_maps_update_done(),
for serializing the updates to cpu_present_map and cpu_online_map.
Threads which update cpu_present_map should now call
cpu_maps_update_begin() instead of lock_cpu_hotplug(), since they play
the role of writers (see the sketch after this list).

- pseries_add_processor() and pseries_remove_processor() now use
cpu_maps_update_begin()/cpu_maps_update_done() in place of
lock_cpu_hotplug()/unlock_cpu_hotplug().

- Replaced the workqueue_mutex with workqueue_lock, a spinlock
which guards the workqueues list.

- Updated Documentation/cpu-hotplug.txt to reflect get_online_cpus(),
put_online_cpus() in place of the old lock_cpu_hotplug(),
unlock_cpu_hotplug().
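
For illustration, the intended split between the two sets of APIs is
roughly the following. The example_*() functions are only placeholders
for this sketch and are not part of the patches:

#include <linux/cpu.h>

/* Reader side: code that must not race with a cpu going away. */
static void example_reader(void)
{
	get_online_cpus();
	/* cpu_online_map is stable here; sleeping is allowed. */
	put_online_cpus();
}

/* Writer side: code that updates cpu_present_map/cpu_online_map,
 * e.g. arch hotplug code such as pseries_add_processor(). */
static void example_map_updater(void)
{
	cpu_maps_update_begin();
	/* ... update cpu_present_map here ... */
	cpu_maps_update_done();
}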

I'm Cc'ing the different subsystem maintainers who might be affected
by the changes in this patchstack, especially those who rely on
lock_cpu_hotplug() to protect their local data structures as well.

The patchstack, which is based against 2.6.23-mm1, has behaved well
when stress tested with kernbench running while cpu-hotplug operations
were performed continuously on i386, x86_64 and ppc64.

Awaiting your feedback.

Thanks and Regards
gautham.
--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"


Subject: [RFC PATCH 2/5] Replace lock_cpu_hotplug() with get_online_cpus()

Replace all uses of lock_cpu_hotplug/unlock_cpu_hotplug in the kernel
with get_online_cpus/put_online_cpus, as the new names highlight
the refcount semantics of these operations.

The new API guarantees protection against the cpu-hotplug operation,
but it doesn't guarantee serialized access to any of the callers' local
data structures. Hence the changes need to be reviewed.

In case of pseries_add_processor/pseries_remove_processor, use
cpu_maps_update_begin()/cpu_maps_update_done() as we're modifying the
cpu_present_map there.
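
To make the review easier, the conversion pattern in the callers is
roughly the following. This is a simplified sketch of the mtrr hunk
below (the function name is made up for illustration); note that the
subsystem's own mutex is still required for its local data:

static int example_mtrr_style_update(void)
{
	int error = 0;

	/* No CPU hotplug while we change the local state. */
	get_online_cpus();
	/* Local data is still guarded by the subsystem's own lock. */
	mutex_lock(&mtrr_mutex);
	/* ... modify the MTRR state here ... */
	mutex_unlock(&mtrr_mutex);
	put_online_cpus();
	return error;
}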

Signed-off-by: Gautham R Shenoy <[email protected]>
---
arch/i386/kernel/cpu/mtrr/main.c | 8 ++++----
arch/i386/kernel/microcode.c | 16 ++++++++--------
arch/mips/kernel/mips-mt-fpaff.c | 10 +++++-----
arch/powerpc/platforms/pseries/hotplug-cpu.c | 8 ++++----
arch/powerpc/platforms/pseries/rtasd.c | 8 ++++----
drivers/lguest/core.c | 8 ++++----
drivers/s390/char/sclp_config.c | 4 ++--
include/linux/cpu.h | 8 ++++----
kernel/cpu.c | 10 +++++-----
kernel/cpuset.c | 14 +++++++-------
kernel/rcutorture.c | 6 +++---
kernel/stop_machine.c | 4 ++--
net/core/flow.c | 4 ++--
13 files changed, 54 insertions(+), 54 deletions(-)

Index: linux-2.6.23/include/linux/cpu.h
===================================================================
--- linux-2.6.23.orig/include/linux/cpu.h
+++ linux-2.6.23/include/linux/cpu.h
@@ -100,8 +100,8 @@ static inline void cpuhotplug_mutex_unlo
mutex_unlock(cpu_hp_mutex);
}

-extern void lock_cpu_hotplug(void);
-extern void unlock_cpu_hotplug(void);
+extern void get_online_cpus(void);
+extern void put_online_cpus(void);
#define hotcpu_notifier(fn, pri) { \
static struct notifier_block fn##_nb = \
{ .notifier_call = fn, .priority = pri }; \
@@ -119,8 +119,8 @@ static inline void cpuhotplug_mutex_lock
static inline void cpuhotplug_mutex_unlock(struct mutex *cpu_hp_mutex)
{ }

-#define lock_cpu_hotplug() do { } while (0)
-#define unlock_cpu_hotplug() do { } while (0)
+#define get_online_cpus() do { } while (0)
+#define put_online_cpus() do { } while (0)
#define hotcpu_notifier(fn, pri) do { (void)(fn); } while (0)
/* These aren't inline functions due to a GCC bug. */
#define register_hotcpu_notifier(nb) ({ (void)(nb); 0; })
Index: linux-2.6.23/kernel/cpu.c
===================================================================
--- linux-2.6.23.orig/kernel/cpu.c
+++ linux-2.6.23/kernel/cpu.c
@@ -48,7 +48,7 @@ void __init cpu_hotplug_init(void)

#ifdef CONFIG_HOTPLUG_CPU

-void lock_cpu_hotplug(void)
+void get_online_cpus(void)
{
might_sleep();
if (cpu_hotplug.active_writer == current)
@@ -58,9 +58,9 @@ void lock_cpu_hotplug(void)
mutex_unlock(&cpu_hotplug.lock);

}
-EXPORT_SYMBOL_GPL(lock_cpu_hotplug);
+EXPORT_SYMBOL_GPL(get_online_cpus);

-void unlock_cpu_hotplug(void)
+void put_online_cpus(void)
{
if (cpu_hotplug.active_writer == current)
return;
@@ -73,7 +73,7 @@ void unlock_cpu_hotplug(void)
mutex_unlock(&cpu_hotplug.lock);

}
-EXPORT_SYMBOL_GPL(unlock_cpu_hotplug);
+EXPORT_SYMBOL_GPL(put_online_cpus);

#endif /* CONFIG_HOTPLUG_CPU */

@@ -110,7 +110,7 @@ void cpu_maps_update_done(void)
* non zero and goes to sleep again.
*
* However, this is very difficult to achieve in practice since
- * lock_cpu_hotplug() is not an API which is called all that often.
+ * get_online_cpus() is not an API which is called all that often.
*
*/
static void cpu_hotplug_begin(void)
Index: linux-2.6.23/kernel/stop_machine.c
===================================================================
--- linux-2.6.23.orig/kernel/stop_machine.c
+++ linux-2.6.23/kernel/stop_machine.c
@@ -203,13 +203,13 @@ int stop_machine_run(int (*fn)(void *),
int ret;

/* No CPUs can come up or down during this. */
- lock_cpu_hotplug();
+ get_online_cpus();
p = __stop_machine_run(fn, data, cpu);
if (!IS_ERR(p))
ret = kthread_stop(p);
else
ret = PTR_ERR(p);
- unlock_cpu_hotplug();
+ put_online_cpus();

return ret;
}
Index: linux-2.6.23/arch/i386/kernel/cpu/mtrr/main.c
===================================================================
--- linux-2.6.23.orig/arch/i386/kernel/cpu/mtrr/main.c
+++ linux-2.6.23/arch/i386/kernel/cpu/mtrr/main.c
@@ -351,7 +351,7 @@ int mtrr_add_page(unsigned long base, un
replace = -1;

/* No CPU hotplug when we change MTRR entries */
- lock_cpu_hotplug();
+ get_online_cpus();
/* Search for existing MTRR */
mutex_lock(&mtrr_mutex);
for (i = 0; i < num_var_ranges; ++i) {
@@ -407,7 +407,7 @@ int mtrr_add_page(unsigned long base, un
error = i;
out:
mutex_unlock(&mtrr_mutex);
- unlock_cpu_hotplug();
+ put_online_cpus();
return error;
}

@@ -497,7 +497,7 @@ int mtrr_del_page(int reg, unsigned long

max = num_var_ranges;
/* No CPU hotplug when we change MTRR entries */
- lock_cpu_hotplug();
+ get_online_cpus();
mutex_lock(&mtrr_mutex);
if (reg < 0) {
/* Search for existing MTRR */
@@ -538,7 +538,7 @@ int mtrr_del_page(int reg, unsigned long
error = reg;
out:
mutex_unlock(&mtrr_mutex);
- unlock_cpu_hotplug();
+ put_online_cpus();
return error;
}
/**
Index: linux-2.6.23/arch/i386/kernel/microcode.c
===================================================================
--- linux-2.6.23.orig/arch/i386/kernel/microcode.c
+++ linux-2.6.23/arch/i386/kernel/microcode.c
@@ -436,7 +436,7 @@ static ssize_t microcode_write (struct f
return -EINVAL;
}

- lock_cpu_hotplug();
+ get_online_cpus();
mutex_lock(&microcode_mutex);

user_buffer = (void __user *) buf;
@@ -447,7 +447,7 @@ static ssize_t microcode_write (struct f
ret = (ssize_t)len;

mutex_unlock(&microcode_mutex);
- unlock_cpu_hotplug();
+ put_online_cpus();

return ret;
}
@@ -658,14 +658,14 @@ static ssize_t reload_store(struct sys_d

old = current->cpus_allowed;

- lock_cpu_hotplug();
+ get_online_cpus();
set_cpus_allowed(current, cpumask_of_cpu(cpu));

mutex_lock(&microcode_mutex);
if (uci->valid)
err = cpu_request_microcode(cpu);
mutex_unlock(&microcode_mutex);
- unlock_cpu_hotplug();
+ put_online_cpus();
set_cpus_allowed(current, old);
}
if (err)
@@ -817,9 +817,9 @@ static int __init microcode_init (void)
return PTR_ERR(microcode_pdev);
}

- lock_cpu_hotplug();
+ get_online_cpus();
error = sysdev_driver_register(&cpu_sysdev_class, &mc_sysdev_driver);
- unlock_cpu_hotplug();
+ put_online_cpus();
if (error) {
microcode_dev_exit();
platform_device_unregister(microcode_pdev);
@@ -839,9 +839,9 @@ static void __exit microcode_exit (void)

unregister_hotcpu_notifier(&mc_cpu_notifier);

- lock_cpu_hotplug();
+ get_online_cpus();
sysdev_driver_unregister(&cpu_sysdev_class, &mc_sysdev_driver);
- unlock_cpu_hotplug();
+ put_online_cpus();

platform_device_unregister(microcode_pdev);
}
Index: linux-2.6.23/arch/powerpc/platforms/pseries/hotplug-cpu.c
===================================================================
--- linux-2.6.23.orig/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ linux-2.6.23/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -151,7 +151,7 @@ static int pseries_add_processor(struct
for (i = 0; i < nthreads; i++)
cpu_set(i, tmp);

- lock_cpu_hotplug();
+ cpu_maps_update_begin();

BUG_ON(!cpus_subset(cpu_present_map, cpu_possible_map));

@@ -188,7 +188,7 @@ static int pseries_add_processor(struct
}
err = 0;
out_unlock:
- unlock_cpu_hotplug();
+ cpu_maps_update_done();
return err;
}

@@ -209,7 +209,7 @@ static void pseries_remove_processor(str

nthreads = len / sizeof(u32);

- lock_cpu_hotplug();
+ cpu_maps_update_begin();
for (i = 0; i < nthreads; i++) {
for_each_present_cpu(cpu) {
if (get_hard_smp_processor_id(cpu) != intserv[i])
@@ -223,7 +223,7 @@ static void pseries_remove_processor(str
printk(KERN_WARNING "Could not find cpu to remove "
"with physical id 0x%x\n", intserv[i]);
}
- unlock_cpu_hotplug();
+ cpu_maps_update_done();
}

static int pseries_smp_notifier(struct notifier_block *nb,
Index: linux-2.6.23/arch/powerpc/platforms/pseries/rtasd.c
===================================================================
--- linux-2.6.23.orig/arch/powerpc/platforms/pseries/rtasd.c
+++ linux-2.6.23/arch/powerpc/platforms/pseries/rtasd.c
@@ -382,7 +382,7 @@ static void do_event_scan_all_cpus(long
{
int cpu;

- lock_cpu_hotplug();
+ get_online_cpus();
cpu = first_cpu(cpu_online_map);
for (;;) {
set_cpus_allowed(current, cpumask_of_cpu(cpu));
@@ -390,15 +390,15 @@ static void do_event_scan_all_cpus(long
set_cpus_allowed(current, CPU_MASK_ALL);

/* Drop hotplug lock, and sleep for the specified delay */
- unlock_cpu_hotplug();
+ put_online_cpus();
msleep_interruptible(delay);
- lock_cpu_hotplug();
+ get_online_cpus();

cpu = next_cpu(cpu, cpu_online_map);
if (cpu == NR_CPUS)
break;
}
- unlock_cpu_hotplug();
+ put_online_cpus();
}

static int rtasd(void *unused)
Index: linux-2.6.23/drivers/s390/char/sclp_config.c
===================================================================
--- linux-2.6.23.orig/drivers/s390/char/sclp_config.c
+++ linux-2.6.23/drivers/s390/char/sclp_config.c
@@ -29,12 +29,12 @@ static void sclp_cpu_capability_notify(s
struct sys_device *sysdev;

printk(KERN_WARNING TAG "cpu capability changed.\n");
- lock_cpu_hotplug();
+ get_online_cpus();
for_each_online_cpu(cpu) {
sysdev = get_cpu_sysdev(cpu);
kobject_uevent(&sysdev->kobj, KOBJ_CHANGE);
}
- unlock_cpu_hotplug();
+ put_online_cpus();
}

static void sclp_conf_receiver_fn(struct evbuf_header *evbuf)
Index: linux-2.6.23/kernel/rcutorture.c
===================================================================
--- linux-2.6.23.orig/kernel/rcutorture.c
+++ linux-2.6.23/kernel/rcutorture.c
@@ -726,11 +726,11 @@ static void rcu_torture_shuffle_tasks(vo
cpumask_t tmp_mask = CPU_MASK_ALL;
int i;

- lock_cpu_hotplug();
+ get_online_cpus();

/* No point in shuffling if there is only one online CPU (ex: UP) */
if (num_online_cpus() == 1) {
- unlock_cpu_hotplug();
+ put_online_cpus();
return;
}

@@ -762,7 +762,7 @@ static void rcu_torture_shuffle_tasks(vo
else
rcu_idle_cpu--;

- unlock_cpu_hotplug();
+ put_online_cpus();
}

/* Shuffle tasks across CPUs, with the intent of allowing each CPU in the
Index: linux-2.6.23/net/core/flow.c
===================================================================
--- linux-2.6.23.orig/net/core/flow.c
+++ linux-2.6.23/net/core/flow.c
@@ -296,7 +296,7 @@ void flow_cache_flush(void)
static DEFINE_MUTEX(flow_flush_sem);

/* Don't want cpus going down or up during this. */
- lock_cpu_hotplug();
+ get_online_cpus();
mutex_lock(&flow_flush_sem);
atomic_set(&info.cpuleft, num_online_cpus());
init_completion(&info.completion);
@@ -308,7 +308,7 @@ void flow_cache_flush(void)

wait_for_completion(&info.completion);
mutex_unlock(&flow_flush_sem);
- unlock_cpu_hotplug();
+ put_online_cpus();
}

static void __devinit flow_cache_cpu_prepare(int cpu)
Index: linux-2.6.23/arch/mips/kernel/mips-mt-fpaff.c
===================================================================
--- linux-2.6.23.orig/arch/mips/kernel/mips-mt-fpaff.c
+++ linux-2.6.23/arch/mips/kernel/mips-mt-fpaff.c
@@ -58,13 +58,13 @@ asmlinkage long mipsmt_sys_sched_setaffi
if (copy_from_user(&new_mask, user_mask_ptr, sizeof(new_mask)))
return -EFAULT;

- lock_cpu_hotplug();
+ get_online_cpus();
read_lock(&tasklist_lock);

p = find_process_by_pid(pid);
if (!p) {
read_unlock(&tasklist_lock);
- unlock_cpu_hotplug();
+ put_online_cpus();
return -ESRCH;
}

@@ -106,7 +106,7 @@ asmlinkage long mipsmt_sys_sched_setaffi

out_unlock:
put_task_struct(p);
- unlock_cpu_hotplug();
+ put_online_cpus();
return retval;
}

@@ -125,7 +125,7 @@ asmlinkage long mipsmt_sys_sched_getaffi
if (len < real_len)
return -EINVAL;

- lock_cpu_hotplug();
+ get_online_cpus();
read_lock(&tasklist_lock);

retval = -ESRCH;
@@ -140,7 +140,7 @@ asmlinkage long mipsmt_sys_sched_getaffi

out_unlock:
read_unlock(&tasklist_lock);
- unlock_cpu_hotplug();
+ put_online_cpus();
if (retval)
return retval;
if (copy_to_user(user_mask_ptr, &mask, real_len))
Index: linux-2.6.23/drivers/lguest/core.c
===================================================================
--- linux-2.6.23.orig/drivers/lguest/core.c
+++ linux-2.6.23/drivers/lguest/core.c
@@ -735,7 +735,7 @@ static int __init init(void)

/* We don't need the complexity of CPUs coming and going while we're
* doing this. */
- lock_cpu_hotplug();
+ get_online_cpus();
if (cpu_has_pge) { /* We have a broader idea of "global". */
/* Remember that this was originally set (for cleanup). */
cpu_had_pge = 1;
@@ -745,7 +745,7 @@ static int __init init(void)
/* Turn off the feature in the global feature set. */
clear_bit(X86_FEATURE_PGE, boot_cpu_data.x86_capability);
}
- unlock_cpu_hotplug();
+ put_online_cpus();

/* All good! */
return 0;
@@ -759,13 +759,13 @@ static void __exit fini(void)
unmap_switcher();

/* If we had PGE before we started, turn it back on now. */
- lock_cpu_hotplug();
+ get_online_cpus();
if (cpu_had_pge) {
set_bit(X86_FEATURE_PGE, boot_cpu_data.x86_capability);
/* adjust_pge's argument "1" means set PGE. */
on_each_cpu(adjust_pge, (void *)1, 0, 1);
}
- unlock_cpu_hotplug();
+ put_online_cpus();
}

/* The Host side of lguest can be a module. This is a nice way for people to
Index: linux-2.6.23/kernel/cpuset.c
===================================================================
--- linux-2.6.23.orig/kernel/cpuset.c
+++ linux-2.6.23/kernel/cpuset.c
@@ -536,10 +536,10 @@ static int cpusets_overlap(struct cpuset
*
* Call with cgroup_mutex held. May take callback_mutex during
* call due to the kfifo_alloc() and kmalloc() calls. May nest
- * a call to the lock_cpu_hotplug()/unlock_cpu_hotplug() pair.
+ * a call to the get_online_cpus()/put_online_cpus() pair.
* Must not be called holding callback_mutex, because we must not
- * call lock_cpu_hotplug() while holding callback_mutex. Elsewhere
- * the kernel nests callback_mutex inside lock_cpu_hotplug() calls.
+ * call get_online_cpus() while holding callback_mutex. Elsewhere
+ * the kernel nests callback_mutex inside get_online_cpus() calls.
* So the reverse nesting would risk an ABBA deadlock.
*
* The three key local variables below are:
@@ -673,9 +673,9 @@ restart:

rebuild:
/* Have scheduler rebuild sched domains */
- lock_cpu_hotplug();
+ get_online_cpus();
partition_sched_domains(ndoms, doms);
- unlock_cpu_hotplug();
+ put_online_cpus();

done:
if (q && !IS_ERR(q))
@@ -1503,10 +1503,10 @@ static struct cgroup_subsys_state *cpuse
*
* If the cpuset being removed has its flag 'sched_load_balance'
* enabled, then simulate turning sched_load_balance off, which
- * will call rebuild_sched_domains(). The lock_cpu_hotplug()
+ * will call rebuild_sched_domains(). The get_online_cpus()
* call in rebuild_sched_domains() must not be made while holding
* callback_mutex. Elsewhere the kernel nests callback_mutex inside
- * lock_cpu_hotplug() calls. So the reverse nesting would risk an
+ * get_online_cpus() calls. So the reverse nesting would risk an
* ABBA deadlock.
*/

--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"

Subject: [RFC PATCH 1/5] Refcount Based Cpu Hotplug implementation

This patch implements a Refcount + Waitqueue based model for
cpu-hotplug.

Now, a thread which wants to prevent cpu-hotplug will bump up a global
refcount, and a thread which wants to perform a cpu-hotplug operation
will block till the global refcount goes to zero.

Any new readers arriving during an ongoing cpu-hotplug operation are
blocked until the cpu-hotplug operation is over.

Signed-off-by: Gautham R Shenoy <[email protected]>
Signed-off-by: Paul Jackson <[email protected]> [For !CONFIG_HOTPLUG_CPU ]
---
include/linux/cpu.h | 3 +
init/main.c | 1
kernel/cpu.c | 152 +++++++++++++++++++++++++++++++++++++---------------
3 files changed, 115 insertions(+), 41 deletions(-)

Index: linux-2.6.23/init/main.c
===================================================================
--- linux-2.6.23.orig/init/main.c
+++ linux-2.6.23/init/main.c
@@ -614,6 +614,7 @@ asmlinkage void __init start_kernel(void
vfs_caches_init_early();
cpuset_init_early();
mem_init();
+ cpu_hotplug_init();
kmem_cache_init();
setup_per_cpu_pageset();
numa_policy_init();
Index: linux-2.6.23/include/linux/cpu.h
===================================================================
--- linux-2.6.23.orig/include/linux/cpu.h
+++ linux-2.6.23/include/linux/cpu.h
@@ -83,6 +83,9 @@ static inline void unregister_cpu_notifi

#endif /* CONFIG_SMP */
extern struct sysdev_class cpu_sysdev_class;
+extern void cpu_hotplug_init(void);
+extern void cpu_maps_update_begin(void);
+extern void cpu_maps_update_done(void);

#ifdef CONFIG_HOTPLUG_CPU
/* Stop CPUs going up and down. */
Index: linux-2.6.23/kernel/cpu.c
===================================================================
--- linux-2.6.23.orig/kernel/cpu.c
+++ linux-2.6.23/kernel/cpu.c
@@ -15,9 +15,8 @@
#include <linux/stop_machine.h>
#include <linux/mutex.h>

-/* This protects CPUs going up and down... */
+/* Serializes the updates to cpu_online_map, cpu_present_map */
static DEFINE_MUTEX(cpu_add_remove_lock);
-static DEFINE_MUTEX(cpu_bitmask_lock);

static __cpuinitdata RAW_NOTIFIER_HEAD(cpu_chain);

@@ -26,52 +25,123 @@ static __cpuinitdata RAW_NOTIFIER_HEAD(c
*/
static int cpu_hotplug_disabled;

-#ifdef CONFIG_HOTPLUG_CPU
+static struct {
+ struct task_struct *active_writer;
+ struct mutex lock; /* Synchronizes accesses to refcount, */
+ /*
+ * Also blocks the new readers during
+ * an ongoing cpu hotplug operation.
+ */
+ int refcount;
+ wait_queue_head_t writer_queue;
+} cpu_hotplug;
+
+#define writer_exists() (cpu_hotplug.active_writer != NULL)
+
+void __init cpu_hotplug_init(void)
+{
+ cpu_hotplug.active_writer = NULL;
+ mutex_init(&cpu_hotplug.lock);
+ cpu_hotplug.refcount = 0;
+ init_waitqueue_head(&cpu_hotplug.writer_queue);
+}

-/* Crappy recursive lock-takers in cpufreq! Complain loudly about idiots */
-static struct task_struct *recursive;
-static int recursive_depth;
+#ifdef CONFIG_HOTPLUG_CPU

void lock_cpu_hotplug(void)
{
- struct task_struct *tsk = current;
-
- if (tsk == recursive) {
- static int warnings = 10;
- if (warnings) {
- printk(KERN_ERR "Lukewarm IQ detected in hotplug locking\n");
- WARN_ON(1);
- warnings--;
- }
- recursive_depth++;
+ might_sleep();
+ if (cpu_hotplug.active_writer == current)
return;
- }
- mutex_lock(&cpu_bitmask_lock);
- recursive = tsk;
+ mutex_lock(&cpu_hotplug.lock);
+ cpu_hotplug.refcount++;
+ mutex_unlock(&cpu_hotplug.lock);
+
}
EXPORT_SYMBOL_GPL(lock_cpu_hotplug);

void unlock_cpu_hotplug(void)
{
- WARN_ON(recursive != current);
- if (recursive_depth) {
- recursive_depth--;
+ if (cpu_hotplug.active_writer == current)
return;
- }
- recursive = NULL;
- mutex_unlock(&cpu_bitmask_lock);
+ mutex_lock(&cpu_hotplug.lock);
+ cpu_hotplug.refcount--;
+
+ if (unlikely(writer_exists()) && !cpu_hotplug.refcount)
+ wake_up(&cpu_hotplug.writer_queue);
+
+ mutex_unlock(&cpu_hotplug.lock);
+
}
EXPORT_SYMBOL_GPL(unlock_cpu_hotplug);

#endif /* CONFIG_HOTPLUG_CPU */

+/*
+ * The following two APIs must be used when attempting
+ * to serialize the updates to cpu_online_map, cpu_present_map.
+ */
+void cpu_maps_update_begin(void)
+{
+ mutex_lock(&cpu_add_remove_lock);
+}
+
+void cpu_maps_update_done(void)
+{
+ mutex_unlock(&cpu_add_remove_lock);
+}
+
+/*
+ * This ensures that the hotplug operation can begin only when the
+ * refcount goes to zero.
+ *
+ * Note that during a cpu-hotplug operation, the new readers, if any,
+ * will be blocked by the cpu_hotplug.lock
+ *
+ * Since cpu_hotplug_begin() is always called after invoking
+ * cpu_maps_update_begin(), we can be sure that only one writer is active.
+ *
+ * Note that theoretically, there is a possibility of a livelock:
+ * - Refcount goes to zero, last reader wakes up the sleeping
+ * writer.
+ * - Last reader unlocks the cpu_hotplug.lock.
+ * - A new reader arrives at this moment, bumps up the refcount.
+ * - The writer acquires the cpu_hotplug.lock finds the refcount
+ * non zero and goes to sleep again.
+ *
+ * However, this is very difficult to achieve in practice since
+ * lock_cpu_hotplug() is not an API which is called all that often.
+ *
+ */
+static void cpu_hotplug_begin(void)
+{
+ DECLARE_WAITQUEUE(wait, current);
+
+ mutex_lock(&cpu_hotplug.lock);
+
+ cpu_hotplug.active_writer = current;
+ add_wait_queue_exclusive(&cpu_hotplug.writer_queue, &wait);
+ while (cpu_hotplug.refcount) {
+ set_current_state(TASK_UNINTERRUPTIBLE);
+ mutex_unlock(&cpu_hotplug.lock);
+ schedule();
+ mutex_lock(&cpu_hotplug.lock);
+ }
+ remove_wait_queue_locked(&cpu_hotplug.writer_queue, &wait);
+}
+
+static void cpu_hotplug_done(void)
+{
+ cpu_hotplug.active_writer = NULL;
+ mutex_unlock(&cpu_hotplug.lock);
+}
/* Need to know about CPUs going up/down? */
int __cpuinit register_cpu_notifier(struct notifier_block *nb)
{
int ret;
- mutex_lock(&cpu_add_remove_lock);
+ cpu_maps_update_begin();
ret = raw_notifier_chain_register(&cpu_chain, nb);
- mutex_unlock(&cpu_add_remove_lock);
+ cpu_maps_update_done();
return ret;
}

@@ -81,9 +151,9 @@ EXPORT_SYMBOL(register_cpu_notifier);

void unregister_cpu_notifier(struct notifier_block *nb)
{
- mutex_lock(&cpu_add_remove_lock);
+ cpu_maps_update_begin();
raw_notifier_chain_unregister(&cpu_chain, nb);
- mutex_unlock(&cpu_add_remove_lock);
+ cpu_maps_update_done();
}
EXPORT_SYMBOL(unregister_cpu_notifier);

@@ -147,6 +217,7 @@ static int _cpu_down(unsigned int cpu, i
if (!cpu_online(cpu))
return -EINVAL;

+ cpu_hotplug_begin();
raw_notifier_call_chain(&cpu_chain, CPU_LOCK_ACQUIRE, hcpu);
err = __raw_notifier_call_chain(&cpu_chain, CPU_DOWN_PREPARE | mod,
hcpu, -1, &nr_calls);
@@ -166,9 +237,7 @@ static int _cpu_down(unsigned int cpu, i
cpu_clear(cpu, tmp);
set_cpus_allowed(current, tmp);

- mutex_lock(&cpu_bitmask_lock);
p = __stop_machine_run(take_cpu_down, &tcd_param, cpu);
- mutex_unlock(&cpu_bitmask_lock);

if (IS_ERR(p) || cpu_online(cpu)) {
/* CPU didn't die: tell everyone. Can't complain. */
@@ -203,6 +272,7 @@ out_allowed:
set_cpus_allowed(current, old_allowed);
out_release:
raw_notifier_call_chain(&cpu_chain, CPU_LOCK_RELEASE, hcpu);
+ cpu_hotplug_done();
return err;
}

@@ -210,13 +280,13 @@ int cpu_down(unsigned int cpu)
{
int err = 0;

- mutex_lock(&cpu_add_remove_lock);
+ cpu_maps_update_begin();
if (cpu_hotplug_disabled)
err = -EBUSY;
else
err = _cpu_down(cpu, 0);

- mutex_unlock(&cpu_add_remove_lock);
+ cpu_maps_update_done();
return err;
}
#endif /*CONFIG_HOTPLUG_CPU*/
@@ -231,6 +301,7 @@ static int __cpuinit _cpu_up(unsigned in
if (cpu_online(cpu) || !cpu_present(cpu))
return -EINVAL;

+ cpu_hotplug_begin();
raw_notifier_call_chain(&cpu_chain, CPU_LOCK_ACQUIRE, hcpu);
ret = __raw_notifier_call_chain(&cpu_chain, CPU_UP_PREPARE | mod, hcpu,
-1, &nr_calls);
@@ -243,9 +314,7 @@ static int __cpuinit _cpu_up(unsigned in
}

/* Arch-specific enabling code. */
- mutex_lock(&cpu_bitmask_lock);
ret = __cpu_up(cpu);
- mutex_unlock(&cpu_bitmask_lock);
if (ret != 0)
goto out_notify;
BUG_ON(!cpu_online(cpu));
@@ -258,6 +327,7 @@ out_notify:
__raw_notifier_call_chain(&cpu_chain,
CPU_UP_CANCELED | mod, hcpu, nr_calls, NULL);
raw_notifier_call_chain(&cpu_chain, CPU_LOCK_RELEASE, hcpu);
+ cpu_hotplug_done();

return ret;
}
@@ -275,13 +345,13 @@ int __cpuinit cpu_up(unsigned int cpu)
return -EINVAL;
}

- mutex_lock(&cpu_add_remove_lock);
+ cpu_maps_update_begin();
if (cpu_hotplug_disabled)
err = -EBUSY;
else
err = _cpu_up(cpu, 0);

- mutex_unlock(&cpu_add_remove_lock);
+ cpu_maps_update_done();
return err;
}

@@ -292,7 +362,7 @@ int disable_nonboot_cpus(void)
{
int cpu, first_cpu, error = 0;

- mutex_lock(&cpu_add_remove_lock);
+ cpu_maps_update_begin();
first_cpu = first_cpu(cpu_online_map);
/* We take down all of the non-boot CPUs in one shot to avoid races
* with the userspace trying to use the CPU hotplug at the same time
@@ -319,7 +389,7 @@ int disable_nonboot_cpus(void)
} else {
printk(KERN_ERR "Non-boot CPUs are not disabled\n");
}
- mutex_unlock(&cpu_add_remove_lock);
+ cpu_maps_update_done();
return error;
}

@@ -328,7 +398,7 @@ void enable_nonboot_cpus(void)
int cpu, error;

/* Allow everyone to use the CPU hotplug again */
- mutex_lock(&cpu_add_remove_lock);
+ cpu_maps_update_begin();
cpu_hotplug_disabled = 0;
if (cpus_empty(frozen_cpus))
goto out;
@@ -344,6 +414,6 @@ void enable_nonboot_cpus(void)
}
cpus_clear(frozen_cpus);
out:
- mutex_unlock(&cpu_add_remove_lock);
+ cpu_maps_update_done();
}
#endif /* CONFIG_PM_SLEEP_SMP */
--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"

Subject: [RFC PATCH 3/5] Replace per-subsystem mutexes with get_online_cpus()

This patch converts the known per-subsystem hotplug mutexes to
get_online_cpus()/put_online_cpus(). It also eliminates the
CPU_LOCK_ACQUIRE and CPU_LOCK_RELEASE hotplug notification events.
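
For reference, the per-subsystem pattern after this conversion looks
roughly as follows. This is a simplified sketch of the slab.c hunk
below; "subsys" is only a placeholder name:

static DEFINE_MUTEX(subsys_mutex);	/* guards the subsystem's own lists */

static int __cpuinit subsys_cpu_callback(struct notifier_block *nb,
					 unsigned long action, void *hcpu)
{
	switch (action) {
	case CPU_UP_PREPARE:
	case CPU_UP_PREPARE_FROZEN:
		/* Take the local mutex around the event itself,
		 * instead of in CPU_LOCK_ACQUIRE/CPU_LOCK_RELEASE. */
		mutex_lock(&subsys_mutex);
		/* ... set up the per-cpu state ... */
		mutex_unlock(&subsys_mutex);
		break;
	}
	return NOTIFY_OK;
}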

Signed-off-by: Gautham R Shenoy <[email protected]>
---
include/linux/notifier.h | 4 +---
kernel/cpu.c | 4 ----
kernel/sched.c | 27 ++++++++++-----------------
kernel/workqueue.c | 31 ++++++++++++++++---------------
mm/slab.c | 18 +++++++++++-------
5 files changed, 38 insertions(+), 46 deletions(-)

Index: linux-2.6.23/kernel/sched.c
===================================================================
--- linux-2.6.23.orig/kernel/sched.c
+++ linux-2.6.23/kernel/sched.c
@@ -360,7 +360,6 @@ struct rq {
};

static DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
-static DEFINE_MUTEX(sched_hotcpu_mutex);

static inline void check_preempt_curr(struct rq *rq, struct task_struct *p)
{
@@ -4337,13 +4336,13 @@ long sched_setaffinity(pid_t pid, cpumas
struct task_struct *p;
int retval;

- mutex_lock(&sched_hotcpu_mutex);
+ get_online_cpus();
read_lock(&tasklist_lock);

p = find_process_by_pid(pid);
if (!p) {
read_unlock(&tasklist_lock);
- mutex_unlock(&sched_hotcpu_mutex);
+ put_online_cpus();
return -ESRCH;
}

@@ -4370,7 +4369,7 @@ long sched_setaffinity(pid_t pid, cpumas

out_unlock:
put_task_struct(p);
- mutex_unlock(&sched_hotcpu_mutex);
+ put_online_cpus();
return retval;
}

@@ -4427,7 +4426,7 @@ long sched_getaffinity(pid_t pid, cpumas
struct task_struct *p;
int retval;

- mutex_lock(&sched_hotcpu_mutex);
+ get_online_cpus();
read_lock(&tasklist_lock);

retval = -ESRCH;
@@ -4443,7 +4442,7 @@ long sched_getaffinity(pid_t pid, cpumas

out_unlock:
read_unlock(&tasklist_lock);
- mutex_unlock(&sched_hotcpu_mutex);
+ put_online_cpus();

return retval;
}
@@ -5342,9 +5341,6 @@ migration_call(struct notifier_block *nf
struct rq *rq;

switch (action) {
- case CPU_LOCK_ACQUIRE:
- mutex_lock(&sched_hotcpu_mutex);
- break;

case CPU_UP_PREPARE:
case CPU_UP_PREPARE_FROZEN:
@@ -5398,7 +5394,7 @@ migration_call(struct notifier_block *nf
BUG_ON(rq->nr_running != 0);

/* No need to migrate the tasks: it was best-effort if
- * they didn't take sched_hotcpu_mutex. Just wake up
+ * they didn't do a get_online_cpus(). Just wake up
* the requestors. */
spin_lock_irq(&rq->lock);
while (!list_empty(&rq->migration_queue)) {
@@ -5412,9 +5408,6 @@ migration_call(struct notifier_block *nf
spin_unlock_irq(&rq->lock);
break;
#endif
- case CPU_LOCK_RELEASE:
- mutex_unlock(&sched_hotcpu_mutex);
- break;
}
return NOTIFY_OK;
}
@@ -6338,10 +6331,10 @@ static int arch_reinit_sched_domains(voi
{
int err;

- mutex_lock(&sched_hotcpu_mutex);
+ get_online_cpus();
detach_destroy_domains(&cpu_online_map);
err = arch_init_sched_domains(&cpu_online_map);
- mutex_unlock(&sched_hotcpu_mutex);
+ put_online_cpus();

return err;
}
@@ -6452,12 +6445,12 @@ void __init sched_init_smp(void)
{
cpumask_t non_isolated_cpus;

- mutex_lock(&sched_hotcpu_mutex);
+ get_online_cpus();
arch_init_sched_domains(&cpu_online_map);
cpus_andnot(non_isolated_cpus, cpu_possible_map, cpu_isolated_map);
if (cpus_empty(non_isolated_cpus))
cpu_set(smp_processor_id(), non_isolated_cpus);
- mutex_unlock(&sched_hotcpu_mutex);
+ put_online_cpus();
/* XXX: Theoretical race here - CPU may be hotplugged now */
hotcpu_notifier(update_sched_domains, 0);

Index: linux-2.6.23/mm/slab.c
===================================================================
--- linux-2.6.23.orig/mm/slab.c
+++ linux-2.6.23/mm/slab.c
@@ -730,8 +730,7 @@ static inline void init_lock_keys(void)
#endif

/*
- * 1. Guard access to the cache-chain.
- * 2. Protect sanity of cpu_online_map against cpu hotplug events
+ * Guard access to the cache-chain.
*/
static DEFINE_MUTEX(cache_chain_mutex);
static struct list_head cache_chain;
@@ -1331,12 +1330,11 @@ static int __cpuinit cpuup_callback(stru
int err = 0;

switch (action) {
- case CPU_LOCK_ACQUIRE:
- mutex_lock(&cache_chain_mutex);
- break;
case CPU_UP_PREPARE:
case CPU_UP_PREPARE_FROZEN:
+ mutex_lock(&cache_chain_mutex);
err = cpuup_prepare(cpu);
+ mutex_unlock(&cache_chain_mutex);
break;
case CPU_ONLINE:
case CPU_ONLINE_FROZEN:
@@ -1373,9 +1371,8 @@ static int __cpuinit cpuup_callback(stru
#endif
case CPU_UP_CANCELED:
case CPU_UP_CANCELED_FROZEN:
+ mutex_lock(&cache_chain_mutex);
cpuup_canceled(cpu);
- break;
- case CPU_LOCK_RELEASE:
mutex_unlock(&cache_chain_mutex);
break;
}
@@ -2170,6 +2167,7 @@ kmem_cache_create (const char *name, siz
* We use cache_chain_mutex to ensure a consistent view of
* cpu_online_map as well. Please see cpuup_callback
*/
+ get_online_cpus();
mutex_lock(&cache_chain_mutex);

list_for_each_entry(pc, &cache_chain, next) {
@@ -2396,6 +2394,7 @@ oops:
panic("kmem_cache_create(): failed to create slab `%s'\n",
name);
mutex_unlock(&cache_chain_mutex);
+ put_online_cpus();
return cachep;
}
EXPORT_SYMBOL(kmem_cache_create);
@@ -2547,9 +2546,11 @@ int kmem_cache_shrink(struct kmem_cache
int ret;
BUG_ON(!cachep || in_interrupt());

+ get_online_cpus();
mutex_lock(&cache_chain_mutex);
ret = __cache_shrink(cachep);
mutex_unlock(&cache_chain_mutex);
+ put_online_cpus();
return ret;
}
EXPORT_SYMBOL(kmem_cache_shrink);
@@ -2575,6 +2576,7 @@ void kmem_cache_destroy(struct kmem_cach
BUG_ON(!cachep || in_interrupt());

/* Find the cache in the chain of caches. */
+ get_online_cpus();
mutex_lock(&cache_chain_mutex);
/*
* the chain is never empty, cache_cache is never destroyed
@@ -2584,6 +2586,7 @@ void kmem_cache_destroy(struct kmem_cach
slab_error(cachep, "Can't free all objects");
list_add(&cachep->next, &cache_chain);
mutex_unlock(&cache_chain_mutex);
+ put_online_cpus();
return;
}

@@ -2592,6 +2595,7 @@ void kmem_cache_destroy(struct kmem_cach

__kmem_cache_destroy(cachep);
mutex_unlock(&cache_chain_mutex);
+ put_online_cpus();
}
EXPORT_SYMBOL(kmem_cache_destroy);

Index: linux-2.6.23/kernel/workqueue.c
===================================================================
--- linux-2.6.23.orig/kernel/workqueue.c
+++ linux-2.6.23/kernel/workqueue.c
@@ -592,8 +592,6 @@ EXPORT_SYMBOL(schedule_delayed_work_on);
* Returns zero on success.
* Returns -ve errno on failure.
*
- * Appears to be racy against CPU hotplug.
- *
* schedule_on_each_cpu() is very slow.
*/
int schedule_on_each_cpu(work_func_t func)
@@ -605,7 +603,7 @@ int schedule_on_each_cpu(work_func_t fun
if (!works)
return -ENOMEM;

- preempt_disable(); /* CPU hotplug */
+ get_online_cpus(); /* CPU hotplug */
for_each_online_cpu(cpu) {
struct work_struct *work = per_cpu_ptr(works, cpu);

@@ -613,8 +611,8 @@ int schedule_on_each_cpu(work_func_t fun
set_bit(WORK_STRUCT_PENDING, work_data_bits(work));
__queue_work(per_cpu_ptr(keventd_wq->cpu_wq, cpu), work);
}
- preempt_enable();
flush_workqueue(keventd_wq);
+ put_online_cpus();
free_percpu(works);
return 0;
}
@@ -749,8 +747,10 @@ struct workqueue_struct *__create_workqu
err = create_workqueue_thread(cwq, singlethread_cpu);
start_workqueue_thread(cwq, -1);
} else {
+ get_online_cpus();
mutex_lock(&workqueue_mutex);
list_add(&wq->list, &workqueues);
+ mutex_unlock(&workqueue_mutex);

for_each_possible_cpu(cpu) {
cwq = init_cpu_workqueue(wq, cpu);
@@ -759,7 +759,7 @@ struct workqueue_struct *__create_workqu
err = create_workqueue_thread(cwq, cpu);
start_workqueue_thread(cwq, cpu);
}
- mutex_unlock(&workqueue_mutex);
+ put_online_cpus();
}

if (err) {
@@ -809,9 +809,11 @@ void destroy_workqueue(struct workqueue_
struct cpu_workqueue_struct *cwq;
int cpu;

+ get_online_cpus();
mutex_lock(&workqueue_mutex);
list_del(&wq->list);
mutex_unlock(&workqueue_mutex);
+ put_online_cpus();

for_each_cpu_mask(cpu, *cpu_map) {
cwq = per_cpu_ptr(wq->cpu_wq, cpu);
@@ -830,22 +832,17 @@ static int __devinit workqueue_cpu_callb
unsigned int cpu = (unsigned long)hcpu;
struct cpu_workqueue_struct *cwq;
struct workqueue_struct *wq;
+ int ret = NOTIFY_OK;

action &= ~CPU_TASKS_FROZEN;

switch (action) {
- case CPU_LOCK_ACQUIRE:
- mutex_lock(&workqueue_mutex);
- return NOTIFY_OK;
-
- case CPU_LOCK_RELEASE:
- mutex_unlock(&workqueue_mutex);
- return NOTIFY_OK;

case CPU_UP_PREPARE:
cpu_set(cpu, cpu_populated_map);
}

+ mutex_lock(&workqueue_mutex);
list_for_each_entry(wq, &workqueues, list) {
cwq = per_cpu_ptr(wq->cpu_wq, cpu);

@@ -853,8 +850,10 @@ static int __devinit workqueue_cpu_callb
case CPU_UP_PREPARE:
if (!create_workqueue_thread(cwq, cpu))
break;
- printk(KERN_ERR "workqueue for %i failed\n", cpu);
- return NOTIFY_BAD;
+ printk(KERN_ERR "workqueue [%s] for %i failed\n",
+ wq->name, cpu);
+ ret = NOTIFY_BAD;
+ goto out_unlock;

case CPU_ONLINE:
start_workqueue_thread(cwq, cpu);
@@ -868,7 +867,9 @@ static int __devinit workqueue_cpu_callb
}
}

- return NOTIFY_OK;
+out_unlock:
+ mutex_unlock(&workqueue_mutex);
+ return ret;
}

void __init init_workqueues(void)
Index: linux-2.6.23/kernel/cpu.c
===================================================================
--- linux-2.6.23.orig/kernel/cpu.c
+++ linux-2.6.23/kernel/cpu.c
@@ -218,7 +218,6 @@ static int _cpu_down(unsigned int cpu, i
return -EINVAL;

cpu_hotplug_begin();
- raw_notifier_call_chain(&cpu_chain, CPU_LOCK_ACQUIRE, hcpu);
err = __raw_notifier_call_chain(&cpu_chain, CPU_DOWN_PREPARE | mod,
hcpu, -1, &nr_calls);
if (err == NOTIFY_BAD) {
@@ -271,7 +270,6 @@ out_thread:
out_allowed:
set_cpus_allowed(current, old_allowed);
out_release:
- raw_notifier_call_chain(&cpu_chain, CPU_LOCK_RELEASE, hcpu);
cpu_hotplug_done();
return err;
}
@@ -302,7 +300,6 @@ static int __cpuinit _cpu_up(unsigned in
return -EINVAL;

cpu_hotplug_begin();
- raw_notifier_call_chain(&cpu_chain, CPU_LOCK_ACQUIRE, hcpu);
ret = __raw_notifier_call_chain(&cpu_chain, CPU_UP_PREPARE | mod, hcpu,
-1, &nr_calls);
if (ret == NOTIFY_BAD) {
@@ -326,7 +323,6 @@ out_notify:
if (ret != 0)
__raw_notifier_call_chain(&cpu_chain,
CPU_UP_CANCELED | mod, hcpu, nr_calls, NULL);
- raw_notifier_call_chain(&cpu_chain, CPU_LOCK_RELEASE, hcpu);
cpu_hotplug_done();

return ret;
Index: linux-2.6.23/include/linux/notifier.h
===================================================================
--- linux-2.6.23.orig/include/linux/notifier.h
+++ linux-2.6.23/include/linux/notifier.h
@@ -207,9 +207,7 @@ static inline int notifier_to_errno(int
#define CPU_DOWN_PREPARE 0x0005 /* CPU (unsigned)v going down */
#define CPU_DOWN_FAILED 0x0006 /* CPU (unsigned)v NOT going down */
#define CPU_DEAD 0x0007 /* CPU (unsigned)v dead */
-#define CPU_LOCK_ACQUIRE 0x0008 /* Acquire all hotcpu locks */
-#define CPU_LOCK_RELEASE 0x0009 /* Release all hotcpu locks */
-#define CPU_DYING 0x000A /* CPU (unsigned)v not running any task,
+#define CPU_DYING 0x0008 /* CPU (unsigned)v not running any task,
* not handling interrupts, soon dead */

/* Used for CPU hotplug events occuring while tasks are frozen due to a suspend
--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"

Subject: [RFC PATCH 4/5] Remove CPU_DEAD/CPU_UP_CANCELLED handling from workqueue.c

cleanup_workqueue_thread() in the CPU_DEAD and CPU_UP_CANCELLED paths
will cause a deadlock if the worker thread is executing a work item
which is blocked on get_online_cpus(). This will lead to an
irrecoverable hang.

The solution is to not clean up the worker thread. Instead, let it
remain even after the cpu goes offline. Since no one can queue any work
on an offlined cpu, this thread will sleep forever, until someone
onlines the cpu again.

With get_online_cpus()/put_online_cpus(), we can eliminate
the workqueue_mutex and reintroduce the workqueue_lock,
a spinlock which serializes the accesses to the
workqueues list.
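
To illustrate the deadlock described above (my_work_fn() is a
hypothetical work function, not something in the tree):

static void my_work_fn(struct work_struct *work)
{
	/* Blocks while a cpu-hotplug writer (cpu_hotplug_begin()) is active. */
	get_online_cpus();
	/* ... */
	put_online_cpus();
}

/*
 * Old CPU_DEAD path, which runs while the hotplug writer is active:
 *   cleanup_workqueue_thread()
 *     -> flush_cpu_workqueue() waits for my_work_fn() to complete,
 *        but my_work_fn() is waiting in get_online_cpus() for the
 *        writer to finish: an irrecoverable hang.
 */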

Signed-off-by: Gautham R Shenoy <[email protected]>
---
kernel/workqueue.c | 49 ++++++++++++++++++-------------------------------
1 file changed, 18 insertions(+), 31 deletions(-)

Index: linux-2.6.23/kernel/workqueue.c
===================================================================
--- linux-2.6.23.orig/kernel/workqueue.c
+++ linux-2.6.23/kernel/workqueue.c
@@ -30,6 +30,7 @@
#include <linux/hardirq.h>
#include <linux/mempolicy.h>
#include <linux/freezer.h>
+#include <linux/cpumask.h>
#include <linux/kallsyms.h>
#include <linux/debug_locks.h>
#include <linux/lockdep.h>
@@ -67,9 +68,8 @@ struct workqueue_struct {
#endif
};

-/* All the per-cpu workqueues on the system, for hotplug cpu to add/remove
- threads to each one as cpus come/go. */
-static DEFINE_MUTEX(workqueue_mutex);
+/* Serializes accesses to the workqueues list. */
+static DEFINE_SPINLOCK(workqueue_lock);
static LIST_HEAD(workqueues);

static int singlethread_cpu __read_mostly;
@@ -712,7 +712,7 @@ static void start_workqueue_thread(struc

if (p != NULL) {
if (cpu >= 0)
- kthread_bind(p, cpu);
+ set_cpus_allowed(p, cpumask_of_cpu(cpu));
wake_up_process(p);
}
}
@@ -748,9 +748,9 @@ struct workqueue_struct *__create_workqu
start_workqueue_thread(cwq, -1);
} else {
get_online_cpus();
- mutex_lock(&workqueue_mutex);
+ spin_lock(&workqueue_lock);
list_add(&wq->list, &workqueues);
- mutex_unlock(&workqueue_mutex);
+ spin_unlock(&workqueue_lock);

for_each_possible_cpu(cpu) {
cwq = init_cpu_workqueue(wq, cpu);
@@ -773,26 +773,19 @@ EXPORT_SYMBOL_GPL(__create_workqueue_key
static void cleanup_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
{
/*
- * Our caller is either destroy_workqueue() or CPU_DEAD,
- * workqueue_mutex protects cwq->thread
+ * Our caller is destroy_workqueue(). So warn on a double
+ * destroy.
*/
- if (cwq->thread == NULL)
+ if (cwq->thread == NULL) {
+ WARN_ON(1);
return;
+ }

lock_acquire(&cwq->wq->lockdep_map, 0, 0, 0, 2, _THIS_IP_);
lock_release(&cwq->wq->lockdep_map, 1, _THIS_IP_);

flush_cpu_workqueue(cwq);
- /*
- * If the caller is CPU_DEAD and cwq->worklist was not empty,
- * a concurrent flush_workqueue() can insert a barrier after us.
- * However, in that case run_workqueue() won't return and check
- * kthread_should_stop() until it flushes all work_struct's.
- * When ->worklist becomes empty it is safe to exit because no
- * more work_structs can be queued on this cwq: flush_workqueue
- * checks list_empty(), and a "normal" queue_work() can't use
- * a dead CPU.
- */
+
kthread_stop(cwq->thread);
cwq->thread = NULL;
}
@@ -810,9 +803,9 @@ void destroy_workqueue(struct workqueue_
int cpu;

get_online_cpus();
- mutex_lock(&workqueue_mutex);
+ spin_lock(&workqueue_lock);
list_del(&wq->list);
- mutex_unlock(&workqueue_mutex);
+ spin_unlock(&workqueue_lock);
put_online_cpus();

for_each_cpu_mask(cpu, *cpu_map) {
@@ -842,33 +835,27 @@ static int __devinit workqueue_cpu_callb
cpu_set(cpu, cpu_populated_map);
}

- mutex_lock(&workqueue_mutex);
list_for_each_entry(wq, &workqueues, list) {
cwq = per_cpu_ptr(wq->cpu_wq, cpu);

switch (action) {
case CPU_UP_PREPARE:
+ if (likely(cwq->thread != NULL))
+ break;
if (!create_workqueue_thread(cwq, cpu))
break;
printk(KERN_ERR "workqueue [%s] for %i failed\n",
wq->name, cpu);
ret = NOTIFY_BAD;
- goto out_unlock;
+ goto out;

case CPU_ONLINE:
start_workqueue_thread(cwq, cpu);
break;
-
- case CPU_UP_CANCELED:
- start_workqueue_thread(cwq, -1);
- case CPU_DEAD:
- cleanup_workqueue_thread(cwq, cpu);
- break;
}
}

-out_unlock:
- mutex_unlock(&workqueue_mutex);
+out:
return ret;
}

--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"

Subject: [RFC PATCH 5/5] Update get_online_cpus() in Documentation/cpu-hotplug.txt

Update the documentation for cpu hotplug to reflect the use
of the newly added APIs get_online_cpus()/put_online_cpus().

Signed-off-by: Gautham R Shenoy <[email protected]>

---
Documentation/cpu-hotplug.txt | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)

Index: linux-2.6.23/Documentation/cpu-hotplug.txt
===================================================================
--- linux-2.6.23.orig/Documentation/cpu-hotplug.txt
+++ linux-2.6.23/Documentation/cpu-hotplug.txt
@@ -109,12 +109,13 @@ Never use anything other than cpumask_t
for_each_cpu_mask(x,mask) - Iterate over some random collection of cpu mask.

#include <linux/cpu.h>
- lock_cpu_hotplug() and unlock_cpu_hotplug():
+ get_online_cpus() and put_online_cpus():

-The above calls are used to inhibit cpu hotplug operations. While holding the
-cpucontrol mutex, cpu_online_map will not change. If you merely need to avoid
-cpus going away, you could also use preempt_disable() and preempt_enable()
-for those sections. Just remember the critical section cannot call any
+The above calls are used to inhibit cpu hotplug operations. While the
+cpu_hotplug.refcount is non zero, the cpu_online_map will not change.
+If you merely need to avoid cpus going away, you could also use
+preempt_disable() and preempt_enable() for those sections.
+Just remember the critical section cannot call any
function that can sleep or schedule this process away. The preempt_disable()
will work as long as stop_machine_run() is used to take a cpu down.

--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"

2007-10-24 07:21:17

by Rusty Russell

Subject: Re: [RFC PATCH 4/5] Remove CPU_DEAD/CPU_UP_CANCELLED handling from workqueue.c

On Wednesday 24 October 2007 15:37:16 Gautham R Shenoy wrote:
> @@ -712,7 +712,7 @@ static void start_workqueue_thread(struc
>
> if (p != NULL) {
> if (cpu >= 0)
> - kthread_bind(p, cpu);
> + set_cpus_allowed(p, cpumask_of_cpu(cpu));
> wake_up_process(p);
> }

Hi Gautham!

This works, although the change is unnecessary. If you're going to
do this I'd prefer "if (set_cpus_allowed(p, cpumask_of_cpu(cpu)) != 0)
BUG();".

Cheers,
Rusty.

Subject: Re: [RFC PATCH 4/5] Remove CPU_DEAD/CPU_UP_CANCELLED handling from workqueue.c

Hello Rusty,
On Wed, Oct 24, 2007 at 05:21:04PM +1000, Rusty Russell wrote:
> On Wednesday 24 October 2007 15:37:16 Gautham R Shenoy wrote:
> > @@ -712,7 +712,7 @@ static void start_workqueue_thread(struc
> >
> > if (p != NULL) {
> > if (cpu >= 0)
> > - kthread_bind(p, cpu);
> > + set_cpus_allowed(p, cpumask_of_cpu(cpu));
> > wake_up_process(p);
> > }
>
> Hi Gautham!
>
> This works, although the change is unnecessary.

This change was necessary because once a cpu goes offline, the worker
threads for that particular cpu sleep in the TASK_INTERRUPTIBLE state.
Thus, during CPU_ONLINE, when we call start_workqueue_thread(),
kthread_bind() yells since the worker thread is not
TASK_UNINTERRUPTIBLE.

Probably later on when Oleg plans to re-introduce the ->should_stop
flag, we can put the worker threads to uninterruptible sleep, and revert
back to kthread_bind().

> If you're going to
> do this I'd prefer "if (set_cpus_allowed(p, cpumask_of_cpu(cpu)) != 0)
> BUG();".

Not sure if this is required. set_cpus_allowed() returns a non-zero
value only if the cpumask passed doesn't intersect with the cpu-online
map. In our case, start_workqueue_thread() is called only when the cpu
is online, i.e.:

a) __create_workqueue_key() calls start_workqueue_thread() with
get_online_cpus() protection only for those cpus which are online.

b) workqueue_cpu_callback() calls start_workqueue_thread() on a
CPU_ONLINE event.

>
> Cheers,
> Rusty.

Thanks and Regards
gautham.
--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"

2007-10-24 13:33:26

by Oleg Nesterov

Subject: Re: [RFC PATCH 4/5] Remove CPU_DEAD/CPU_UP_CANCELLED handling from workqueue.c

On 10/24, Gautham R Shenoy wrote:
>

(reordered)

> With get_online_cpus()/put_online_cpus(), we can eliminate
> the workqueue_mutex and reintroduce the workqueue_lock,
> which is a spinlock which serializes the accesses to the
> workqueues list.

This change is obviously good, can't it go into the previous patch?

Because,

> Solution is not to cleanup the worker thread. Instead let it remain
> even after the cpu goes offline. Since no one can queue any work
> on an offlined cpu, this thread will be forever sleeping, untill
> someone onlines the cpu.

I still think this patch is questionable. Please look at my previous
response http://marc.info/?l=linux-kernel&m=119262203729543

In short: with this patch it is not possible to guarantee that work->func()
will run on the correct CPU.

> static void cleanup_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
> {
> /*
> - * Our caller is either destroy_workqueue() or CPU_DEAD,
> - * workqueue_mutex protects cwq->thread
> + * Our caller is destroy_workqueue(). So warn on a double
> + * destroy.
> */
> - if (cwq->thread == NULL)
> + if (cwq->thread == NULL) {
> + WARN_ON(1);

Looks wrong. It is possible that cwq->thread == NULL, because currently we
never "shrink" cpu_populated_map.

> cleanup_workqueue_thread() in the CPU_DEAD and CPU_UP_CANCELLED path
> will cause a deadlock if the worker thread is executing a work item
> which is blocked on get_online_cpus(). This will lead to a irrecoverable
> hang.

Yes. But there is nothing new. Currently, work->func() can't share the locks
with cpu_down's path. Not only can it not take workqueue_mutex, it can't
take any other lock which could be taken by notifier callbacks, etc.

Can't we ignore this problem, at least for now? I believe we need intrusive
changes to solve this problem correctly. Perhaps I am wrong, of course, but
I don't see a simple solution.

Another option. Note that get_online_cpus() does more than just pin
the cpu maps; it actually blocks hotplug entirely. Now let's look at
schedule_on_each_cpu(), for example. It doesn't need to block hotplug,
it only needs a stable cpu_online_map.

Suppose for a moment that _cpu_down() does cpu_hotplug_done() earlier,
right after __cpu_die(cpu) which removes CPU from the map (yes, this
is wrong, I know). Now, we don't need to change workqueue_cpu_callback(),
work->func() can use get_online_cpus() without fear of deadlock.

So, can't we introduce 2 nested rw locks? The first one blocks cpu hotplug
(like get_online_cpus does currently), the second one just pins cpu maps.
I think most users needs only this, not more.

What do you think?

(Gautham, I apologize in advance, can't be responsive till weekend).

Oleg.

2007-10-24 13:39:04

by Oleg Nesterov

Subject: Re: [RFC PATCH 4/5] Remove CPU_DEAD/CPU_UP_CANCELLED handling from workqueue.c

On 10/24, Rusty Russell wrote:
>
> On Wednesday 24 October 2007 15:37:16 Gautham R Shenoy wrote:
> > @@ -712,7 +712,7 @@ static void start_workqueue_thread(struc
> >
> > if (p != NULL) {
> > if (cpu >= 0)
> > - kthread_bind(p, cpu);
> > + set_cpus_allowed(p, cpumask_of_cpu(cpu));
> > wake_up_process(p);
> > }
>
> Hi Gautham!
>
> This works, although the change is unnecessary.

kthread_bind() changes ->cpu/->cpus_allowed without any locks.

Currently this is possible because nobody can wake up the new thread
after wait_task_inactive() succeeds. With this patch this is no longer
true: cwq->thread can sleep, for example, on cwq->more_work.

Oleg.

2007-10-24 17:05:54

by Christoph Lameter

Subject: Re: [RFC PATCH 0/5] Refcount based Cpu Hotplug. V2

On Wed, 24 Oct 2007, Gautham R Shenoy wrote:

> This is the version 2 of the refcount based cpu-hotplug "locking"
> implementation.

Uggh. This introduces a global lock that has to be taken always when
scanning over cpus? If multiple cpus are scanning over processor lists then
we will get into scaling problems. Wasn't there an RCU based implementation
that avoided cacheline bouncing and refcounting?

> The patchstack which is based against 2.6.23-mm1 has behaved well
> when it was stress tested with kernbench running while continuously
> performing cpu-hotplug operations on i386, x86_64 and ppc64.

What was the highest cpu count in testing?

2007-10-24 18:09:24

by Oleg Nesterov

Subject: Re: [RFC PATCH 4/5] Remove CPU_DEAD/CPU_UP_CANCELLED handling from workqueue.c

On 10/24, Gautham R Shenoy wrote:
>
> On Wed, Oct 24, 2007 at 05:38:18PM +0400, Oleg Nesterov wrote:
> >
> > So, can't we introduce 2 nested rw locks? The first one blocks cpu hotplug
> > (like get_online_cpus does currently), the second one just pins cpu maps.
> > I think most users needs only this, not more.
> >
>
> Well, rw locks/sems cannot recurse. However, refcount model supports
> recursion naturally. Hence the implementation.

No, no, you misunderstood! (I was unclear). I meant, can't we introduce 2
refcounted nested locks? Both implemented as get_online_cpus/cpu_hotplug_begin.

Oleg.

2007-10-24 18:13:15

by Oleg Nesterov

Subject: Re: [RFC PATCH 0/5] Refcount based Cpu Hotplug. V2

On 10/24, Gautham R Shenoy wrote:
>
> On Wed, Oct 24, 2007 at 10:04:41AM -0700, Christoph Lameter wrote:
> > On Wed, 24 Oct 2007, Gautham R Shenoy wrote:
> >
> > > This is the version 2 of the refcount based cpu-hotplug "locking"
> > > implementation.
> >
> > Uggh. This introduces a global lock that has to be taken always when
> > scanning over cpus?
>
> Well, no! we take the global lock only while bumping up the refcount.
> We don't hold the lock while scanning over the cpus. And this is
> definitely an improvement over the lock_cpu_hotplug() global mutex
> we have now.

Just to be sure I didn't miss something... preempt_disable() still works,
yes?

Oleg.

Subject: Re: [RFC PATCH 4/5] Remove CPU_DEAD/CPU_UP_CANCELLED handling from workqueue.c

On Wed, Oct 24, 2007 at 05:38:18PM +0400, Oleg Nesterov wrote:
> On 10/24, Gautham R Shenoy wrote:
> >
>
> (reordered)
>
> > With get_online_cpus()/put_online_cpus(), we can eliminate
> > the workqueue_mutex and reintroduce the workqueue_lock,
> > which is a spinlock which serializes the accesses to the
> > workqueues list.
>
> This change is obviously good, can't it go into the previous patch?

It can. Will repost.

>
> Because,
>
> > Solution is not to cleanup the worker thread. Instead let it remain
> > even after the cpu goes offline. Since no one can queue any work
> > on an offlined cpu, this thread will be forever sleeping, untill
> > someone onlines the cpu.
>
> I still think this patch is questionable. Please look at my previous
> response http://marc.info/?l=linux-kernel&m=119262203729543
>
> In short: with this patch it is not possible to guarantee that work->func()
> will run on the correct CPU.
>
> > static void cleanup_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
> > {
> > /*
> > - * Our caller is either destroy_workqueue() or CPU_DEAD,
> > - * workqueue_mutex protects cwq->thread
> > + * Our caller is destroy_workqueue(). So warn on a double
> > + * destroy.
> > */
> > - if (cwq->thread == NULL)
> > + if (cwq->thread == NULL) {
> > + WARN_ON(1);
>
> Looks wrong. It is possible that cwq->thread == NULL, because currently we
> never "shrink" cpu_populated_map.
>
> > cleanup_workqueue_thread() in the CPU_DEAD and CPU_UP_CANCELLED path
> > will cause a deadlock if the worker thread is executing a work item
> > which is blocked on get_online_cpus(). This will lead to a irrecoverable
> > hang.
>
> Yes. But there is nothing new. Currently, work->func() can't share the locks
> with cpu_down's path. Not only can it not take workqueue_mutex, it can't
> take any other lock which could be taken by notifier callbacks, etc.
>
> Can't we ignore this problem, at least for now? I believe we need intrusive
> changes to solve this problem correctly. Perhaps I am wrong, of course, but
> I don't see a simple solution.

I think you're right. Even with this patch, we obviously can deadlock
if one of the cpu_notifiers (say slab) calls flush_workqueue or
wait_on_work from say CPU_DOWN_PREPARE, and the work in question
is blocked on get_online_cpus().


>
> Another option. Note that get_online_cpus() does more than just pinning
> cpu maps, actually it blocks hotplug entirely. Now let's look at
> schedule_on_each_cpu(), for example. It doesn't need to block hotplug,
> it only needs a stable cpu_online_map.
>
> Suppose for a moment that _cpu_down() does cpu_hotplug_done() earlier,
> right after __cpu_die(cpu) which removes CPU from the map (yes, this
> is wrong, I know). Now, we don't need to change workqueue_cpu_callback(),
> work->func() can use get_online_cpus() without fear of deadlock.
>
> So, can't we introduce 2 nested rw locks? The first one blocks cpu hotplug
> (like get_online_cpus does currently), the second one just pins cpu maps.
> I think most users needs only this, not more.
>

Well, rw locks/sems cannot recurse. However, refcount model supports
recursion naturally. Hence the implementation.

If the threads need safe access to the cpu_online_map and they don't
sleep in that critical section, we can use preempt_disable()/preempt_enable(),
which will block stop_machine_run() and thus cpu_disable().
I think it would be a good idea to provide wrapper APIs which
will make the code easier to read. Also, I need to check if __cpu_up()
can be called using stop_machine_run().
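
A minimal sketch of that non-sleeping pattern (example_sum_events() and
the per-cpu counter are just illustrative names, not a proposed API):

static DEFINE_PER_CPU(unsigned long, example_events);

static unsigned long example_sum_events(void)
{
	unsigned long sum = 0;
	int cpu;

	/* Blocks stop_machine_run(), so no cpu can be taken down here. */
	preempt_disable();
	for_each_online_cpu(cpu)
		sum += per_cpu(example_events, cpu);
	preempt_enable();
	return sum;
}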

However, if the subsystem changes its local variables depending on the
cpu state, e.g. CPU_DOWN_PREPARE, CPU_OFFLINE, etc., then it would
require synchronization with its cpu-notifier. As of now, we have
the per-subsystem cpu-hotplug mutexes providing this by blocking
the cpu-hotplug operation. get_online_cpus() is a substitute
for this, and it also covers the case where a thread can block or be
preempted while it is operating in the cpu-hotplug critical section.

> What do you think?

IIRC, the two nested rw-lock implementation was tried once before,
around a year ago, but it didn't solve the problem of threads
taking these rwlocks recursively.

>
> (Gautham, I apologize in advance, can't be responsive till weekend).
>
> Oleg.
>

Thanks for the review.

Regards
gautham.
--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"

Subject: Re: [RFC PATCH 0/5] Refcount based Cpu Hotplug. V2

On Wed, Oct 24, 2007 at 10:17:54PM +0400, Oleg Nesterov wrote:
> On 10/24, Gautham R Shenoy wrote:
> >
> > On Wed, Oct 24, 2007 at 10:04:41AM -0700, Christoph Lameter wrote:
> > > On Wed, 24 Oct 2007, Gautham R Shenoy wrote:
> > >
> > > > This is the version 2 of the refcount based cpu-hotplug "locking"
> > > > implementation.
> > >
> > > Uggh. This introduces a global lock that has to be taken always when
> > > scanning over cpus?
> >
> > Well, no! we take the global lock only while bumping up the refcount.
> > We don't hold the lock while scanning over the cpus. And this is
> > definitely an improvement over the lock_cpu_hotplug() global mutex
> > we have now.
>
> Just to be sure I didn't miss something... preempt_disable() still works,
> yes?

Yes it does. But it doesn't prevent onlining of new cpus.
I am checking if __cpu_up() can also be called safely using
stop_machine_run() so that we can use preempt_disable() for safe atomic
access of the cpu_online_map.

>
> Oleg.
>

Thanks and Regards
gautham.
--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"

Subject: Re: [RFC PATCH 0/5] Refcount based Cpu Hotplug. V2

On Wed, Oct 24, 2007 at 10:04:41AM -0700, Christoph Lameter wrote:
> On Wed, 24 Oct 2007, Gautham R Shenoy wrote:
>
> > This is the version 2 of the refcount based cpu-hotplug "locking"
> > implementation.
>
> Uggh. This introduces a global lock that has to be taken always when
> scanning over cpus?

Well, no! we take the global lock only while bumping up the refcount.
We don't hold the lock while scanning over the cpus. And this is
definitely an improvement over the lock_cpu_hotplug() global mutex
we have now.

> If multiple cpus are scanning over processor lists then
> we will get into scaling problems. Wasn't there an RCU based implementation
> that avoided cacheline bouncing and refcounting?

Oh yes! You are probably referring to
http://lkml.org/lkml/2006/10/26/73
which was posted a year ago. Unfortunately it didn't get much review,
and we moved on to a bunch of other implementations.

Anyway, the first step would be to move away from locks towards
a refcount model. Later on, we can use RCU + per-cpu refcounts, thereby
making it scalable on the read side.

>
> > The patchstack which is based against 2.6.23-mm1 has behaved well
> > when it was stress tested with kernbench running while continuously
> > performing cpu-hotplug operations on i386, x86_64 and ppc64.
>
> What was the highest cpu count in testing?

8 cpus on i386 and x86_64.
16 cpus on ppc64.

Thanks and Regards
gautham.
--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"