Subject: [RFC PATCH 0/4] Refcount Based Cpu-Hotplug Revisit.

Hi,
This patch series attempts to revisit the topic of the
cpu-hotplug locking model.

Prior to this attempt, there were several different suggestions on
how it should be implemented.

The approaches that were posted earlier were:

a) Refcount + Waitqueue model:
Here, threads that want to keep cpu-hotplug away while they are
operating in a cpu-hotplug critical section bump up a reference
count on the global online-cpu state.
The thread which wants to perform a cpu-hotplug operation
blocks until this reference count goes to zero. Any threads which
want to enter the cpu-hotplug critical section during an ongoing
cpu-hotplug operation are blocked on a waitqueue.

The advantage of this model was that it is along the lines of
the well-known get/put model, except that it allows both readers
and writers to sleep.

The disadvantage, as Andrew pointed out, was that there exist
a whole bunch of lock_cpu_hotplug()'s whose purpose is undocumented,
and an approach like this will not improve that situation.
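
In code, the reader side of this model looks roughly like the
following (a simplified sketch of the idea, not the code in patch 1/4;
the hp structure and its field names are made up for illustration):

	static struct {
		struct mutex lock;	/* held by the writer across the op */
		int refcount;		/* readers in critical sections */
		int writer_waiting;
		struct completion readers_done;
	} hp;

	void get(void)
	{
		mutex_lock(&hp.lock);	/* blocks while a hotplug is in flight */
		hp.refcount++;
		mutex_unlock(&hp.lock);
	}

	void put(void)
	{
		mutex_lock(&hp.lock);
		if (!--hp.refcount && hp.writer_waiting)
			complete(&hp.readers_done); /* last reader out */
		mutex_unlock(&hp.lock);
	}

The writer sets writer_waiting and holds hp.lock for the duration of
the hotplug operation, dropping it only to wait on readers_done while
readers are still inside their critical sections.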

b) Per-subsystem cpu-hotplug locks: Each subsystem which has cpu-hotplug
critical data uses a lock to protect that data. Such a subsystem
needs to subscribe to the cpu-hotplug notifications, especially the
CPU_LOCK_ACQUIRE and CPU_LOCK_RELEASE events, which are sent before
and after a cpu-hotplug operation respectively. While handling these
events, the subsystem lock is taken or released.

The advantage this model offered was that the lock was associated with
the data, which made the purpose of the locking easy to understand.

The disadvantage was that a cpu-hotplug aware function could
not be called from a cpu-hotplug callback path, since we would have
acquired the subsystem lock during CPU_LOCK_ACQUIRE and attempting
to reacquire it would result in a deadlock.
The case which exposed this limitation was the implementation of
synchronize_sched in preemptible rcu.
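
For reference, a subsystem's hotplug callback under this model looked
roughly like the following (a sketch only; foo_mutex and
foo_prepare_cpu are made-up names, but the CPU_LOCK_ACQUIRE/
CPU_LOCK_RELEASE pattern is the one used today by sched.c, slab.c and
workqueue.c):

	static int __cpuinit foo_cpu_callback(struct notifier_block *nfb,
					      unsigned long action, void *hcpu)
	{
		unsigned int cpu = (unsigned long)hcpu;

		switch (action) {
		case CPU_LOCK_ACQUIRE:		/* sent before the hotplug op */
			mutex_lock(&foo_mutex);
			break;
		case CPU_UP_PREPARE:
			foo_prepare_cpu(cpu);	/* foo_mutex already held */
			break;
		case CPU_LOCK_RELEASE:		/* sent after the hotplug op */
			mutex_unlock(&foo_mutex);
			break;
		}
		return NOTIFY_OK;
	}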

c) Freezer-based cpu-hotplug:
The idea here was to freeze the system, using the process-freezer
technique that is used for suspend/hibernate, before
performing a cpu-hotplug operation. This would ensure that none of
the kernel threads are accessing any of the cpu-hotplug critical
data, because they are frozen at well-known points.

This would have helped to remove all kinds of locks, because when a
thread is accessing cpu-hotplug critical data, it means that the
system is not frozen, and hence there can be no cpu-hotplug
operation until the thread either voluntarily calls try_to_freeze
or returns out of the kernel.
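
The "well-known points" are the try_to_freeze() calls that
freezer-aware kernel threads already make; the generic pattern (with a
made-up do_foo_work()) is:

	static int foo_thread(void *unused)
	{
		while (!kthread_should_stop()) {
			try_to_freeze();	/* parks here while frozen */
			do_foo_work();		/* never runs while the
						 * system is frozen */
		}
		return 0;
	}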

The disadvantage of this approach was that any kind of dependency
between threads might cause the freezer to fail. For example, thread A is
waiting for thread B to accomplish something, but thread B is already
frozen, leading to a freeze failure. There could be other subtle
races which might be difficult to track down.

Some time in May 2007, Linus suggested using the refcount model, and
this patch series simplifies and reimplements the refcount + waitqueue
model, based on the past discussions and on inputs from
Vatsa and Rusty.

Patch 1/4: Implements the core refcount + waitqueue model.
Patch 2/4: Replaces all the lock_cpu_hotplug/unlock_cpu_hotplug instances
with get_online_cpus()/put_online_cpus()
Patch 3/4: Replaces the subsystem mutexes (we do have three of them now,
in sched.c, slab.c and workqueue.c) with get_online_cpus,
put_online_cpus.
Patch 4/4: Eliminates the CPU_DEAD and CPU_UP_CANCELED event handling
from workqueue.c

The patch series has survived an overnight test with kernbench on i386,
and has been tested with Paul McKenney's latest preemptible RCU code.

Awaiting thy feedback!

Thanks and Regards
gautham.
--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"


Subject: [RFC PATCH 1/4] Refcount Based Cpu-Hotplug Implementation

This patch implements a Refcount + Waitqueue based model for
cpu-hotplug.

A thread which wants to prevent cpu-hotplug will now bump up a
global refcount, and the thread which wants to perform a cpu-hotplug
operation will block until the global refcount goes to zero.

Any readers that arrive during an ongoing cpu-hotplug operation are
blocked until the cpu-hotplug operation is over.
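
Reader-side usage is unchanged (a hypothetical caller; do_something()
is made up):

	int cpu;

	lock_cpu_hotplug();		/* bumps the global refcount */
	for_each_online_cpu(cpu)
		do_something(cpu);	/* the online map is stable here */
	unlock_cpu_hotplug();		/* wakes up the blocked hotplug
					 * writer if we are the last
					 * reader out */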

Signed-off-by: Gautham R Shenoy <[email protected]>
---
include/linux/cpu.h | 2 +
init/main.c | 1
kernel/cpu.c | 91 ++++++++++++++++++++++++++++++++++++----------------
3 files changed, 67 insertions(+), 27 deletions(-)

Index: linux-2.6.23/init/main.c
===================================================================
--- linux-2.6.23.orig/init/main.c
+++ linux-2.6.23/init/main.c
@@ -614,6 +614,7 @@ asmlinkage void __init start_kernel(void
vfs_caches_init_early();
cpuset_init_early();
mem_init();
+ cpu_hotplug_init();
kmem_cache_init();
setup_per_cpu_pageset();
numa_policy_init();
Index: linux-2.6.23/include/linux/cpu.h
===================================================================
--- linux-2.6.23.orig/include/linux/cpu.h
+++ linux-2.6.23/include/linux/cpu.h
@@ -97,6 +97,7 @@ static inline void cpuhotplug_mutex_unlo
mutex_unlock(cpu_hp_mutex);
}

+extern void cpu_hotplug_init(void);
extern void lock_cpu_hotplug(void);
extern void unlock_cpu_hotplug(void);
#define hotcpu_notifier(fn, pri) { \
@@ -116,6 +117,7 @@ static inline void cpuhotplug_mutex_lock
static inline void cpuhotplug_mutex_unlock(struct mutex *cpu_hp_mutex)
{ }

+#define cpu_hotplug_init() do { } while (0)
#define lock_cpu_hotplug() do { } while (0)
#define unlock_cpu_hotplug() do { } while (0)
#define hotcpu_notifier(fn, pri) do { (void)(fn); } while (0)
Index: linux-2.6.23/kernel/cpu.c
===================================================================
--- linux-2.6.23.orig/kernel/cpu.c
+++ linux-2.6.23/kernel/cpu.c
@@ -17,7 +17,6 @@

/* This protects CPUs going up and down... */
static DEFINE_MUTEX(cpu_add_remove_lock);
-static DEFINE_MUTEX(cpu_bitmask_lock);

static __cpuinitdata RAW_NOTIFIER_HEAD(cpu_chain);

@@ -26,45 +25,83 @@ static __cpuinitdata RAW_NOTIFIER_HEAD(c
*/
static int cpu_hotplug_disabled;

+static struct {
+ struct task_struct *active_writer;
+ struct mutex lock; /* Synchronizes accesses to refcount, */
+ /*
+ * Also blocks the new readers during
+ * an ongoing cpu hotplug operation.
+ */
+ int refcount;
+ struct completion readers_done;
+} cpu_hotplug;
+
+#define writer_exists() (cpu_hotplug.active_writer != NULL)
+
#ifdef CONFIG_HOTPLUG_CPU

-/* Crappy recursive lock-takers in cpufreq! Complain loudly about idiots */
-static struct task_struct *recursive;
-static int recursive_depth;
+void __init cpu_hotplug_init(void)
+{
+ cpu_hotplug.active_writer = NULL;
+ mutex_init(&cpu_hotplug.lock);
+ cpu_hotplug.refcount = 0;
+ init_completion(&cpu_hotplug.readers_done);
+}

void lock_cpu_hotplug(void)
{
- struct task_struct *tsk = current;
-
- if (tsk == recursive) {
- static int warnings = 10;
- if (warnings) {
- printk(KERN_ERR "Lukewarm IQ detected in hotplug locking\n");
- WARN_ON(1);
- warnings--;
- }
- recursive_depth++;
+ might_sleep();
+ if (cpu_hotplug.active_writer == current)
return;
- }
- mutex_lock(&cpu_bitmask_lock);
- recursive = tsk;
+ mutex_lock(&cpu_hotplug.lock);
+ cpu_hotplug.refcount++;
+ mutex_unlock(&cpu_hotplug.lock);
+
}
EXPORT_SYMBOL_GPL(lock_cpu_hotplug);

void unlock_cpu_hotplug(void)
{
- WARN_ON(recursive != current);
- if (recursive_depth) {
- recursive_depth--;
+ if (cpu_hotplug.active_writer == current)
return;
- }
- recursive = NULL;
- mutex_unlock(&cpu_bitmask_lock);
+ mutex_lock(&cpu_hotplug.lock);
+ cpu_hotplug.refcount--;
+
+ if (unlikely(writer_exists()) && !cpu_hotplug.refcount)
+ complete(&cpu_hotplug.readers_done);
+
+ mutex_unlock(&cpu_hotplug.lock);
+
}
EXPORT_SYMBOL_GPL(unlock_cpu_hotplug);

#endif /* CONFIG_HOTPLUG_CPU */

+/*
+ * This ensures that the hotplug operation can begin only when the
+ * refcount goes to zero.
+ *
+ * Note that during a cpu-hotplug operation, the new readers, if any,
+ * will be blocked by the cpu_hotplug.lock
+ */
+static void cpu_hotplug_begin(void)
+{
+ mutex_lock(&cpu_hotplug.lock);
+ cpu_hotplug.active_writer = current;
+ while (cpu_hotplug.refcount) {
+ mutex_unlock(&cpu_hotplug.lock);
+ wait_for_completion(&cpu_hotplug.readers_done);
+ mutex_lock(&cpu_hotplug.lock);
+ }
+
+}
+
+static void cpu_hotplug_done(void)
+{
+ cpu_hotplug.active_writer = NULL;
+ mutex_unlock(&cpu_hotplug.lock);
+}
+
/* Need to know about CPUs going up/down? */
int __cpuinit register_cpu_notifier(struct notifier_block *nb)
{
@@ -147,6 +184,7 @@ static int _cpu_down(unsigned int cpu, i
if (!cpu_online(cpu))
return -EINVAL;

+ cpu_hotplug_begin();
raw_notifier_call_chain(&cpu_chain, CPU_LOCK_ACQUIRE, hcpu);
err = __raw_notifier_call_chain(&cpu_chain, CPU_DOWN_PREPARE | mod,
hcpu, -1, &nr_calls);
@@ -166,9 +204,7 @@ static int _cpu_down(unsigned int cpu, i
cpu_clear(cpu, tmp);
set_cpus_allowed(current, tmp);

- mutex_lock(&cpu_bitmask_lock);
p = __stop_machine_run(take_cpu_down, &tcd_param, cpu);
- mutex_unlock(&cpu_bitmask_lock);

if (IS_ERR(p) || cpu_online(cpu)) {
/* CPU didn't die: tell everyone. Can't complain. */
@@ -203,6 +239,7 @@ out_allowed:
set_cpus_allowed(current, old_allowed);
out_release:
raw_notifier_call_chain(&cpu_chain, CPU_LOCK_RELEASE, hcpu);
+ cpu_hotplug_done();
return err;
}

@@ -231,6 +268,7 @@ static int __cpuinit _cpu_up(unsigned in
if (cpu_online(cpu) || !cpu_present(cpu))
return -EINVAL;

+ cpu_hotplug_begin();
raw_notifier_call_chain(&cpu_chain, CPU_LOCK_ACQUIRE, hcpu);
ret = __raw_notifier_call_chain(&cpu_chain, CPU_UP_PREPARE | mod, hcpu,
-1, &nr_calls);
@@ -243,9 +281,7 @@ static int __cpuinit _cpu_up(unsigned in
}

/* Arch-specific enabling code. */
- mutex_lock(&cpu_bitmask_lock);
ret = __cpu_up(cpu);
- mutex_unlock(&cpu_bitmask_lock);
if (ret != 0)
goto out_notify;
BUG_ON(!cpu_online(cpu));
@@ -258,6 +294,7 @@ out_notify:
__raw_notifier_call_chain(&cpu_chain,
CPU_UP_CANCELED | mod, hcpu, nr_calls, NULL);
raw_notifier_call_chain(&cpu_chain, CPU_LOCK_RELEASE, hcpu);
+ cpu_hotplug_done();

return ret;
}
--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"

Subject: [RFC PATCH 2/4] Rename lock_cpu_hotplug to get_online_cpus

Replace all lock_cpu_hotplug/unlock_cpu_hotplug instances in the kernel
with get_online_cpus and put_online_cpus, as these names highlight
the refcount semantics of the operations.

Signed-off-by: Gautham R Shenoy <[email protected]>
---
arch/i386/kernel/cpu/mtrr/main.c | 8 ++++----
arch/i386/kernel/microcode.c | 16 ++++++++--------
arch/mips/kernel/mips-mt-fpaff.c | 10 +++++-----
arch/powerpc/platforms/pseries/hotplug-cpu.c | 8 ++++----
arch/powerpc/platforms/pseries/rtasd.c | 8 ++++----
drivers/lguest/core.c | 8 ++++----
drivers/s390/char/sclp_config.c | 4 ++--
include/linux/cpu.h | 10 +++++-----
kernel/cpu.c | 8 ++++----
kernel/cpuset.c | 14 +++++++-------
kernel/rcutorture.c | 6 +++---
kernel/stop_machine.c | 4 ++--
net/core/flow.c | 4 ++--
13 files changed, 54 insertions(+), 54 deletions(-)

Index: linux-2.6.23/include/linux/cpu.h
===================================================================
--- linux-2.6.23.orig/include/linux/cpu.h
+++ linux-2.6.23/include/linux/cpu.h
@@ -98,8 +98,8 @@ static inline void cpuhotplug_mutex_unlo
}

extern void cpu_hotplug_init(void);
-extern void lock_cpu_hotplug(void);
-extern void unlock_cpu_hotplug(void);
+extern void get_online_cpus(void);
+extern void put_online_cpus(void);
#define hotcpu_notifier(fn, pri) { \
static struct notifier_block fn##_nb = \
{ .notifier_call = fn, .priority = pri }; \
@@ -117,9 +117,9 @@ static inline void cpuhotplug_mutex_lock
static inline void cpuhotplug_mutex_unlock(struct mutex *cpu_hp_mutex)
{ }

-#define cpu_hotplug_init() do { } while (0)
-#define lock_cpu_hotplug() do { } while (0)
-#define unlock_cpu_hotplug() do { } while (0)
+#define cpu_hotplug_init() do { } while (0)
+#define get_online_cpus() do { } while (0)
+#define put_online_cpus() do { } while (0)
#define hotcpu_notifier(fn, pri) do { (void)(fn); } while (0)
/* These aren't inline functions due to a GCC bug. */
#define register_hotcpu_notifier(nb) ({ (void)(nb); 0; })
Index: linux-2.6.23/kernel/cpu.c
===================================================================
--- linux-2.6.23.orig/kernel/cpu.c
+++ linux-2.6.23/kernel/cpu.c
@@ -48,7 +48,7 @@ void __init cpu_hotplug_init(void)
init_completion(&cpu_hotplug.readers_done);
}

-void lock_cpu_hotplug(void)
+void get_online_cpus(void)
{
might_sleep();
if (cpu_hotplug.active_writer == current)
@@ -58,9 +58,9 @@ void lock_cpu_hotplug(void)
mutex_unlock(&cpu_hotplug.lock);

}
-EXPORT_SYMBOL_GPL(lock_cpu_hotplug);
+EXPORT_SYMBOL_GPL(get_online_cpus);

-void unlock_cpu_hotplug(void)
+void put_online_cpus(void)
{
if (cpu_hotplug.active_writer == current)
return;
@@ -73,7 +73,7 @@ void unlock_cpu_hotplug(void)
mutex_unlock(&cpu_hotplug.lock);

}
-EXPORT_SYMBOL_GPL(unlock_cpu_hotplug);
+EXPORT_SYMBOL_GPL(put_online_cpus);

#endif /* CONFIG_HOTPLUG_CPU */

Index: linux-2.6.23/kernel/stop_machine.c
===================================================================
--- linux-2.6.23.orig/kernel/stop_machine.c
+++ linux-2.6.23/kernel/stop_machine.c
@@ -203,13 +203,13 @@ int stop_machine_run(int (*fn)(void *),
int ret;

/* No CPUs can come up or down during this. */
- lock_cpu_hotplug();
+ get_online_cpus();
p = __stop_machine_run(fn, data, cpu);
if (!IS_ERR(p))
ret = kthread_stop(p);
else
ret = PTR_ERR(p);
- unlock_cpu_hotplug();
+ put_online_cpus();

return ret;
}
Index: linux-2.6.23/arch/i386/kernel/cpu/mtrr/main.c
===================================================================
--- linux-2.6.23.orig/arch/i386/kernel/cpu/mtrr/main.c
+++ linux-2.6.23/arch/i386/kernel/cpu/mtrr/main.c
@@ -351,7 +351,7 @@ int mtrr_add_page(unsigned long base, un
replace = -1;

/* No CPU hotplug when we change MTRR entries */
- lock_cpu_hotplug();
+ get_online_cpus();
/* Search for existing MTRR */
mutex_lock(&mtrr_mutex);
for (i = 0; i < num_var_ranges; ++i) {
@@ -407,7 +407,7 @@ int mtrr_add_page(unsigned long base, un
error = i;
out:
mutex_unlock(&mtrr_mutex);
- unlock_cpu_hotplug();
+ put_online_cpus();
return error;
}

@@ -497,7 +497,7 @@ int mtrr_del_page(int reg, unsigned long

max = num_var_ranges;
/* No CPU hotplug when we change MTRR entries */
- lock_cpu_hotplug();
+ get_online_cpus();
mutex_lock(&mtrr_mutex);
if (reg < 0) {
/* Search for existing MTRR */
@@ -538,7 +538,7 @@ int mtrr_del_page(int reg, unsigned long
error = reg;
out:
mutex_unlock(&mtrr_mutex);
- unlock_cpu_hotplug();
+ put_online_cpus();
return error;
}
/**
Index: linux-2.6.23/arch/i386/kernel/microcode.c
===================================================================
--- linux-2.6.23.orig/arch/i386/kernel/microcode.c
+++ linux-2.6.23/arch/i386/kernel/microcode.c
@@ -436,7 +436,7 @@ static ssize_t microcode_write (struct f
return -EINVAL;
}

- lock_cpu_hotplug();
+ get_online_cpus();
mutex_lock(&microcode_mutex);

user_buffer = (void __user *) buf;
@@ -447,7 +447,7 @@ static ssize_t microcode_write (struct f
ret = (ssize_t)len;

mutex_unlock(&microcode_mutex);
- unlock_cpu_hotplug();
+ put_online_cpus();

return ret;
}
@@ -658,14 +658,14 @@ static ssize_t reload_store(struct sys_d

old = current->cpus_allowed;

- lock_cpu_hotplug();
+ get_online_cpus();
set_cpus_allowed(current, cpumask_of_cpu(cpu));

mutex_lock(&microcode_mutex);
if (uci->valid)
err = cpu_request_microcode(cpu);
mutex_unlock(&microcode_mutex);
- unlock_cpu_hotplug();
+ put_online_cpus();
set_cpus_allowed(current, old);
}
if (err)
@@ -817,9 +817,9 @@ static int __init microcode_init (void)
return PTR_ERR(microcode_pdev);
}

- lock_cpu_hotplug();
+ get_online_cpus();
error = sysdev_driver_register(&cpu_sysdev_class, &mc_sysdev_driver);
- unlock_cpu_hotplug();
+ put_online_cpus();
if (error) {
microcode_dev_exit();
platform_device_unregister(microcode_pdev);
@@ -839,9 +839,9 @@ static void __exit microcode_exit (void)

unregister_hotcpu_notifier(&mc_cpu_notifier);

- lock_cpu_hotplug();
+ get_online_cpus();
sysdev_driver_unregister(&cpu_sysdev_class, &mc_sysdev_driver);
- unlock_cpu_hotplug();
+ put_online_cpus();

platform_device_unregister(microcode_pdev);
}
Index: linux-2.6.23/arch/powerpc/platforms/pseries/hotplug-cpu.c
===================================================================
--- linux-2.6.23.orig/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ linux-2.6.23/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -151,7 +151,7 @@ static int pseries_add_processor(struct
for (i = 0; i < nthreads; i++)
cpu_set(i, tmp);

- lock_cpu_hotplug();
+ get_online_cpus();

BUG_ON(!cpus_subset(cpu_present_map, cpu_possible_map));

@@ -188,7 +188,7 @@ static int pseries_add_processor(struct
}
err = 0;
out_unlock:
- unlock_cpu_hotplug();
+ put_online_cpus();
return err;
}

@@ -209,7 +209,7 @@ static void pseries_remove_processor(str

nthreads = len / sizeof(u32);

- lock_cpu_hotplug();
+ get_online_cpus();
for (i = 0; i < nthreads; i++) {
for_each_present_cpu(cpu) {
if (get_hard_smp_processor_id(cpu) != intserv[i])
@@ -223,7 +223,7 @@ static void pseries_remove_processor(str
printk(KERN_WARNING "Could not find cpu to remove "
"with physical id 0x%x\n", intserv[i]);
}
- unlock_cpu_hotplug();
+ put_online_cpus();
}

static int pseries_smp_notifier(struct notifier_block *nb,
Index: linux-2.6.23/arch/powerpc/platforms/pseries/rtasd.c
===================================================================
--- linux-2.6.23.orig/arch/powerpc/platforms/pseries/rtasd.c
+++ linux-2.6.23/arch/powerpc/platforms/pseries/rtasd.c
@@ -382,7 +382,7 @@ static void do_event_scan_all_cpus(long
{
int cpu;

- lock_cpu_hotplug();
+ get_online_cpus();
cpu = first_cpu(cpu_online_map);
for (;;) {
set_cpus_allowed(current, cpumask_of_cpu(cpu));
@@ -390,15 +390,15 @@ static void do_event_scan_all_cpus(long
set_cpus_allowed(current, CPU_MASK_ALL);

/* Drop hotplug lock, and sleep for the specified delay */
- unlock_cpu_hotplug();
+ put_online_cpus();
msleep_interruptible(delay);
- lock_cpu_hotplug();
+ get_online_cpus();

cpu = next_cpu(cpu, cpu_online_map);
if (cpu == NR_CPUS)
break;
}
- unlock_cpu_hotplug();
+ put_online_cpus();
}

static int rtasd(void *unused)
Index: linux-2.6.23/drivers/s390/char/sclp_config.c
===================================================================
--- linux-2.6.23.orig/drivers/s390/char/sclp_config.c
+++ linux-2.6.23/drivers/s390/char/sclp_config.c
@@ -29,12 +29,12 @@ static void sclp_cpu_capability_notify(s
struct sys_device *sysdev;

printk(KERN_WARNING TAG "cpu capability changed.\n");
- lock_cpu_hotplug();
+ get_online_cpus();
for_each_online_cpu(cpu) {
sysdev = get_cpu_sysdev(cpu);
kobject_uevent(&sysdev->kobj, KOBJ_CHANGE);
}
- unlock_cpu_hotplug();
+ put_online_cpus();
}

static void sclp_conf_receiver_fn(struct evbuf_header *evbuf)
Index: linux-2.6.23/kernel/rcutorture.c
===================================================================
--- linux-2.6.23.orig/kernel/rcutorture.c
+++ linux-2.6.23/kernel/rcutorture.c
@@ -726,11 +726,11 @@ static void rcu_torture_shuffle_tasks(vo
cpumask_t tmp_mask = CPU_MASK_ALL;
int i;

- lock_cpu_hotplug();
+ get_online_cpus();

/* No point in shuffling if there is only one online CPU (ex: UP) */
if (num_online_cpus() == 1) {
- unlock_cpu_hotplug();
+ put_online_cpus();
return;
}

@@ -762,7 +762,7 @@ static void rcu_torture_shuffle_tasks(vo
else
rcu_idle_cpu--;

- unlock_cpu_hotplug();
+ put_online_cpus();
}

/* Shuffle tasks across CPUs, with the intent of allowing each CPU in the
Index: linux-2.6.23/net/core/flow.c
===================================================================
--- linux-2.6.23.orig/net/core/flow.c
+++ linux-2.6.23/net/core/flow.c
@@ -296,7 +296,7 @@ void flow_cache_flush(void)
static DEFINE_MUTEX(flow_flush_sem);

/* Don't want cpus going down or up during this. */
- lock_cpu_hotplug();
+ get_online_cpus();
mutex_lock(&flow_flush_sem);
atomic_set(&info.cpuleft, num_online_cpus());
init_completion(&info.completion);
@@ -308,7 +308,7 @@ void flow_cache_flush(void)

wait_for_completion(&info.completion);
mutex_unlock(&flow_flush_sem);
- unlock_cpu_hotplug();
+ put_online_cpus();
}

static void __devinit flow_cache_cpu_prepare(int cpu)
Index: linux-2.6.23/arch/mips/kernel/mips-mt-fpaff.c
===================================================================
--- linux-2.6.23.orig/arch/mips/kernel/mips-mt-fpaff.c
+++ linux-2.6.23/arch/mips/kernel/mips-mt-fpaff.c
@@ -58,13 +58,13 @@ asmlinkage long mipsmt_sys_sched_setaffi
if (copy_from_user(&new_mask, user_mask_ptr, sizeof(new_mask)))
return -EFAULT;

- lock_cpu_hotplug();
+ get_online_cpus();
read_lock(&tasklist_lock);

p = find_process_by_pid(pid);
if (!p) {
read_unlock(&tasklist_lock);
- unlock_cpu_hotplug();
+ put_online_cpus();
return -ESRCH;
}

@@ -106,7 +106,7 @@ asmlinkage long mipsmt_sys_sched_setaffi

out_unlock:
put_task_struct(p);
- unlock_cpu_hotplug();
+ put_online_cpus();
return retval;
}

@@ -125,7 +125,7 @@ asmlinkage long mipsmt_sys_sched_getaffi
if (len < real_len)
return -EINVAL;

- lock_cpu_hotplug();
+ get_online_cpus();
read_lock(&tasklist_lock);

retval = -ESRCH;
@@ -140,7 +140,7 @@ asmlinkage long mipsmt_sys_sched_getaffi

out_unlock:
read_unlock(&tasklist_lock);
- unlock_cpu_hotplug();
+ put_online_cpus();
if (retval)
return retval;
if (copy_to_user(user_mask_ptr, &mask, real_len))
Index: linux-2.6.23/drivers/lguest/core.c
===================================================================
--- linux-2.6.23.orig/drivers/lguest/core.c
+++ linux-2.6.23/drivers/lguest/core.c
@@ -735,7 +735,7 @@ static int __init init(void)

/* We don't need the complexity of CPUs coming and going while we're
* doing this. */
- lock_cpu_hotplug();
+ get_online_cpus();
if (cpu_has_pge) { /* We have a broader idea of "global". */
/* Remember that this was originally set (for cleanup). */
cpu_had_pge = 1;
@@ -745,7 +745,7 @@ static int __init init(void)
/* Turn off the feature in the global feature set. */
clear_bit(X86_FEATURE_PGE, boot_cpu_data.x86_capability);
}
- unlock_cpu_hotplug();
+ put_online_cpus();

/* All good! */
return 0;
@@ -759,13 +759,13 @@ static void __exit fini(void)
unmap_switcher();

/* If we had PGE before we started, turn it back on now. */
- lock_cpu_hotplug();
+ get_online_cpus();
if (cpu_had_pge) {
set_bit(X86_FEATURE_PGE, boot_cpu_data.x86_capability);
/* adjust_pge's argument "1" means set PGE. */
on_each_cpu(adjust_pge, (void *)1, 0, 1);
}
- unlock_cpu_hotplug();
+ put_online_cpus();
}

/* The Host side of lguest can be a module. This is a nice way for people to
Index: linux-2.6.23/kernel/cpuset.c
===================================================================
--- linux-2.6.23.orig/kernel/cpuset.c
+++ linux-2.6.23/kernel/cpuset.c
@@ -536,10 +536,10 @@ static int cpusets_overlap(struct cpuset
*
* Call with cgroup_mutex held. May take callback_mutex during
* call due to the kfifo_alloc() and kmalloc() calls. May nest
- * a call to the lock_cpu_hotplug()/unlock_cpu_hotplug() pair.
+ * a call to the get_online_cpus()/put_online_cpus() pair.
* Must not be called holding callback_mutex, because we must not
- * call lock_cpu_hotplug() while holding callback_mutex. Elsewhere
- * the kernel nests callback_mutex inside lock_cpu_hotplug() calls.
+ * call get_online_cpus() while holding callback_mutex. Elsewhere
+ * the kernel nests callback_mutex inside get_online_cpus() calls.
* So the reverse nesting would risk an ABBA deadlock.
*
* The three key local variables below are:
@@ -673,9 +673,9 @@ restart:

rebuild:
/* Have scheduler rebuild sched domains */
- lock_cpu_hotplug();
+ get_online_cpus();
partition_sched_domains(ndoms, doms);
- unlock_cpu_hotplug();
+ put_online_cpus();

done:
if (q && !IS_ERR(q))
@@ -1503,10 +1503,10 @@ static struct cgroup_subsys_state *cpuse
*
* If the cpuset being removed has its flag 'sched_load_balance'
* enabled, then simulate turning sched_load_balance off, which
- * will call rebuild_sched_domains(). The lock_cpu_hotplug()
+ * will call rebuild_sched_domains(). The get_online_cpus()
* call in rebuild_sched_domains() must not be made while holding
* callback_mutex. Elsewhere the kernel nests callback_mutex inside
- * lock_cpu_hotplug() calls. So the reverse nesting would risk an
+ * get_online_cpus() calls. So the reverse nesting would risk an
* ABBA deadlock.
*/

--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"

Subject: [RFC PATCH 3/4] Replace per-subsystem mutexes with get_online_cpus

This patch converts the known per-subsystem cpu-hotplug mutexes to
get_online_cpus/put_online_cpus.
It also eliminates the CPU_LOCK_ACQUIRE and CPU_LOCK_RELEASE hotplug
notification events.

Signed-off-by: Gautham R Shenoy <[email protected]>
---
include/linux/notifier.h | 4 +---
kernel/cpu.c | 4 ----
kernel/sched.c | 27 ++++++++++-----------------
kernel/workqueue.c | 31 ++++++++++++++++---------------
mm/slab.c | 18 +++++++++++-------
5 files changed, 38 insertions(+), 46 deletions(-)

Index: linux-2.6.23/kernel/sched.c
===================================================================
--- linux-2.6.23.orig/kernel/sched.c
+++ linux-2.6.23/kernel/sched.c
@@ -360,7 +360,6 @@ struct rq {
};

static DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
-static DEFINE_MUTEX(sched_hotcpu_mutex);

static inline void check_preempt_curr(struct rq *rq, struct task_struct *p)
{
@@ -4337,13 +4336,13 @@ long sched_setaffinity(pid_t pid, cpumas
struct task_struct *p;
int retval;

- mutex_lock(&sched_hotcpu_mutex);
+ get_online_cpus();
read_lock(&tasklist_lock);

p = find_process_by_pid(pid);
if (!p) {
read_unlock(&tasklist_lock);
- mutex_unlock(&sched_hotcpu_mutex);
+ put_online_cpus();
return -ESRCH;
}

@@ -4370,7 +4369,7 @@ long sched_setaffinity(pid_t pid, cpumas

out_unlock:
put_task_struct(p);
- mutex_unlock(&sched_hotcpu_mutex);
+ put_online_cpus();
return retval;
}

@@ -4427,7 +4426,7 @@ long sched_getaffinity(pid_t pid, cpumas
struct task_struct *p;
int retval;

- mutex_lock(&sched_hotcpu_mutex);
+ get_online_cpus();
read_lock(&tasklist_lock);

retval = -ESRCH;
@@ -4443,7 +4442,7 @@ long sched_getaffinity(pid_t pid, cpumas

out_unlock:
read_unlock(&tasklist_lock);
- mutex_unlock(&sched_hotcpu_mutex);
+ put_online_cpus();

return retval;
}
@@ -5342,9 +5341,6 @@ migration_call(struct notifier_block *nf
struct rq *rq;

switch (action) {
- case CPU_LOCK_ACQUIRE:
- mutex_lock(&sched_hotcpu_mutex);
- break;

case CPU_UP_PREPARE:
case CPU_UP_PREPARE_FROZEN:
@@ -5398,7 +5394,7 @@ migration_call(struct notifier_block *nf
BUG_ON(rq->nr_running != 0);

/* No need to migrate the tasks: it was best-effort if
- * they didn't take sched_hotcpu_mutex. Just wake up
+ * they didn't do a get_online_cpus(). Just wake up
* the requestors. */
spin_lock_irq(&rq->lock);
while (!list_empty(&rq->migration_queue)) {
@@ -5412,9 +5408,6 @@ migration_call(struct notifier_block *nf
spin_unlock_irq(&rq->lock);
break;
#endif
- case CPU_LOCK_RELEASE:
- mutex_unlock(&sched_hotcpu_mutex);
- break;
}
return NOTIFY_OK;
}
@@ -6338,10 +6331,10 @@ static int arch_reinit_sched_domains(voi
{
int err;

- mutex_lock(&sched_hotcpu_mutex);
+ get_online_cpus();
detach_destroy_domains(&cpu_online_map);
err = arch_init_sched_domains(&cpu_online_map);
- mutex_unlock(&sched_hotcpu_mutex);
+ put_online_cpus();

return err;
}
@@ -6452,12 +6445,12 @@ void __init sched_init_smp(void)
{
cpumask_t non_isolated_cpus;

- mutex_lock(&sched_hotcpu_mutex);
+ get_online_cpus();
arch_init_sched_domains(&cpu_online_map);
cpus_andnot(non_isolated_cpus, cpu_possible_map, cpu_isolated_map);
if (cpus_empty(non_isolated_cpus))
cpu_set(smp_processor_id(), non_isolated_cpus);
- mutex_unlock(&sched_hotcpu_mutex);
+ put_online_cpus();
/* XXX: Theoretical race here - CPU may be hotplugged now */
hotcpu_notifier(update_sched_domains, 0);

Index: linux-2.6.23/mm/slab.c
===================================================================
--- linux-2.6.23.orig/mm/slab.c
+++ linux-2.6.23/mm/slab.c
@@ -730,8 +730,7 @@ static inline void init_lock_keys(void)
#endif

/*
- * 1. Guard access to the cache-chain.
- * 2. Protect sanity of cpu_online_map against cpu hotplug events
+ * Guard access to the cache-chain.
*/
static DEFINE_MUTEX(cache_chain_mutex);
static struct list_head cache_chain;
@@ -1331,12 +1330,11 @@ static int __cpuinit cpuup_callback(stru
int err = 0;

switch (action) {
- case CPU_LOCK_ACQUIRE:
- mutex_lock(&cache_chain_mutex);
- break;
case CPU_UP_PREPARE:
case CPU_UP_PREPARE_FROZEN:
+ mutex_lock(&cache_chain_mutex);
err = cpuup_prepare(cpu);
+ mutex_unlock(&cache_chain_mutex);
break;
case CPU_ONLINE:
case CPU_ONLINE_FROZEN:
@@ -1373,9 +1371,8 @@ static int __cpuinit cpuup_callback(stru
#endif
case CPU_UP_CANCELED:
case CPU_UP_CANCELED_FROZEN:
+ mutex_lock(&cache_chain_mutex);
cpuup_canceled(cpu);
- break;
- case CPU_LOCK_RELEASE:
mutex_unlock(&cache_chain_mutex);
break;
}
@@ -2170,6 +2167,7 @@ kmem_cache_create (const char *name, siz
* We use cache_chain_mutex to ensure a consistent view of
* cpu_online_map as well. Please see cpuup_callback
*/
+ get_online_cpus();
mutex_lock(&cache_chain_mutex);

list_for_each_entry(pc, &cache_chain, next) {
@@ -2396,6 +2394,7 @@ oops:
panic("kmem_cache_create(): failed to create slab `%s'\n",
name);
mutex_unlock(&cache_chain_mutex);
+ put_online_cpus();
return cachep;
}
EXPORT_SYMBOL(kmem_cache_create);
@@ -2547,9 +2546,11 @@ int kmem_cache_shrink(struct kmem_cache
int ret;
BUG_ON(!cachep || in_interrupt());

+ get_online_cpus();
mutex_lock(&cache_chain_mutex);
ret = __cache_shrink(cachep);
mutex_unlock(&cache_chain_mutex);
+ put_online_cpus();
return ret;
}
EXPORT_SYMBOL(kmem_cache_shrink);
@@ -2575,6 +2576,7 @@ void kmem_cache_destroy(struct kmem_cach
BUG_ON(!cachep || in_interrupt());

/* Find the cache in the chain of caches. */
+ get_online_cpus();
mutex_lock(&cache_chain_mutex);
/*
* the chain is never empty, cache_cache is never destroyed
@@ -2584,6 +2586,7 @@ void kmem_cache_destroy(struct kmem_cach
slab_error(cachep, "Can't free all objects");
list_add(&cachep->next, &cache_chain);
mutex_unlock(&cache_chain_mutex);
+ put_online_cpus();
return;
}

@@ -2592,6 +2595,7 @@ void kmem_cache_destroy(struct kmem_cach

__kmem_cache_destroy(cachep);
mutex_unlock(&cache_chain_mutex);
+ put_online_cpus();
}
EXPORT_SYMBOL(kmem_cache_destroy);

Index: linux-2.6.23/kernel/workqueue.c
===================================================================
--- linux-2.6.23.orig/kernel/workqueue.c
+++ linux-2.6.23/kernel/workqueue.c
@@ -592,8 +592,6 @@ EXPORT_SYMBOL(schedule_delayed_work_on);
* Returns zero on success.
* Returns -ve errno on failure.
*
- * Appears to be racy against CPU hotplug.
- *
* schedule_on_each_cpu() is very slow.
*/
int schedule_on_each_cpu(work_func_t func)
@@ -605,7 +603,7 @@ int schedule_on_each_cpu(work_func_t fun
if (!works)
return -ENOMEM;

- preempt_disable(); /* CPU hotplug */
+ get_online_cpus(); /* CPU hotplug */
for_each_online_cpu(cpu) {
struct work_struct *work = per_cpu_ptr(works, cpu);

@@ -613,7 +611,7 @@ int schedule_on_each_cpu(work_func_t fun
set_bit(WORK_STRUCT_PENDING, work_data_bits(work));
__queue_work(per_cpu_ptr(keventd_wq->cpu_wq, cpu), work);
}
- preempt_enable();
+ put_online_cpus();
flush_workqueue(keventd_wq);
free_percpu(works);
return 0;
@@ -749,8 +747,10 @@ struct workqueue_struct *__create_workqu
err = create_workqueue_thread(cwq, singlethread_cpu);
start_workqueue_thread(cwq, -1);
} else {
+ get_online_cpus();
mutex_lock(&workqueue_mutex);
list_add(&wq->list, &workqueues);
+ mutex_unlock(&workqueue_mutex);

for_each_possible_cpu(cpu) {
cwq = init_cpu_workqueue(wq, cpu);
@@ -759,7 +759,7 @@ struct workqueue_struct *__create_workqu
err = create_workqueue_thread(cwq, cpu);
start_workqueue_thread(cwq, cpu);
}
- mutex_unlock(&workqueue_mutex);
+ put_online_cpus();
}

if (err) {
@@ -809,6 +809,7 @@ void destroy_workqueue(struct workqueue_
struct cpu_workqueue_struct *cwq;
int cpu;

+ get_online_cpus();
mutex_lock(&workqueue_mutex);
list_del(&wq->list);
mutex_unlock(&workqueue_mutex);
@@ -817,6 +818,7 @@ void destroy_workqueue(struct workqueue_
cwq = per_cpu_ptr(wq->cpu_wq, cpu);
cleanup_workqueue_thread(cwq, cpu);
}
+ put_online_cpus();

free_percpu(wq->cpu_wq);
kfree(wq);
@@ -830,22 +832,17 @@ static int __devinit workqueue_cpu_callb
unsigned int cpu = (unsigned long)hcpu;
struct cpu_workqueue_struct *cwq;
struct workqueue_struct *wq;
+ int ret = NOTIFY_OK;

action &= ~CPU_TASKS_FROZEN;

switch (action) {
- case CPU_LOCK_ACQUIRE:
- mutex_lock(&workqueue_mutex);
- return NOTIFY_OK;
-
- case CPU_LOCK_RELEASE:
- mutex_unlock(&workqueue_mutex);
- return NOTIFY_OK;

case CPU_UP_PREPARE:
cpu_set(cpu, cpu_populated_map);
}

+ mutex_lock(&workqueue_mutex);
list_for_each_entry(wq, &workqueues, list) {
cwq = per_cpu_ptr(wq->cpu_wq, cpu);

@@ -853,8 +850,10 @@ static int __devinit workqueue_cpu_callb
case CPU_UP_PREPARE:
if (!create_workqueue_thread(cwq, cpu))
break;
- printk(KERN_ERR "workqueue for %i failed\n", cpu);
- return NOTIFY_BAD;
+ printk(KERN_ERR "workqueue [%s] for %i failed\n",
+ wq->name, cpu);
+ ret = NOTIFY_BAD;
+ goto out_unlock;

case CPU_ONLINE:
start_workqueue_thread(cwq, cpu);
@@ -868,7 +867,9 @@ static int __devinit workqueue_cpu_callb
}
}

- return NOTIFY_OK;
+out_unlock:
+ mutex_unlock(&workqueue_mutex);
+ return ret;
}

void __init init_workqueues(void)
Index: linux-2.6.23/kernel/cpu.c
===================================================================
--- linux-2.6.23.orig/kernel/cpu.c
+++ linux-2.6.23/kernel/cpu.c
@@ -185,7 +185,6 @@ static int _cpu_down(unsigned int cpu, i
return -EINVAL;

cpu_hotplug_begin();
- raw_notifier_call_chain(&cpu_chain, CPU_LOCK_ACQUIRE, hcpu);
err = __raw_notifier_call_chain(&cpu_chain, CPU_DOWN_PREPARE | mod,
hcpu, -1, &nr_calls);
if (err == NOTIFY_BAD) {
@@ -238,7 +237,6 @@ out_thread:
out_allowed:
set_cpus_allowed(current, old_allowed);
out_release:
- raw_notifier_call_chain(&cpu_chain, CPU_LOCK_RELEASE, hcpu);
cpu_hotplug_done();
return err;
}
@@ -269,7 +267,6 @@ static int __cpuinit _cpu_up(unsigned in
return -EINVAL;

cpu_hotplug_begin();
- raw_notifier_call_chain(&cpu_chain, CPU_LOCK_ACQUIRE, hcpu);
ret = __raw_notifier_call_chain(&cpu_chain, CPU_UP_PREPARE | mod, hcpu,
-1, &nr_calls);
if (ret == NOTIFY_BAD) {
@@ -293,7 +290,6 @@ out_notify:
if (ret != 0)
__raw_notifier_call_chain(&cpu_chain,
CPU_UP_CANCELED | mod, hcpu, nr_calls, NULL);
- raw_notifier_call_chain(&cpu_chain, CPU_LOCK_RELEASE, hcpu);
cpu_hotplug_done();

return ret;
Index: linux-2.6.23/include/linux/notifier.h
===================================================================
--- linux-2.6.23.orig/include/linux/notifier.h
+++ linux-2.6.23/include/linux/notifier.h
@@ -207,9 +207,7 @@ static inline int notifier_to_errno(int
#define CPU_DOWN_PREPARE 0x0005 /* CPU (unsigned)v going down */
#define CPU_DOWN_FAILED 0x0006 /* CPU (unsigned)v NOT going down */
#define CPU_DEAD 0x0007 /* CPU (unsigned)v dead */
-#define CPU_LOCK_ACQUIRE 0x0008 /* Acquire all hotcpu locks */
-#define CPU_LOCK_RELEASE 0x0009 /* Release all hotcpu locks */
-#define CPU_DYING 0x000A /* CPU (unsigned)v not running any task,
+#define CPU_DYING 0x0008 /* CPU (unsigned)v not running any task,
* not handling interrupts, soon dead */

/* Used for CPU hotplug events occuring while tasks are frozen due to a suspend
--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"

Subject: [RFC PATCH 4/4] Remove CPU_DEAD/CPU_UP_CANCELED handling from workqueue.c

cleanup_workqueue_thread() in the CPU_DEAD and CPU_UP_CANCELED path
will cause a deadlock if the worker thread is executing a work item
which is blocked on get_online_cpus(). This will lead to an irrecoverable
hang.

The solution is not to clean up the worker thread. Instead, let it remain
even after the cpu goes offline. Since no one can queue any work
on an offlined cpu, this thread will sleep forever, until
someone onlines the cpu.
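
To make the deadlock concrete, consider a hypothetical work item like
this one (foo_work_fn is made up), running on the dying cpu's worker
thread:

	static void foo_work_fn(struct work_struct *work)
	{
		get_online_cpus();	/* blocks: a hotplug op is in flight */
		/* ... access cpu-hotplug critical data ... */
		put_online_cpus();
	}

The CPU_DEAD path then calls cleanup_workqueue_thread(), which waits
for this very worker thread to finish its current work item, so
neither side can make progress.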

Signed-off-by: Gautham R Shenoy <[email protected]>
---
kernel/workqueue.c | 13 ++++++-------
1 file changed, 6 insertions(+), 7 deletions(-)

Index: linux-2.6.23/kernel/workqueue.c
===================================================================
--- linux-2.6.23.orig/kernel/workqueue.c
+++ linux-2.6.23/kernel/workqueue.c
@@ -30,6 +30,7 @@
#include <linux/hardirq.h>
#include <linux/mempolicy.h>
#include <linux/freezer.h>
+#include <linux/cpumask.h>
#include <linux/kallsyms.h>
#include <linux/debug_locks.h>
#include <linux/lockdep.h>
@@ -679,6 +680,7 @@ init_cpu_workqueue(struct workqueue_stru
spin_lock_init(&cwq->lock);
INIT_LIST_HEAD(&cwq->worklist);
init_waitqueue_head(&cwq->more_work);
+ cwq->thread = NULL;

return cwq;
}
@@ -712,7 +714,7 @@ static void start_workqueue_thread(struc

if (p != NULL) {
if (cpu >= 0)
- kthread_bind(p, cpu);
+ set_cpus_allowed(p, cpumask_of_cpu(cpu));
wake_up_process(p);
}
}
@@ -848,6 +850,9 @@ static int __devinit workqueue_cpu_callb

switch (action) {
case CPU_UP_PREPARE:
+ if (likely(cwq->thread != NULL &&
+ !IS_ERR(cwq->thread)))
+ break;
if (!create_workqueue_thread(cwq, cpu))
break;
printk(KERN_ERR "workqueue [%s] for %i failed\n",
@@ -858,12 +863,6 @@ static int __devinit workqueue_cpu_callb
case CPU_ONLINE:
start_workqueue_thread(cwq, cpu);
break;
-
- case CPU_UP_CANCELED:
- start_workqueue_thread(cwq, -1);
- case CPU_DEAD:
- cleanup_workqueue_thread(cwq, cpu);
- break;
}
}

--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"

2007-10-16 17:22:23

by Linus Torvalds

Subject: Re: [RFC PATCH 0/4] Refcount Based Cpu-Hotplug Revisit.



On Tue, 16 Oct 2007, Gautham R Shenoy wrote:
>
> Patch 1/4: Implements the core refcount + waitqueue model.
> Patch 2/4: Replaces all the lock_cpu_hotplug/unlock_cpu_hotplug instances
> with get_online_cpus()/put_online_cpus()
> Patch 3/4: Replaces the subsystem mutexes (we do have three of them now,
> in sched.c, slab.c and workqueue.c) with get_online_cpus,
> put_online_cpus.
> Patch 4/4: Eliminates the CPU_DEAD and CPU_UP_CANCELED event handling
> from workqueue.c
>
> The patch series has survived an overnight test with kernbench on i386,
> and has been tested with Paul McKenney's latest preemptible RCU code.
>
> Awaiting thy feedback!

Well, afaik, the patch series is fairly clean, and I'm obviously perfectly
happy with the approach, so I have no objections.

But it looks buggy. This:

+static void cpu_hotplug_begin(void)
+{
+ mutex_lock(&cpu_hotplug.lock);
+ cpu_hotplug.active_writer = current;
+ while (cpu_hotplug.refcount) {
+ mutex_unlock(&cpu_hotplug.lock);
+ wait_for_completion(&cpu_hotplug.readers_done);
+ mutex_lock(&cpu_hotplug.lock);
+ }
+
+}

drops the cpu_hotplug.lock, which - as far as I can see - means that
another process can come in and do the same, and mess up the
"active_writer" thing. The person that actually *gets* the lock may not be
the same one that has "active_writer" set to itself. No? Am I missing
something?

So I think this needs (a) more people looking at it (I think I found a
bug, who knows if there are more subtle ones lurking) and (b) lots of
testing.

Linus

2007-10-17 00:48:17

by Rusty Russell

Subject: Re: [RFC PATCH 1/4] Refcount Based Cpu-Hotplug Implementation

On Tuesday 16 October 2007 20:34:17 Gautham R Shenoy wrote:
> This patch implements a Refcount + Waitqueue based model for
> cpu-hotplug.

Hi Gautham,

I can't see where you re-initialize the completion.

> +static void cpu_hotplug_begin(void)
> +{
> + mutex_lock(&cpu_hotplug.lock);
> + cpu_hotplug.active_writer = current;
> + while (cpu_hotplug.refcount) {
> + mutex_unlock(&cpu_hotplug.lock);
> + wait_for_completion(&cpu_hotplug.readers_done);
> + mutex_lock(&cpu_hotplug.lock);
> + }

AFAICT this will busy-wait on the second CPU hotplug.

Cheers,
Rusty.

2007-10-17 02:12:54

by Dipankar Sarma

Subject: Re: [RFC PATCH 0/4] Refcount Based Cpu-Hotplug Revisit.

On Tue, Oct 16, 2007 at 10:20:37AM -0700, Linus Torvalds wrote:
> On Tue, 16 Oct 2007, Gautham R Shenoy wrote:
>
> Well, afaik, the patch series is fairly clean, and I'm obviously perfectly
> happy with the approach, so I have no objections.
>
> But it looks buggy. This:
>
> +static void cpu_hotplug_begin(void)
> +{
> + mutex_lock(&cpu_hotplug.lock);
> + cpu_hotplug.active_writer = current;
> + while (cpu_hotplug.refcount) {
> + mutex_unlock(&cpu_hotplug.lock);
> + wait_for_completion(&cpu_hotplug.readers_done);
> + mutex_lock(&cpu_hotplug.lock);
> + }
> +
> +}
>
> drops the cpu_hotplug.lock, which - as far as I can see - means that
> another process can come in and do the same, and mess up the
> "active_writer" thing. The oerson that actually *gets* the lock may not be
> the same one that has "active_writer" set to itself. No? Am I missing
> something.

Unless I am reading the patch wrongly, it seems cpu_hotplug_begin() is called
while holding the cpu_add_remove_lock mutex. So, another CPU cannot come in
and do the same until _cpu_down() is over.

Thanks
Dipankar

2007-10-17 02:30:10

by Linus Torvalds

Subject: Re: [RFC PATCH 0/4] Refcount Based Cpu-Hotplug Revisit.



On Wed, 17 Oct 2007, Dipankar Sarma wrote:
>
> Unless I am reading the patch wrongly, it seems cpu_hotplug_begin() is called
> while holding the cpu_add_remove_lock mutex. So, another CPU cannot come in
> and do the same until _cpu_down() is over.

Ahh, in that case I take back that objection, although maybe a comment
about the required nesting within the other mutex is in order? Or maybe
it's there and I just missed it (blush).

Linus

Subject: Re: [RFC PATCH 0/4] Refcount Based Cpu-Hotplug Revisit.

On Tue, Oct 16, 2007 at 07:23:38PM -0700, Linus Torvalds wrote:
>
>
> On Wed, 17 Oct 2007, Dipankar Sarma wrote:
> >
> > Unless I am reading the patch wrongly, it seems cpu_hotplug_begin() is called
> > while holding the cpu_add_remove_lock mutex. So, another CPU cannot come in
> > and do the same until _cpu_down() is over.
>
> Ahh, in that case I take back that objection, although maybe a comment
> about the required nesting within the other mutex is in order? Or maybe
> it's there and I just missed it (blush).

Good point! I'll add the comment before cpu_hotplug_begin().


>
> Linus

Thanks and Regards
gautham.
--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"

Subject: Re: [RFC PATCH 1/4] Refcount Based Cpu-Hotplug Implementation

On Wed, Oct 17, 2007 at 10:47:41AM +1000, Rusty Russell wrote:
> On Tuesday 16 October 2007 20:34:17 Gautham R Shenoy wrote:
> > This patch implements a Refcount + Waitqueue based model for
> > cpu-hotplug.
>
> Hi Gautham,

Hi Rusty,

>
> I can't see where you re-initialize the completion.

cpu_hotplug.readers_done is a global variable which has been
initialized in cpu_hotplug_init.

So I am wondering, is the re-initialization required?

>
> > +static void cpu_hotplug_begin(void)
> > +{
> > + mutex_lock(&cpu_hotplug.lock);
> > + cpu_hotplug.active_writer = current;
> > + while (cpu_hotplug.refcount) {
> > + mutex_unlock(&cpu_hotplug.lock);
> > + wait_for_completion(&cpu_hotplug.readers_done);
> > + mutex_lock(&cpu_hotplug.lock);
> > + }
>
> AFAICT this will busy-wait on the second CPU hotplug.
>

Well, when the first cpu-hotplug comes out of wait_for_completion, it
would have decremented the ->done count, so it's as good as new
for the second CPU hotplug, no?

> Cheers,
> Rusty.

Thanks and Regards
gautham.
--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"

2007-10-17 06:29:40

by Rusty Russell

Subject: Re: [RFC PATCH 1/4] Refcount Based Cpu-Hotplug Implementation

On Wednesday 17 October 2007 15:37:54 Gautham R Shenoy wrote:
> On Wed, Oct 17, 2007 at 10:47:41AM +1000, Rusty Russell wrote:
> > On Tuesday 16 October 2007 20:34:17 Gautham R Shenoy wrote:
> > > This patch implements a Refcount + Waitqueue based model for
> > > cpu-hotplug.
> >
> > Hi Gautham,
>
> Hi Rusty,
>
> > I can't see where you re-initialize the completion.
>
> The cpu_hotplug.readers_done is a global variable which has been
> initialized in cpu_hotplug_init.
>
> So I am wondering, is the re-initialization required?

Yes. AFAICT you use this completion on every hotplug. Yet once a completion
is completed, it needs to be re-initialized to be reused: it's "complete" and
wait_for_completion will return immediately thereafter.
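
For instance, each complete() banks one count in ->done and each
wait_for_completion() consumes one, so a count left over from a
previous hotplug is consumed immediately (a minimal illustration, not
from the patch):

	struct completion c;

	init_completion(&c);
	complete(&c);			/* banks one count in c.done */
	wait_for_completion(&c);	/* consumes it, returns at once */
	wait_for_completion(&c);	/* only this call actually blocks */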

Perhaps you want a waitqueue instead?

Rusty.

2007-10-17 10:53:45

by Paul Jackson

Subject: Re: [RFC PATCH 1/4] Refcount Based Cpu-Hotplug Implementation

This patch dies on boot for me, using sn2_defconfig. It dies on
the kernel/cpu.c line:

mutex_lock(&cpu_hotplug.lock);

This is in cpu_hotplug_begin(), called from _cpu_up(), called from
cpu_up().

I haven't finished verifying this yet, but it looks like the
initialization of the mutex cpu_hotplug.lock in cpu_hotplug_init()
only happens if CONFIG_HOTPLUG_CPU is enabled, but the use of that
mutex in the above mutex_lock() call happens on all configs.

In the sn2_defconfig I'm using, as well as in several other ia64,
mips and powerpc configs, CONFIG_HOTPLUG_CPU is, at present anyway,
disabled.

For the record, the kernel death messages are:

===

Unable to handle kernel NULL pointer dereference (address 0000000000000000)
swapper[1]: Oops 8804682956800 [1]
Modules linked in:

Pid: 1, CPU 0, comm: swapper
psr : 00001010085a2010 ifs : 800000000000038b ip : [<a00000010081e080>] Not tainted
ip is at __mutex_lock_slowpath+0x280/0x660
unat: 0000000000000000 pfs : 000000000000038b rsc : 0000000000000003
rnat: 1fcc5c7d8b12a4b4 bsps: 000000000001003e pr : 0000000000001041
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70433f
csd : 0000000000000000 ssd : 0000000000000000
b0 : a00000010081e040 b6 : e0000030025bee60 b7 : a0000001005315c0
f6 : 1003e1111111111111111 f7 : 1003e20c49ba5e353f7cf
f8 : 1003e00000000000000c8 f9 : 10006c7fffffffd73ea5c
f10 : 100019ffffffff8a77fd5 f11 : 1003e0000000000000000
r1 : a000000101064760 r2 : a00000010118ef80 r3 : e00000b020377b20
r8 : e00000b020377d00 r9 : 0000000000004000 r10 : e00000b020370b58
r11 : e00000b020370000 r12 : e00000b020377cf0 r13 : e00000b020370000
r14 : 0000000000000001 r15 : e00000b020377d18 r16 : e00000b020377768
r17 : 00000000dead4ead r18 : a000000100d01380 r19 : a000000100e68300
r20 : 0000000000004000 r21 : 0000000000000010 r22 : 0000000000000000
r23 : 0000000000000000 r24 : 0000000000004000 r25 : ffffffffffffffff
r26 : e00000b020377d10 r27 : 0000000000000000 r28 : e00000b020377d00
r29 : e00000b020377d08 r30 : a00000010118efa0 r31 : a00000010118ef98

Call Trace:
[<a000000100013a60>] show_stack+0x40/0xa0
sp=e00000b0203778c0 bsp=e00000b020370f80
[<a000000100014360>] show_regs+0x840/0x880
sp=e00000b020377a90 bsp=e00000b020370f28
[<a000000100036460>] die+0x220/0x300
sp=e00000b020377a90 bsp=e00000b020370ee0
[<a000000100059cd0>] ia64_do_page_fault+0x910/0xa80
sp=e00000b020377a90 bsp=e00000b020370e90
[<a00000010000b120>] ia64_leave_kernel+0x0/0x280
sp=e00000b020377b20 bsp=e00000b020370e90
[<a00000010081e080>] __mutex_lock_slowpath+0x280/0x660
sp=e00000b020377cf0 bsp=e00000b020370e38
[<a00000010081e4a0>] mutex_lock+0x40/0x60
sp=e00000b020377d20 bsp=e00000b020370e18
[<a000000100a2c7e0>] cpu_up+0x220/0x6a0
sp=e00000b020377d20 bsp=e00000b020370dd0
[<a000000100a044d0>] kernel_init+0x250/0x7e0
sp=e00000b020377d30 bsp=e00000b020370d88
[<a000000100011fd0>] kernel_thread_helper+0xd0/0x100
sp=e00000b020377e30 bsp=e00000b020370d60
[<a000000100008dc0>] start_kernel_thread+0x20/0x40
sp=e00000b020377e30 bsp=e00000b020370d60
Kernel panic - not syncing: Attempted to kill init!


--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2007-10-17 11:27:25

by Paul Jackson

Subject: Re: [RFC PATCH 1/4] Refcount Based Cpu-Hotplug Implementation

With the following patch, the four cpu hotplug patches by
Gautham R Shenoy boot successfully on my sn2_defconfig ia64
SN2 Altix system.

This patch moves the cpu_hotplug_init() code outside of the
#ifdef CONFIG_HOTPLUG_CPU code.

I will confess to being confused however as to the best way
to handle this. Having "cpu_hotplug_init" be a piece of code
that has to execute whether or not CONFIG_HOTPLUG_CPU is
enabled doesn't seem right.

Signed-off-by: Paul Jackson <[email protected]>

---

include/linux/cpu.h | 3 +--
kernel/cpu.c | 4 ++--
2 files changed, 3 insertions(+), 4 deletions(-)

--- 2.6.23-mm1.orig/include/linux/cpu.h 2007-10-17 03:59:59.529623639 -0700
+++ 2.6.23-mm1/include/linux/cpu.h 2007-10-17 04:03:56.257223754 -0700
@@ -83,6 +83,7 @@ static inline void unregister_cpu_notifi

#endif /* CONFIG_SMP */
extern struct sysdev_class cpu_sysdev_class;
+extern void cpu_hotplug_init(void);

#ifdef CONFIG_HOTPLUG_CPU
/* Stop CPUs going up and down. */
@@ -97,7 +98,6 @@ static inline void cpuhotplug_mutex_unlo
mutex_unlock(cpu_hp_mutex);
}

-extern void cpu_hotplug_init(void);
extern void get_online_cpus(void);
extern void put_online_cpus(void);
#define hotcpu_notifier(fn, pri) { \
@@ -117,7 +117,6 @@ static inline void cpuhotplug_mutex_lock
static inline void cpuhotplug_mutex_unlock(struct mutex *cpu_hp_mutex)
{ }

-#define cpu_hotplug_init() do { } while (0)
#define get_online_cpus() do { } while (0)
#define put_online_cpus() do { } while (0)
#define hotcpu_notifier(fn, pri) do { (void)(fn); } while (0)
--- 2.6.23-mm1.orig/kernel/cpu.c 2007-10-17 03:59:59.873628872 -0700
+++ 2.6.23-mm1/kernel/cpu.c 2007-10-17 04:04:52.522079165 -0700
@@ -38,8 +38,6 @@ static struct {

#define writer_exists() (cpu_hotplug.active_writer != NULL)

-#ifdef CONFIG_HOTPLUG_CPU
-
void __init cpu_hotplug_init(void)
{
cpu_hotplug.active_writer = NULL;
@@ -48,6 +46,8 @@ void __init cpu_hotplug_init(void)
init_completion(&cpu_hotplug.readers_done);
}

+#ifdef CONFIG_HOTPLUG_CPU
+
void get_online_cpus(void)
{
might_sleep();


--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

Subject: Re: [RFC PATCH 1/4] Refcount Based Cpu-Hotplug Implementation

Hi Paul,

On Wed, Oct 17, 2007 at 04:27:09AM -0700, Paul Jackson wrote:
> With the following patch, the four cpu hotplug patches by
> Gautham R Shenoy boot successfully on my sn2_defconfig ia64
> SN2 Altix system.
>
> This patch moves the cpu_hotplug_init() code outside of the
> #ifdef CONFIG_HOTPLUG_CPU code.
>
> I will confess to being confused however as to the best way
> to handle this. Having "cpu_hotplug_init" be a piece of code
> that has to execute whether or not CONFIG_HOTPLUG_CPU is
> enabled doesn't seem right.

Thanks for the patch!

I had not tested the patch series with !CONFIG_HOTPLUG_CPU,
so no wonder I missed it.

>
> Signed-off-by: Paul Jackson <[email protected]>
>

Acked-by: Gautham R Shenoy <[email protected]>

> ---
>
> include/linux/cpu.h | 3 +--
> kernel/cpu.c | 4 ++--
> 2 files changed, 3 insertions(+), 4 deletions(-)
>
> --- 2.6.23-mm1.orig/include/linux/cpu.h 2007-10-17 03:59:59.529623639 -0700
> +++ 2.6.23-mm1/include/linux/cpu.h 2007-10-17 04:03:56.257223754 -0700
> @@ -83,6 +83,7 @@ static inline void unregister_cpu_notifi
>
> #endif /* CONFIG_SMP */
> extern struct sysdev_class cpu_sysdev_class;
> +extern void cpu_hotplug_init(void);
>
> #ifdef CONFIG_HOTPLUG_CPU
> /* Stop CPUs going up and down. */
> @@ -97,7 +98,6 @@ static inline void cpuhotplug_mutex_unlo
> mutex_unlock(cpu_hp_mutex);
> }
>
> -extern void cpu_hotplug_init(void);
> extern void get_online_cpus(void);
> extern void put_online_cpus(void);
> #define hotcpu_notifier(fn, pri) { \
> @@ -117,7 +117,6 @@ static inline void cpuhotplug_mutex_lock
> static inline void cpuhotplug_mutex_unlock(struct mutex *cpu_hp_mutex)
> { }
>
> -#define cpu_hotplug_init() do { } while (0)
> #define get_online_cpus() do { } while (0)
> #define put_online_cpus() do { } while (0)
> #define hotcpu_notifier(fn, pri) do { (void)(fn); } while (0)
> --- 2.6.23-mm1.orig/kernel/cpu.c 2007-10-17 03:59:59.873628872 -0700
> +++ 2.6.23-mm1/kernel/cpu.c 2007-10-17 04:04:52.522079165 -0700
> @@ -38,8 +38,6 @@ static struct {
>
> #define writer_exists() (cpu_hotplug.active_writer != NULL)
>
> -#ifdef CONFIG_HOTPLUG_CPU
> -
> void __init cpu_hotplug_init(void)
> {
> cpu_hotplug.active_writer = NULL;
> @@ -48,6 +46,8 @@ void __init cpu_hotplug_init(void)
> init_completion(&cpu_hotplug.readers_done);
> }
>
> +#ifdef CONFIG_HOTPLUG_CPU
> +
> void get_online_cpus(void)
> {
> might_sleep();
>
>
> --
> I won't rest till it's the best ...
> Programmer, Linux Scalability
> Paul Jackson <[email protected]> 1.925.600.0401

--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"

2007-10-17 11:52:53

by Oleg Nesterov

Subject: Re: [RFC PATCH 4/4] Remove CPU_DEAD/CPU_UP_CANCELED handling from workqueue.c

On 10/16, Gautham R Shenoy wrote:
>
> cleanup_workqueue_thread() in the CPU_DEAD and CPU_UP_CANCELED path
> will cause a deadlock if the worker thread is executing a work item
> which is blocked on get_online_cpus(). This will lead to an irrecoverable
> hang.
>
> The solution is not to clean up the worker thread. Instead, let it remain
> even after the cpu goes offline. Since no one can queue any work
> on an offlined cpu, this thread will sleep forever, until
> someone onlines the cpu.

Imho: not perfect, but acceptable. We can re-introduce the cwq->should_stop
flag, so that cwq->thread eventually exits after it has flushed ->worklist.
But see below.

> @@ -679,6 +680,7 @@ init_cpu_workqueue(struct workqueue_stru
> spin_lock_init(&cwq->lock);
> INIT_LIST_HEAD(&cwq->worklist);
> init_waitqueue_head(&cwq->more_work);
> + cwq->thread = NULL;

Not needed, cwq->thread == NULL because alloc_percpu() uses GFP_ZERO.

> return cwq;
> }
> @@ -712,7 +714,7 @@ static void start_workqueue_thread(struc
>
> if (p != NULL) {
> if (cpu >= 0)
> - kthread_bind(p, cpu);
> + set_cpus_allowed(p, cpumask_of_cpu(cpu));

OK, this is understandable, but see below...

> wake_up_process(p);
> }
> }
> @@ -848,6 +850,9 @@ static int __devinit workqueue_cpu_callb
>
> switch (action) {
> case CPU_UP_PREPARE:
> + if (likely(cwq->thread != NULL &&
> + !IS_ERR(cwq->thread)))

IS_ERR(cwq->thread) is not possible. cwq->thread is either NULL, or
a valid task_struct.

> + break;
> if (!create_workqueue_thread(cwq, cpu))
> break;
> printk(KERN_ERR "workqueue [%s] for %i failed\n",
> @@ -858,12 +863,6 @@ static int __devinit workqueue_cpu_callb
> case CPU_ONLINE:
> start_workqueue_thread(cwq, cpu);
> break;
> -
> - case CPU_UP_CANCELED:
> - start_workqueue_thread(cwq, -1);
> - case CPU_DEAD:
> - cleanup_workqueue_thread(cwq, cpu);
> - break;
> }
> }

Unfortunately, this approach seems to have a subtle problem.

Suppose that CPU 0 goes down. Let's denote the corresponding cwq->thread as T.
When the CPU goes offline, T is no longer bound to CPU 0. So far this is OK.

Now suppose that CPU 0 goes online again. There is a window before CPU_ONLINE
(note that CPU 0 is already online at this time) binds T back to CPU 0. It is
possible that another thread running on CPU 0 does schedule_work() in that window,
or that a thread on any CPU calls schedule_delayed_work_on(0).

In this case the work_struct may run on the wrong CPU, because T is ready to
execute the work_struct; this is a bug. work->func() has every right to expect
that cwq->thread is bound to the right CPU.

Let's take cache_reap() as an example. It is critical that this function is
really "per cpu", otherwise it can use the wrong per-cpu data, and even the
last schedule_delayed_work() is not reliable.

(Sorry, have no time to read the other patches now, will do on weekend).

Oleg.

2007-10-17 12:04:31

by Paul Jackson

Subject: Re: [RFC PATCH 1/4] Refcount Based Cpu-Hotplug Implementation

Gautham wrote:
> Thanks for the patch!

You're welcome.

I wonder what the chances are of this patch set making it into 2.6.24.

The cgroup (aka container) folks Paul Menage and David Rientjes are
working with me on the (hopefully) last fix for getting cpusets to work
with the cgroup changes. This involves how cpusets adapts to CPUs
coming and going from the system due to hotplug/hotunplug or coming and
going from cpusets due to user changes in the cpuset configuraton.

Not surprisingly, the code we are struggling with depends on the CPU
hotplug support, and this present patch set of yours looks to be a
fine improvement to that.

However, whatever we do for this last fix has to make it into 2.6.24,
because cgroups themselves are targeted for 2.6.24, and I can't let
that lead to any serious regressions in existing cpuset capability.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2007-10-17 16:13:30

by Nathan Lynch

Subject: Re: [RFC PATCH 2/4] Rename lock_cpu_hotplug to get_online_cpus

Hi Gautham-

Gautham R Shenoy wrote:
> Replace all lock_cpu_hotplug/unlock_cpu_hotplug instances in the kernel
> with get_online_cpus and put_online_cpus, as these names highlight
> the refcount semantics of these operations.

Something other than "get_online_cpus", please? lock_cpu_hotplug()
protects cpu_present_map as well as cpu_online_map. For example, some
of the powerpc code modified in this patch is made a bit less clear
because it is manipulating cpu_present_map, not cpu_online_map.


> Index: linux-2.6.23/arch/powerpc/platforms/pseries/hotplug-cpu.c
> ===================================================================
> --- linux-2.6.23.orig/arch/powerpc/platforms/pseries/hotplug-cpu.c
> +++ linux-2.6.23/arch/powerpc/platforms/pseries/hotplug-cpu.c
> @@ -151,7 +151,7 @@ static int pseries_add_processor(struct
> for (i = 0; i < nthreads; i++)
> cpu_set(i, tmp);
>
> - lock_cpu_hotplug();
> + get_online_cpus();
>
> BUG_ON(!cpus_subset(cpu_present_map, cpu_possible_map));
>
> @@ -188,7 +188,7 @@ static int pseries_add_processor(struct
> }
> err = 0;
> out_unlock:
> - unlock_cpu_hotplug();
> + put_online_cpus();
> return err;
> }
>
> @@ -209,7 +209,7 @@ static void pseries_remove_processor(str
>
> nthreads = len / sizeof(u32);
>
> - lock_cpu_hotplug();
> + get_online_cpus();
> for (i = 0; i < nthreads; i++) {
> for_each_present_cpu(cpu) {
> if (get_hard_smp_processor_id(cpu) != intserv[i])
> @@ -223,7 +223,7 @@ static void pseries_remove_processor(str
> printk(KERN_WARNING "Could not find cpu to remove "
> "with physical id 0x%x\n", intserv[i]);
> }
> - unlock_cpu_hotplug();
> + put_online_cpus();
> }

Subject: Re: [RFC PATCH 1/4] Refcount Based Cpu-Hotplug Implementation

On Wed, Oct 17, 2007 at 04:29:12PM +1000, Rusty Russell wrote:
> On Wednesday 17 October 2007 15:37:54 Gautham R Shenoy wrote:
> > On Wed, Oct 17, 2007 at 10:47:41AM +1000, Rusty Russell wrote:
> > > On Tuesday 16 October 2007 20:34:17 Gautham R Shenoy wrote:
> > > > This patch implements a Refcount + Waitqueue based model for
> > > > cpu-hotplug.
> > >
> > > Hi Gautham,
> >
> > Hi Rusty,
> >
> > > I can't see where you re-initialize the completion.
> >
> > The cpu_hotplug.readers_done is a global variable which has been
> > initialized in cpu_hotplug_init.
> >
> > So I am wondering is the re-initialization required ?
>
> Yes. AFAICT you use this completion on every hotplug. Yet once a completion
> is completed, it needs to be re-initialized to be reused: it's "complete" and
> wait_for_completion will return immediately thereafter.
>

Okay, I thought completions followed spinlock semantics, and hence
didn't require re-initialization. Thanks for that information!

> Perhaps you want a waitqueue instead?

Yes, I had considered it. But completion looked appealing since it
already had the wait-queuing code. I'll give it a try and repost
the series with other changes.

>
> Rusty.

Thanks and Regards
gautham.
--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"

Subject: Re: [RFC PATCH 2/4] Rename lock_cpu_hotplug to get_online_cpus

Hi Nathan,
> Hi Gautham-
>
> Gautham R Shenoy wrote:
> > Replace all lock_cpu_hotplug/unlock_cpu_hotplug instances in the kernel
> > with get_online_cpus and put_online_cpus, as these names highlight
> > the refcount semantics of these operations.
>
> Something other than "get_online_cpus", please? lock_cpu_hotplug()
> protects cpu_present_map as well as cpu_online_map. For example, some
> of the powerpc code modified in this patch is made a bit less clear
> because it is manipulating cpu_present_map, not cpu_online_map.

A quick look at the code, and I am wondering why lock_cpu_hotplug() is
used there in the first place. It doesn't look like we require any
protection against cpus coming up/going down in the code below,
since the cpu-hotplug operation doesn't affect the cpu_present_map.

Can't we use another mutex instead of the cpu_hotplug mutex here?


>
>
> > Index: linux-2.6.23/arch/powerpc/platforms/pseries/hotplug-cpu.c
> > ===================================================================
> > --- linux-2.6.23.orig/arch/powerpc/platforms/pseries/hotplug-cpu.c
> > +++ linux-2.6.23/arch/powerpc/platforms/pseries/hotplug-cpu.c
> > @@ -151,7 +151,7 @@ static int pseries_add_processor(struct
> > for (i = 0; i < nthreads; i++)
> > cpu_set(i, tmp);
> >
> > - lock_cpu_hotplug();
> > + get_online_cpus();
> >
> > BUG_ON(!cpus_subset(cpu_present_map, cpu_possible_map));
> >
> > @@ -188,7 +188,7 @@ static int pseries_add_processor(struct
> > }
> > err = 0;
> > out_unlock:
> > - unlock_cpu_hotplug();
> > + put_online_cpus();
> > return err;
> > }
> >
> > @@ -209,7 +209,7 @@ static void pseries_remove_processor(str
> >
> > nthreads = len / sizeof(u32);
> >
> > - lock_cpu_hotplug();
> > + get_online_cpus();
> > for (i = 0; i < nthreads; i++) {
> > for_each_present_cpu(cpu) {
> > if (get_hard_smp_processor_id(cpu) != intserv[i])
> > @@ -223,7 +223,7 @@ static void pseries_remove_processor(str
> > printk(KERN_WARNING "Could not find cpu to remove "
> > "with physical id 0x%x\n", intserv[i]);
> > }
> > - unlock_cpu_hotplug();
> > + put_online_cpus();
> > }


--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"

2007-10-18 08:23:10

by Nathan Lynch

Subject: Re: [RFC PATCH 2/4] Rename lock_cpu_hotplug to get_online_cpus

Gautham R Shenoy wrote:
> Hi Nathan,
> > Hi Gautham-
> >
> > Gautham R Shenoy wrote:
> > > Replace all lock_cpu_hotplug/unlock_cpu_hotplug instances in the kernel
> > > with get_online_cpus and put_online_cpus, as these names highlight
> > > the refcount semantics of these operations.
> >
> > Something other than "get_online_cpus", please? lock_cpu_hotplug()
> > protects cpu_present_map as well as cpu_online_map. For example, some
> > of the powerpc code modified in this patch is made a bit less clear
> > because it is manipulating cpu_present_map, not cpu_online_map.
>
> A quick look at the code, and I am wondering why lock_cpu_hotplug() is
> used there in the first place. It doesn't look like we require any
> protection against cpus coming up/going down in the code below,
> since the cpu-hotplug operation doesn't affect the cpu_present_map.

The locking is necessary. Changes to cpu_online_map and
cpu_present_map must be serialized; otherwise you could end up trying
to online a cpu as it is being removed (i.e. cleared from
cpu_present_map). Online operations must check that a cpu is present
before bringing it up (kernel/cpu.c):

/* Requires cpu_add_remove_lock to be held */
static int __cpuinit _cpu_up(unsigned int cpu, int tasks_frozen)
{
	int ret, nr_calls = 0;
	void *hcpu = (void *)(long)cpu;
	unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;

	if (cpu_online(cpu) || !cpu_present(cpu))
		return -EINVAL;
	....

> Can't we use another mutex instead of the cpu_hotplug mutex here?

I guess so, but I don't really see the need...

Subject: Re: [RFC PATCH 2/4] Rename lock_cpu_hotplug to get_online_cpus

On Thu, Oct 18, 2007 at 03:22:21AM -0500, Nathan Lynch wrote:
> Gautham R Shenoy wrote:
> > Hi Nathan,
> > > Hi Gautham-
> > >
> > > Gautham R Shenoy wrote:
> > > > Replace all lock_cpu_hotplug/unlock_cpu_hotplug instances in the kernel
> > > > with get_online_cpus and put_online_cpus, as these names highlight
> > > > the refcount semantics of these operations.
> > >
> > > Something other than "get_online_cpus", please? lock_cpu_hotplug()
> > > protects cpu_present_map as well as cpu_online_map. For example, some
> > > of the powerpc code modified in this patch is made a bit less clear
> > > because it is manipulating cpu_present_map, not cpu_online_map.
> >
> > A quick look at the code, and I am wondering why lock_cpu_hotplug() is
> > used there in the first place. It doesn't look like we require any
> > protection against cpus coming up/going down in the code below,
> > since the cpu-hotplug operation doesn't affect the cpu_present_map.
>
> The locking is necessary. Changes to cpu_online_map and
> cpu_present_map must be serialized; otherwise you could end up trying
> to online a cpu as it is being removed (i.e. cleared from
> cpu_present_map). Online operations must check that a cpu is present
> before bringing it up (kernel/cpu.c):

Fair enough!

But we are not protecting the cpu_present_map here using
lock_cpu_hotplug(), now are we?

The lock_cpu_hotplug() here ensures that no cpu-hotplug operations
occur in parallel with a processor add or a processor remove.
IOW, we're still ensuring that the cpu_online_map doesn't change
while we're changing the cpu_present_map. So I don't see why the name
get_online_cpus() should be a problem here. Maybe we could add a
comment as to why we don't want a cpu-hotplug operation to happen while
we're adding/removing a processor.

Unless, of course, lock_cpu_hotplug() is also used to serialize
the add_processor and remove_processor operations. Is that the case
here?

Thanks and Regards
gautham.
>
> /* Requires cpu_add_remove_lock to be held */
> static int __cpuinit _cpu_up(unsigned int cpu, int tasks_frozen)
> {
> 	int ret, nr_calls = 0;
> 	void *hcpu = (void *)(long)cpu;
> 	unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
>
> 	if (cpu_online(cpu) || !cpu_present(cpu))
> 		return -EINVAL;
> 	....
>
> > Can't we use another mutex instead of the cpu_hotplug mutex here?
>
> I guess so, but I don't really see the need...
>

--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"

2007-10-18 17:31:18

by Nathan Lynch

Subject: Re: [RFC PATCH 2/4] Rename lock_cpu_hotplug to get_online_cpus

Gautham R Shenoy wrote:
> On Thu, Oct 18, 2007 at 03:22:21AM -0500, Nathan Lynch wrote:
> > Gautham R Shenoy wrote:
> > > Hi Nathan,
> > > > Hi Gautham-
> > > >
> > > > Gautham R Shenoy wrote:
> > > > > Replace all lock_cpu_hotplug/unlock_cpu_hotplug instances in the kernel
> > > > > with get_online_cpus and put_online_cpus, as these names highlight
> > > > > the refcount semantics of these operations.
> > > >
> > > > Something other than "get_online_cpus", please? lock_cpu_hotplug()
> > > > protects cpu_present_map as well as cpu_online_map. For example, some
> > > > of the powerpc code modified in this patch is made a bit less clear
> > > > because it is manipulating cpu_present_map, not cpu_online_map.
> > >
> > > A quick look at the code, and I am wondering why lock_cpu_hotplug() is
> > > used there in the first place. It doesn't look like we require any
> > > protection against cpus coming up/going down in the code below,
> > > since the cpu-hotplug operation doesn't affect the cpu_present_map.
> >
> > The locking is necessary. Changes to cpu_online_map and
> > cpu_present_map must be serialized; otherwise you could end up trying
> > to online a cpu as it is being removed (i.e. cleared from
> > cpu_present_map). Online operations must check that a cpu is present
> > before bringing it up (kernel/cpu.c):
>
> Fair enough!
>
> But we are not protecting the cpu_present_map here using
> lock_cpu_hotplug(), now are we?

Yes, we are. In addition to the above, updates to cpu_present_map
have to be serialized. pseries_add_processor can be summed up as
"find the first N unset bits in cpu_present_map and set them". That's
not an atomic operation, so some kind of mutual exclusion is needed.
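
Schematically, the update looks like this (an illustrative sketch, not
the real function; claim_present_cpus() is a made-up name):

	static void claim_present_cpus(int nthreads)
	{
		cpumask_t candidates;
		int cpu;

		/* possible but not present == available slots */
		cpus_xor(candidates, cpu_possible_map, cpu_present_map);
		for_each_cpu_mask(cpu, candidates) {
			/* scan-then-set: two unserialized callers can
			 * claim the same bits here */
			cpu_set(cpu, cpu_present_map);
			if (--nthreads == 0)
				break;
		}
	}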


> The lock_cpu_hotplug() here ensures that no cpu-hotplug operations
> occur in parallel with a processor add or a processor remove.

That's one important effect, but not the only one (see above).


> IOW, we're still ensuring that the cpu_online_map doesn't change
> while we're changing the cpu_present_map. So I don't see why the name
> get_online_cpus() should be a problem here.

The naming is a problem IMO for two reasons:

- lock_cpu_hotplug() protects cpu_present_map as well as
cpu_online_map (sigh, I see that Documentation/cpu-hotplug.txt
disagrees with me, but my statement holds for powerpc, at least).

- get_online_cpus() implies reference count semantics (as stated in
the changelog) but AFAICT it really has a reference count
implementation with read-write locking semantics.

Hmm, I think there's another problem here. With your changes, code
which relies on the mutual exclusion behavior of lock_cpu_hotplug()
(such as pseries_add/remove_processor) will now be able to run
concurrently. Probably those functions should use
cpu_hotplug_begin/end instead.

Subject: Re: [RFC PATCH 2/4] Rename lock_cpu_hotplug to get_online_cpus

Hi Nathan,

> Gautham R Shenoy wrote:
> > On Thu, Oct 18, 2007 at 03:22:21AM -0500, Nathan Lynch wrote:
> > > Gautham R Shenoy wrote:
> > > > Hi Nathan,
> > > > > Hi Gautham-
> > > > >
> > > > > Gautham R Shenoy wrote:
> > > > > > Replace all lock_cpu_hotplug/unlock_cpu_hotplug instances in the kernel
> > > > > > with get_online_cpus and put_online_cpus, as these names highlight
> > > > > > the refcount semantics of these operations.
> > > > >
> > > > > Something other than "get_online_cpus", please? lock_cpu_hotplug()
> > > > > protects cpu_present_map as well as cpu_online_map. For example, some
> > > > > of the powerpc code modified in this patch is made a bit less clear
> > > > > because it is manipulating cpu_present_map, not cpu_online_map.
> > > >
> > > > A quick look at the code, and I am wondering why lock_cpu_hotplug() is
> > > > used there in the first place. It doesn't look like we require any
> > > > protection against cpus coming up/going down in the code below,
> > > > since the cpu-hotplug operation doesn't affect the cpu_present_map.
> > >
> > > The locking is necessary. Changes to cpu_online_map and
> > > cpu_present_map must be serialized; otherwise you could end up trying
> > > to online a cpu as it is being removed (i.e. cleared from
> > > cpu_present_map). Online operations must check that a cpu is present
> > > before bringing it up (kernel/cpu.c):
> >
> > Fair enough!
> >
> > But we are not protecting the cpu_present_map here using
> > lock_cpu_hotplug(), now are we?
>
> Yes, we are. In addition to the above, updates to cpu_present_map
> have to be serialized. pseries_add_processor can be summed up as
> "find the first N unset bits in cpu_present_map and set them". That's
> not an atomic operation, so some kind of mutual exclusion is needed.
>

Okay. But other than pseries_add_processor and pseries_remove_processor,
are there any other places where we _change_ the cpu_present_map?

I agree that we need some kind of synchronization for threads which read
the cpu_present_map. But probably we can use a separate mutex for that.



>
> > The lock_cpu_hotplug() here ensures that no cpu-hotplug operations
> > occur in parallel with a processor add or a processor remove.
>
> That's one important effect, but not the only one (see above).
>
>
> > IOW, we're still ensuring that the cpu_online_map doesn't change
> > while we're changing the cpu_present_map. So I don't see why the name
> > get_online_cpus() should be a problem here.
>
> The naming is a problem IMO for two reasons:
>
> - lock_cpu_hotplug() protects cpu_present_map as well as
> cpu_online_map (sigh, I see that Documentation/cpu-hotplug.txt
> disagrees with me, but my statement holds for powerpc, at least).
>
> - get_online_cpus() implies reference count semantics (as stated in
> the changelog) but AFAICT it really has a reference count
> implementation with read-write locking semantics.
>
> Hmm, I think there's another problem here. With your changes, code
> which relies on the mutual exclusion behavior of lock_cpu_hotplug()
> (such as pseries_add/remove_processor) will now be able to run
> concurrently. Probably those functions should use
> cpu_hotplug_begin/end instead.

One of the primary reasons to move away from the mutex semantics for
cpu-hotplug protection was that there were a lot of places where
lock_cpu_hotplug() was used for protecting local data structures too,
when they had nothing to do with cpu-hotplug as such, and it resulted
in a whole mess. It took people quite some time to sort things out
with the cpufreq subsystem.

Probably it would be a lot cleaner if we use get_online_cpus() for
protection against cpu-hotplug and use specific mutexes for serializing
accesses to local data structures. Thoughts?

Thanks and Regards
gautham.

--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"

2007-10-21 11:34:19

by Oleg Nesterov

Subject: Re: [RFC PATCH 3/4] Replace per-subsystem mutexes with get_online_cpus

On 10/16, Gautham R Shenoy wrote:
>
> This patch converts the known per-subsystem cpu_hotplug mutexes to
> get_online_cpus put_online_cpus.
> It also eliminates the CPU_LOCK_ACQUIRE and CPU_LOCK_RELEASE hotplug
> notification events.

Personally, I like the changes in workqueue.c very much, a couple of
minor nits below.

> --- linux-2.6.23.orig/kernel/workqueue.c
> +++ linux-2.6.23/kernel/workqueue.c
> @@ -592,8 +592,6 @@ EXPORT_SYMBOL(schedule_delayed_work_on);
> * Returns zero on success.
> * Returns -ve errno on failure.
> *
> - * Appears to be racy against CPU hotplug.
> - *

see below,

> * schedule_on_each_cpu() is very slow.
> */
> int schedule_on_each_cpu(work_func_t func)
> @@ -605,7 +603,7 @@ int schedule_on_each_cpu(work_func_t fun
> if (!works)
> return -ENOMEM;
>
> - preempt_disable(); /* CPU hotplug */
> + get_online_cpus(); /* CPU hotplug */
> for_each_online_cpu(cpu) {
> struct work_struct *work = per_cpu_ptr(works, cpu);
>
> @@ -613,7 +611,7 @@ int schedule_on_each_cpu(work_func_t fun
> set_bit(WORK_STRUCT_PENDING, work_data_bits(work));
> __queue_work(per_cpu_ptr(keventd_wq->cpu_wq, cpu), work);
> }
> - preempt_enable();
> + put_online_cpus();
> flush_workqueue(keventd_wq);

Still racy. To kill the race, please move flush_workqueue() up, before
put_online_cpus().
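
I.e., something like this (a sketch of schedule_on_each_cpu() with the
flush moved, using the names from the hunk above):

	get_online_cpus();
	for_each_online_cpu(cpu) {
		struct work_struct *work = per_cpu_ptr(works, cpu);

		INIT_WORK(work, func);
		set_bit(WORK_STRUCT_PENDING, work_data_bits(work));
		__queue_work(per_cpu_ptr(keventd_wq->cpu_wq, cpu), work);
	}
	/* flush before put: no cpu can go away with work still queued */
	flush_workqueue(keventd_wq);
	put_online_cpus();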

> @@ -809,6 +809,7 @@ void destroy_workqueue(struct workqueue_
> struct cpu_workqueue_struct *cwq;
> int cpu;
>
> + get_online_cpus();
> mutex_lock(&workqueue_mutex);
> list_del(&wq->list);
> mutex_unlock(&workqueue_mutex);
> @@ -817,6 +818,7 @@ void destroy_workqueue(struct workqueue_
> cwq = per_cpu_ptr(wq->cpu_wq, cpu);
> cleanup_workqueue_thread(cwq, cpu);
> }
> + put_online_cpus();

Correct, but I'd suggest to do put_online_cpus() earlier, right after
mutex_unlock(&workqueue_mutex).

> @@ -830,22 +832,17 @@ static int __devinit workqueue_cpu_callb
> unsigned int cpu = (unsigned long)hcpu;
> struct cpu_workqueue_struct *cwq;
> struct workqueue_struct *wq;
> + int ret = NOTIFY_OK;
>
> action &= ~CPU_TASKS_FROZEN;
>
> switch (action) {
> - case CPU_LOCK_ACQUIRE:
> - mutex_lock(&workqueue_mutex);
> - return NOTIFY_OK;
> -
> - case CPU_LOCK_RELEASE:
> - mutex_unlock(&workqueue_mutex);
> - return NOTIFY_OK;
>

please remove this empty line

> case CPU_UP_PREPARE:
> cpu_set(cpu, cpu_populated_map);
> }
>
> + mutex_lock(&workqueue_mutex);

We don't need workqueue_mutex here. With your patch workqueue_mutex protects
list_head, nothing more. But all other callers (create/destroy) should take
get_online_cpus() anyway. This means that we can convert workqueue_mutex to
spinlock_t.
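
I.e., something like this (sketch; "workqueue_lock" is a name I just
made up for the converted lock):

	static DEFINE_SPINLOCK(workqueue_lock);

	/* create_workqueue() side: */
	spin_lock(&workqueue_lock);
	list_add(&wq->list, &workqueues);
	spin_unlock(&workqueue_lock);

	/* destroy_workqueue() side: */
	spin_lock(&workqueue_lock);
	list_del(&wq->list);
	spin_unlock(&workqueue_lock);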

Oleg.

2007-10-21 12:43:01

by Oleg Nesterov

Subject: Re: [RFC PATCH 1/4] Refcount Based Cpu-Hotplug Implementation

On 10/17, Gautham R Shenoy wrote:
>
> On Wed, Oct 17, 2007 at 10:47:41AM +1000, Rusty Russell wrote:
> >
> > I can't see where you re-initialize the completion.
>
> The cpu_hotplug.readers_done is a global variable which has been
> initialized in cpu_hotplug_init.
>
> So I am wondering is the re-initialization required ?

I don't understand why we should re-initialize the completion either,
but see below.

> > > +static void cpu_hotplug_begin(void)
> > > +{
> > > + mutex_lock(&cpu_hotplug.lock);
> > > + cpu_hotplug.active_writer = current;
> > > + while (cpu_hotplug.refcount) {
> > > + mutex_unlock(&cpu_hotplug.lock);
> > > + wait_for_completion(&cpu_hotplug.readers_done);
> > > + mutex_lock(&cpu_hotplug.lock);
> > > + }
> >
> > AFAICT this will busy-wait on the second CPU hotplug.

Why?

> Well when the first cpu_hotplug comes out of wait_for_completion, it
> would have decremented the ->done count, so it's as good as new
> for the second CPU hotplug, no?

No, because we decrement the ->done count only once, but there is no
guarantee that ->done == 1 when we get the CPU after wakeup. Another
reader can do lock_cpu_hotplug/unlock_cpu_hotplug in between, so we have
a race.
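
To spell out where the stale count comes from, the reader-side exit
path presumably looks roughly like this (my reading of patch 1/4,
simplified; the exact code may differ):

	void put_online_cpus(void)
	{
		mutex_lock(&cpu_hotplug.lock);
		/* every last-reader exit while a writer waits bumps ->done */
		if (!--cpu_hotplug.refcount && cpu_hotplug.active_writer)
			complete(&cpu_hotplug.readers_done);
		mutex_unlock(&cpu_hotplug.lock);
	}

The writer's wait_for_completion() consumes only one ->done count, so a
reader pair in the window can leave ->done > 0 behind.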

But I disagree with "Yet once a completion is completed, it needs to be
re-initialized to be reused: it's "complete" and wait_for_completion
will return immediately thereafter".

Rusty, could you please clarify?

Side note, we don't block the new readers while cpu_hotplug_begin() waits
for the completion. I don't think this is a problem, but perhaps it makes
sense to document the possible livelock.

Oleg.

2007-10-22 00:43:25

by Nathan Lynch

Subject: Re: [RFC PATCH 2/4] Rename lock_cpu_hotplug to get_online_cpus

Gautham R Shenoy wrote:
> > Gautham R Shenoy wrote:
> > > On Thu, Oct 18, 2007 at 03:22:21AM -0500, Nathan Lynch wrote:
> > > > Gautham R Shenoy wrote:
> > > > > > Gautham R Shenoy wrote:
> > > > > > > Replace all lock_cpu_hotplug/unlock_cpu_hotplug instances in the kernel
> > > > > > > with get_online_cpus and put_online_cpus, as these names highlight
> > > > > > > the refcount semantics of these operations.
> > > > > >
> > > > > > Something other than "get_online_cpus", please? lock_cpu_hotplug()
> > > > > > protects cpu_present_map as well as cpu_online_map. For example, some
> > > > > > of the powerpc code modified in this patch is made a bit less clear
> > > > > > because it is manipulating cpu_present_map, not cpu_online_map.
> > > > >
> > > > > A quick look at the code, and I am wondering why lock_cpu_hotplug() is
> > > > > used there in the first place. It doesn't look like we require any
> > > > > protection against cpus coming up/going down in the code below,
> > > > > since the cpu-hotplug operation doesn't affect the cpu_present_map.
> > > >
> > > > The locking is necessary. Changes to cpu_online_map and
> > > > cpu_present_map must be serialized; otherwise you could end up trying
> > > > to online a cpu as it is being removed (i.e. cleared from
> > > > cpu_present_map). Online operations must check that a cpu is present
> > > > before bringing it up (kernel/cpu.c):
> > >
> > > Fair enough!
> > >
> > > But we are not protecting the cpu_present_map here using
> > > lock_cpu_hotplug(), now are we?
> >
> > Yes, we are. In addition to the above, updates to cpu_present_map
> > have to be serialized. pseries_add_processor can be summed up as
> > "find the first N unset bits in cpu_present_map and set them". That's
> > not an atomic operation, so some kind of mutual exclusion is needed.
> >
>
> Okay. But other than pseries_add_processor and pseries_remove_processor,
> are there any other places where we _change_ the cpu_present_map?

Other arch code e.g. ia64 changes it for add/remove also. But I fail
to see how it matters.


> I agree that we need some kind of synchronization for threads which
> read the cpu_present_map. But probably we can use a separate mutex
> for that.

That would be needless complexity.


> > The naming is a problem IMO for two reasons:
> >
> > - lock_cpu_hotplug() protects cpu_present_map as well as
> > cpu_online_map (sigh, I see that Documentation/cpu-hotplug.txt
> > disagrees with me, but my statement holds for powerpc, at least).
> >
> > - get_online_cpus() implies reference count semantics (as stated in
> > the changelog) but AFAICT it really has a reference count
> > implementation with read-write locking semantics.
> >
> > Hmm, I think there's another problem here. With your changes, code
> > which relies on the mutual exclusion behavior of lock_cpu_hotplug()
> > (such as pseries_add/remove_processor) will now be able to run
> > concurrently. Probably those functions should use
> > cpu_hotplug_begin/end instead.
>
> One of the primary reasons to move away from the mutex semantics for
> cpu-hotplug protection was that there were a lot of places where
> lock_cpu_hotplug() was used for protecting local data structures too,
> when they had nothing to do with cpu-hotplug as such, and it resulted
> in a whole mess. It took people quite some time to sort things out
> with the cpufreq subsystem.

cpu_present_map isn't a "local data structure" any more than
cpu_online_map, and it is quite relevant to cpu hotplug. We have to
maintain the invariant that the set of cpus online is a subset of cpus
present.
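
In code terms, the invariant is simply (illustrative):

	/* onlining checks cpu_present(), removal clears present bits:
	 * both sides must see a consistent pair of maps */
	BUG_ON(!cpus_subset(cpu_online_map, cpu_present_map));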

> Probably it would be a lot cleaner if we use get_online_cpus() for
> protection against cpu-hotplug and use specific mutexes for serializing
> accesses to local data structures. Thoughts?

I don't feel like I'm getting through here. Let me restate.

If I'm reading them correctly, these patches are changing the behavior
of lock_cpu_hotplug() from mutex-style locking to a kind of read-write
locking. I think that's fine, but the naming of the new API poorly
reflects its real behavior. Conversion of lock_cpu_hotplug() users
should be done with care. Most of them - those that need one of the
cpu maps to remain unchanged during a critical section - can be
considered readers. But a few (such as pseries_add_processor() and
pseries_remove_processor()) are writers, because they modify one of
the maps.

So, why not:

get_online_cpus -> cpumaps_read_lock
put_online_cpus -> cpumaps_read_unlock
cpu_hotplug_begin -> cpumaps_write_lock
cpu_hotplug_end -> cpumaps_write_unlock

Or something similar?

Subject: Re: [RFC PATCH 2/4] Rename lock_cpu_hotplug to get_online_cpus

> >
> > Okay. But other than pseries_add_processor and pseries_remove_processor,
> > are there any other places where we _change_ the cpu_present_map?
>
> Other arch code e.g. ia64 changes it for add/remove also. But I fail
> to see how it matters.
>
>
> > I agree that we need some kind of synchronization for threads which
> > read the cpu_present_map. But probably we can use a separate mutex
> > for that.
>
> That would be needless complexity.
>
>
> > > The naming is a problem IMO for two reasons:
> > >
> > > - lock_cpu_hotplug() protects cpu_present_map as well as
> > > cpu_online_map (sigh, I see that Documentation/cpu-hotplug.txt
> > > disagrees with me, but my statement holds for powerpc, at least).
> > >
> > > - get_online_cpus() implies reference count semantics (as stated in
> > > the changelog) but AFAICT it really has a reference count
> > > implementation with read-write locking semantics.
> > >
> > > Hmm, I think there's another problem here. With your changes, code
> > > which relies on the mutual exclusion behavior of lock_cpu_hotplug()
> > > (such as pseries_add/remove_processor) will now be able to run
> > > concurrently. Probably those functions should use
> > > cpu_hotplug_begin/end instead.
> >
> > One of the primary reasons to move away from the mutex semantics for
> > cpu-hotplug protection was that there were a lot of places where
> > lock_cpu_hotplug() was used for protecting local data structures too,
> > when they had nothing to do with cpu-hotplug as such, and it resulted
> > in a whole mess. It took people quite some time to sort things out
> > with the cpufreq subsystem.
>
> cpu_present_map isn't a "local data structure" any more than
> cpu_online_map, and it is quite relevant to cpu hotplug. We have to
> maintain the invariant that the set of cpus online is a subset of cpus
> present.
>
> > Probably it would be a lot cleaner if we use get_online_cpus() for
> > protection against cpu-hotplug and use specific mutexes for serializing
> > accesses to local data structures. Thoughts?
>
> I don't feel like I'm getting through here. Let me restate.
>
> If I'm reading them correctly, these patches are changing the behavior
> of lock_cpu_hotplug() from mutex-style locking to a kind of read-write
> locking. I think that's fine, but the naming of the new API poorly
> reflects its real behavior. Conversion of lock_cpu_hotplug() users
> should be done with care. Most of them - those that need one of the
> cpu maps to remain unchanged during a critical section - can be
> considered readers. But a few (such as pseries_add_processor() and
> pseries_remove_processor()) are writers, because they modify one of
> the maps.
>
> So, why not:
>
> get_online_cpus -> cpumaps_read_lock
> put_online_cpus -> cpumaps_read_unlock
> cpu_hotplug_begin -> cpumaps_write_lock
> cpu_hotplug_end -> cpumaps_write_unlock
>
> Or something similar?

Hi Nathan,
I totally see your point and agree with it.

However, though the new implementation appears to have
read-write-semaphore behaviour, it differs in the
following aspects:

a) This implementation allows natural recursion of reads, just
as in the case of preempt_count.
This is not possible with a normal read-write
lock, because if you have something like the following on
a timeline,
R1 --> R2 --> W3 --> R1, where
Ri is a read by thread i and
Wi is a write by thread i,
then thread 1 will deadlock: its second read is blocked behind W3
while it is still holding the semaphore from its first read.

b) Regular reader-writer semaphores are fair: if a new reader
arrives when a writer is already waiting, the new reader waits until
the writer in front of it finishes. But in the new implementation,
a new reader will proceed as long as the refcount is non-zero, and the
writer will get to run only when the number of readers is 0. However,
readers which arrive while the writer is executing will block until the
writer is done.

Because of these properties, the implementation is more similar to
refcounting, except that it allows new readers to
sleep during an ongoing write.
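
For instance, with the refcount model the following nesting is safe
even if a writer arrives between the two gets (illustrative helper
functions):

	static void nested_reader(void)
	{
		get_online_cpus();	/* recursion: just refcount++ */
		/* ... use cpu_online_map ... */
		put_online_cpus();
	}

	static void reader(void)
	{
		get_online_cpus();	/* refcount 0 -> 1 */
		nested_reader();	/* does not block behind a waiting
					   writer, unlike down_read() */
		put_online_cpus();	/* last put lets the writer in */
	}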

Like you pointed out, pseries_add_processor and the other places
where the cpu_present_map is changed at runtime require write
protection. How about making the cpu_hotplug_begin/cpu_hotplug_end
support available outside cpu.c, appropriately renamed?

Personally I would propose the following names to reflect the behaviour,
but if the community is okay with the word "lock" in the APIs I don't
have any problem renaming them to what you've suggested.

/*
 * APIs which ensure that the cpu_present_map and the cpu_online_map
 * do not change while operating in a critical section.
 */

get_cpu_maps();
put_cpu_maps();

/*
 * APIs to serialize updates to the cpu_present_map/cpu_online_map.
 */

cpu_map_update_begin();
cpu_map_update_done();
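
Usage would then look like this (illustrative; do_something() is just a
stand-in):

	/* reader: neither map can change in this section */
	get_cpu_maps();
	for_each_online_cpu(cpu)
		do_something(cpu);
	put_cpu_maps();

	/* writer, e.g. pseries_add_processor(): fully serialized */
	cpu_map_update_begin();
	cpu_set(cpu, cpu_present_map);
	cpu_map_update_done();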

Thanks and Regards
gautham.
--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"

Subject: Re: [RFC PATCH 3/4] Replace per-subsystem mutexes with get_online_cpus

On Sun, Oct 21, 2007 at 03:39:17PM +0400, Oleg Nesterov wrote:
> On 10/16, Gautham R Shenoy wrote:
> >
> > This patch converts the known per-subsystem cpu_hotplug mutexes to
> > get_online_cpus put_online_cpus.
> > It also eliminates the CPU_LOCK_ACQUIRE and CPU_LOCK_RELEASE hotplug
> > notification events.
>
> Personally, I like the changes in workqueue.c very much, a couple of
> minor nits below.
>
> > --- linux-2.6.23.orig/kernel/workqueue.c
> > +++ linux-2.6.23/kernel/workqueue.c
> > @@ -592,8 +592,6 @@ EXPORT_SYMBOL(schedule_delayed_work_on);
> > * Returns zero on success.
> > * Returns -ve errno on failure.
> > *
> > - * Appears to be racy against CPU hotplug.
> > - *
>
> see below,
>
> > * schedule_on_each_cpu() is very slow.
> > */
> > int schedule_on_each_cpu(work_func_t func)
> > @@ -605,7 +603,7 @@ int schedule_on_each_cpu(work_func_t fun
> > if (!works)
> > return -ENOMEM;
> >
> > - preempt_disable(); /* CPU hotplug */
> > + get_online_cpus(); /* CPU hotplug */
> > for_each_online_cpu(cpu) {
> > struct work_struct *work = per_cpu_ptr(works, cpu);
> >
> > @@ -613,7 +611,7 @@ int schedule_on_each_cpu(work_func_t fun
> > set_bit(WORK_STRUCT_PENDING, work_data_bits(work));
> > __queue_work(per_cpu_ptr(keventd_wq->cpu_wq, cpu), work);
> > }
> > - preempt_enable();
> > + put_online_cpus();
> > flush_workqueue(keventd_wq);
>
> Still racy. To kill the race, please move flush_workqueue() up, before
> put_online_cpus().
>
> > @@ -809,6 +809,7 @@ void destroy_workqueue(struct workqueue_
> > struct cpu_workqueue_struct *cwq;
> > int cpu;
> >
> > + get_online_cpus();
> > mutex_lock(&workqueue_mutex);
> > list_del(&wq->list);
> > mutex_unlock(&workqueue_mutex);
> > @@ -817,6 +818,7 @@ void destroy_workqueue(struct workqueue_
> > cwq = per_cpu_ptr(wq->cpu_wq, cpu);
> > cleanup_workqueue_thread(cwq, cpu);
> > }
> > + put_online_cpus();
>
> Correct, but I'd suggest to do put_online_cpus() earlier, right after
> mutex_unlock(&workqueue_mutex).
>
> > @@ -830,22 +832,17 @@ static int __devinit workqueue_cpu_callb
> > unsigned int cpu = (unsigned long)hcpu;
> > struct cpu_workqueue_struct *cwq;
> > struct workqueue_struct *wq;
> > + int ret = NOTIFY_OK;
> >
> > action &= ~CPU_TASKS_FROZEN;
> >
> > switch (action) {
> > - case CPU_LOCK_ACQUIRE:
> > - mutex_lock(&workqueue_mutex);
> > - return NOTIFY_OK;
> > -
> > - case CPU_LOCK_RELEASE:
> > - mutex_unlock(&workqueue_mutex);
> > - return NOTIFY_OK;
> >
>
> please remove this empty line
>
> > case CPU_UP_PREPARE:
> > cpu_set(cpu, cpu_populated_map);
> > }
> >
> > + mutex_lock(&workqueue_mutex);
>
> We don't need workqueue_mutex here. With your patch workqueue_mutex protects
> list_head, nothing more. But all other callers (create/destroy) should take
> get_online_cpus() anyway. This means that we can convert workqueue_mutex to
> spinlock_t.

Thanks for the review!
Will code these changes up in the next version and post them
sometime soon.

>
> Oleg.
>

Thanks and Regards
gautham.
--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"