The following patch series extends CPU isolation support. Yes, most people want to virtualize
CPUs these days and I want to isolate them :).
The primary idea here is to be able to use some CPU cores as dedicated engines for running
user-space code with minimal kernel overhead/intervention, think of it as an SPE in the
Cell processor.
We've had scheduler support for CPU isolation ever since the O(1) scheduler went in.
I'd like to extend it further to avoid kernel activity on those CPUs as much as possible.
In fact that's the primary distinction that I'm making between, say, "CPU sets" and
"CPU isolation". "CPU sets" let you manage user-space load while "CPU isolation" provides
a way to isolate a CPU as much as possible (including kernel activities).
I'm personally using this for hard realtime purposes. With CPU isolation it's very easy to
achieve single digit usec worst case and around 200 nsec average response times on off-the-shelf
multi-processor/core systems under extreme system load. I'm working with legal folks on releasing
a hard RT user-space framework for that.
I can also see other applications, like simulators and such, that can benefit from this.
I've been maintaining this stuff since around 2.6.18 and it's been running in a production
environment for a couple of years now. It's been tested on all kinds of machines, from NUMA
boxes like HP xw9300/9400 to tiny uTCA boards like Mercury AXA110.
The messiest part used to be the SLAB garbage collector changes. With the new SLUB all that mess
goes away (i.e. no changes are necessary). Also CFS seems to handle CPU hotplug much better than O(1)
did (i.e. domains are recomputed dynamically), so isolation can be done at any time (via sysfs).
So this seems like a good time to merge.
Anyway, the patchset consists of five patches. The first three are very simple and non-controversial.
They simply make "CPU isolation" a configurable feature, export cpu_isolated_map and provide
some helper functions to access it (just like cpu_online() and friends).
The last two patches add support for isolating CPUs from running workqueues and stop machine.
More details in the individual patch descriptions.
Ideally I'd like all of this to go in during this merge window. If people think it's acceptable,
Linus or Andrew (or whoever is more appropriate, maybe Ingo) can pull this patch set from
git://git.kernel.org/pub/scm/linux/kernel/git/maxk/cpuisol-2.6.git
That tree is rebased against the latest (as of yesterday) Linus tree.
Thanx
Max
arch/x86/Kconfig | 1
arch/x86/kernel/genapic_flat_64.c | 5 ++--
drivers/base/cpu.c | 47 ++++++++++++++++++++++++++++++++++++++
include/linux/cpumask.h | 3 ++
kernel/Kconfig.cpuisol | 25 +++++++++++++++++++-
kernel/sched.c | 13 ++++++----
kernel/stop_machine.c | 3 --
kernel/workqueue.c | 31 ++++++++++++++++++-------
8 files changed, 110 insertions(+), 18 deletions(-)
From: Max Krasnyansky <[email protected]>
Most people would expect isolated CPUs to not get any
IRQs by default. This happens naturally if a CPU is brought
off-line, marked isolated and then brought back online.
Signed-off-by: Max Krasnyansky <[email protected]>
---
arch/x86/kernel/genapic_flat_64.c | 4 +++-
1 files changed, 3 insertions(+), 1 deletions(-)
diff --git a/arch/x86/kernel/genapic_flat_64.c b/arch/x86/kernel/genapic_flat_64.c
index 07352b7..e02e58c 100644
--- a/arch/x86/kernel/genapic_flat_64.c
+++ b/arch/x86/kernel/genapic_flat_64.c
@@ -21,7 +21,9 @@
static cpumask_t flat_target_cpus(void)
{
- return cpu_online_map;
+ cpumask_t target;
+ cpus_andnot(target, cpu_online_map, cpu_isolated_map);
+ return target;
}
static cpumask_t flat_vector_allocation_domain(int cpu)
--
1.5.3.7
From: Max Krasnyansky <[email protected]>
This patch is trying to address the same use case I explained in the previous workqueue
isolation patch, which is when a high priority realtime (FIFO, RR) user-space thread
is using 100% CPU for extended periods of time. In that case stopmachine threads do not
get a chance to run and the entire machine essentially hangs because other CPUs are waiting
for all the stopmachine threads to run.
This use case is perfectly valid if one is using a CPU as a dedicated engine
(crunching numbers, hard realtime, etc). Think of it as an SPE in the Cell processor,
which is what CPU isolation enables in the first place.
Stopmachine is particularly bad when it comes to latencies. It's currently used for
module insertion and removal only. Given that threads running on the isolated CPUs
are unlikely to use kernel services anyway, I'd consider this patch pretty safe.
The patch adds no overhead or side effects when CPU isolation is disabled.
Signed-off-by: Max Krasnyansky <[email protected]>
---
kernel/stop_machine.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index 51b5ee5..0f4cc3f 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -99,7 +99,7 @@ static int stop_machine(void)
stopmachine_state = STOPMACHINE_WAIT;
for_each_online_cpu(i) {
- if (i == raw_smp_processor_id())
+ if (i == raw_smp_processor_id() || cpu_isolated(i))
continue;
ret = kernel_thread(stopmachine, (void *)(long)i,CLONE_KERNEL);
if (ret < 0)
--
1.5.3.7
From: Max Krasnyansky <[email protected]>
This simply adds a couple of new kconfig options for
configuring CPU isolation features.
Signed-off-by: Max Krasnyansky <[email protected]>
---
arch/x86/Kconfig | 1 +
kernel/Kconfig.cpuisol | 24 ++++++++++++++++++++++++
2 files changed, 25 insertions(+), 0 deletions(-)
create mode 100644 kernel/Kconfig.cpuisol
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 80b7ba4..b8f986e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -495,6 +495,7 @@ config SCHED_MC
increased overhead in some places. If unsure say N here.
source "kernel/Kconfig.preempt"
+source "kernel/Kconfig.cpuisol"
config X86_UP_APIC
bool "Local APIC support on uniprocessors"
diff --git a/kernel/Kconfig.cpuisol b/kernel/Kconfig.cpuisol
new file mode 100644
index 0000000..6e099a4
--- /dev/null
+++ b/kernel/Kconfig.cpuisol
@@ -0,0 +1,24 @@
+config CPUISOL
+ depends on SMP
+ bool "CPU isolation"
+ help
+ This option enables support for CPU isolation.
+ If enabled the kernel will try to avoid kernel activity on the isolated CPUs.
+ By default user-space threads are not scheduled on the isolated CPUs unless
+ they explicitly request it (via sched_ and pthread_ affinity calls). Isolated
+ CPUs are not subject to the scheduler load-balancing algorithms.
+
+ CPUs can be marked as isolated using 'cpuisol=' command line option or by
+ writing '1' into /sys/devices/system/cpu/cpuN/isolated.
+
+ This feature is useful for hard realtime and high performance applications.
+ If unsure say 'N'.
+
+config CPUISOL_WORKQUEUE
+ bool "Do not schedule workqueues on isolated CPUs (EXPERIMENTAL)"
+ depends on CPUISOL && EXPERIMENTAL
+ help
+ If this option is enabled the kernel will not schedule workqueues on the
+ isolated CPUs.
+ Please note that at this point this feature is experimental. It breaks
+ certain things like OProfile that heavily rely on per-cpu workqueues.
--
1.5.3.7
From: Max Krasnyansky <[email protected]>
Here we're just exporting CPU isolation bitmap so that it can
be used outside the scheduler code.
Helper functions like cpu_isolated() are provided for easy access.
This is very similar to cpu_online() and friends.
The patch also exports the 'isolated' bit via sysfs in very much the
same way the 'online' bit is exposed today.
CPUs can be isolated either via the command line isolcpu= option or
by doing something like this:
echo 0 > /sys/devices/system/cpu/cpu1/online
echo 1 > /sys/devices/system/cpu/cpu1/isolated
echo 1 > /sys/devices/system/cpu/cpu1/online
Signed-off-by: Max Krasnyansky <[email protected]>
---
drivers/base/cpu.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
include/linux/cpumask.h | 3 +++
kernel/sched.c | 12 ++++++++----
3 files changed, 58 insertions(+), 4 deletions(-)
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index c5885f5..f59c719 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -55,10 +55,57 @@ static ssize_t store_online(struct sys_device *dev, const char *buf,
}
static SYSDEV_ATTR(online, 0644, show_online, store_online);
+#ifdef CONFIG_CPUISOL
+/*
+ * The 'isolated' bit can only be changed while the CPU is offline,
+ * i.e. to dynamically isolate a CPU it needs to be brought down first.
+ * In other words the sequence should be
+ * echo 0 > /sys/devices/system/cpu/cpuN/online
+ * echo 1 > /sys/devices/system/cpu/cpuN/isolated
+ */
+static ssize_t show_isol(struct sys_device *dev, char *buf)
+{
+ struct cpu *cpu = container_of(dev, struct cpu, sysdev);
+
+ return sprintf(buf, "%u\n", !!cpu_isolated(cpu->sysdev.id));
+}
+
+static ssize_t store_isol(struct sys_device *dev, const char *buf,
+ size_t count)
+{
+ struct cpu *cpu = container_of(dev, struct cpu, sysdev);
+ ssize_t ret = 0;
+
+ if (cpu_online(cpu->sysdev.id))
+ return -EBUSY;
+
+ switch (buf[0]) {
+ case '0':
+ cpu_clear(cpu->sysdev.id, cpu_isolated_map);
+ break;
+ case '1':
+ cpu_set(cpu->sysdev.id, cpu_isolated_map);
+ break;
+ default:
+ ret = -EINVAL;
+ }
+
+ if (ret >= 0)
+ ret = count;
+ return ret;
+}
+static SYSDEV_ATTR(isolated, 0600, show_isol, store_isol);
+#endif /* CONFIG_CPUISOL */
+
static void __devinit register_cpu_control(struct cpu *cpu)
{
sysdev_create_file(&cpu->sysdev, &attr_online);
+
+#ifdef CONFIG_CPUISOL
+ sysdev_create_file(&cpu->sysdev, &attr_isolated);
+#endif
}
+
void unregister_cpu(struct cpu *cpu)
{
int logical_cpu = cpu->sysdev.id;
diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 85bd790..84d1561 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -380,6 +380,7 @@ static inline void __cpus_remap(cpumask_t *dstp, const cpumask_t *srcp,
extern cpumask_t cpu_possible_map;
extern cpumask_t cpu_online_map;
extern cpumask_t cpu_present_map;
+extern cpumask_t cpu_isolated_map;
#if NR_CPUS > 1
#define num_online_cpus() cpus_weight(cpu_online_map)
@@ -388,6 +389,7 @@ extern cpumask_t cpu_present_map;
#define cpu_online(cpu) cpu_isset((cpu), cpu_online_map)
#define cpu_possible(cpu) cpu_isset((cpu), cpu_possible_map)
#define cpu_present(cpu) cpu_isset((cpu), cpu_present_map)
+#define cpu_isolated(cpu) cpu_isset((cpu), cpu_isolated_map)
#else
#define num_online_cpus() 1
#define num_possible_cpus() 1
@@ -395,6 +397,7 @@ extern cpumask_t cpu_present_map;
#define cpu_online(cpu) ((cpu) == 0)
#define cpu_possible(cpu) ((cpu) == 0)
#define cpu_present(cpu) ((cpu) == 0)
+#define cpu_isolated(cpu) (0)
#endif
#define cpu_is_offline(cpu) unlikely(!cpu_online(cpu))
diff --git a/kernel/sched.c b/kernel/sched.c
index 524285e..2fc942e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4815,10 +4815,17 @@ asmlinkage long sys_sched_setaffinity(pid_t pid, unsigned int len,
* as new cpu's are detected in the system via any platform specific
* method, such as ACPI for e.g.
*/
-
cpumask_t cpu_present_map __read_mostly;
EXPORT_SYMBOL(cpu_present_map);
+/*
+ * Represents isolated CPUs.
+ * These CPUs have isolated scheduling domains. In general any
+ * kernel activity should be avoided as much as possible on them.
+ */
+cpumask_t cpu_isolated_map __read_mostly = CPU_MASK_NONE;
+EXPORT_SYMBOL(cpu_isolated_map);
+
#ifndef CONFIG_SMP
cpumask_t cpu_online_map __read_mostly = CPU_MASK_ALL;
EXPORT_SYMBOL(cpu_online_map);
@@ -6186,9 +6193,6 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
rcu_assign_pointer(rq->sd, sd);
}
-/* cpus with isolated domains */
-static cpumask_t cpu_isolated_map = CPU_MASK_NONE;
-
/* Setup the mask of cpus configured for isolated domains */
static int __init isolated_cpu_setup(char *str)
{
--
1.5.3.7
From: Max Krasnyansky <[email protected]>
I'm sure this one is going to be controversial for a lot of folks here.
So let me explain :).
What this patch is trying to address is the case when a high priority
realtime (FIFO, RR) user-space thread is using 100% CPU for extended periods
of time. In that case kernel workqueue threads do not get a chance to run and the
entire machine essentially hangs because other CPUs are waiting for scheduled
workqueues to flush.
This use case is perfectly valid if one is using a CPU as a dedicated engine
(crunching numbers, hard realtime, etc). Think of it as an SPE in the Cell processor,
which is what CPU isolation enables in the first place.
Most kernel subsystems do not rely on the per-CPU workqueues. In fact we already
have support for single threaded workqueues; this patch just makes it automatic.
Some subsystems, namely OProfile, do rely on per-CPU workqueues and do not work when
this feature is enabled. It does not result in crashes or anything; OProfile is just
unable to collect stats from isolated CPUs. Hence this feature is marked as
experimental.
There is zero overhead if CPU workqueue isolation is disabled.
Better ideas/suggestions on how to handle the use case described above are of course
welcome.
Signed-off-by: Max Krasnyansky <[email protected]>
---
kernel/workqueue.c | 30 +++++++++++++++++++++++-------
1 files changed, 23 insertions(+), 7 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 52db48e..ed2f09b 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -35,6 +35,16 @@
#include <linux/lockdep.h>
/*
+ * Stub out cpu_isolated() if isolated CPUs are allowed to
+ * run workqueues.
+ */
+#ifdef CONFIG_CPUISOL_WORKQUEUE
+#define cpu_unusable(cpu) cpu_isolated(cpu)
+#else
+#define cpu_unusable(cpu) (0)
+#endif
+
+/*
* The per-CPU workqueue (if single thread, we always use the first
* possible cpu).
*/
@@ -97,7 +107,7 @@ static const cpumask_t *wq_cpu_map(struct workqueue_struct *wq)
static
struct cpu_workqueue_struct *wq_per_cpu(struct workqueue_struct *wq, int cpu)
{
- if (unlikely(is_single_threaded(wq)))
+ if (unlikely(is_single_threaded(wq)) || cpu_unusable(cpu))
cpu = singlethread_cpu;
return per_cpu_ptr(wq->cpu_wq, cpu);
}
@@ -229,9 +239,11 @@ int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
timer->data = (unsigned long)dwork;
timer->function = delayed_work_timer_fn;
- if (unlikely(cpu >= 0))
+ if (unlikely(cpu >= 0)) {
+ if (cpu_unusable(cpu))
+ cpu = singlethread_cpu;
add_timer_on(timer, cpu);
- else
+ } else
add_timer(timer);
ret = 1;
}
@@ -605,7 +617,8 @@ int schedule_on_each_cpu(work_func_t func)
get_online_cpus();
for_each_online_cpu(cpu) {
struct work_struct *work = per_cpu_ptr(works, cpu);
-
+ if (cpu_unusable(cpu))
+ continue;
INIT_WORK(work, func);
set_bit(WORK_STRUCT_PENDING, work_data_bits(work));
__queue_work(per_cpu_ptr(keventd_wq->cpu_wq, cpu), work);
@@ -754,7 +767,7 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
for_each_possible_cpu(cpu) {
cwq = init_cpu_workqueue(wq, cpu);
- if (err || !cpu_online(cpu))
+ if (err || !cpu_online(cpu) || cpu_unusable(cpu))
continue;
err = create_workqueue_thread(cwq, cpu);
start_workqueue_thread(cwq, cpu);
@@ -833,8 +846,11 @@ static int __devinit workqueue_cpu_callback(struct notifier_block *nfb,
struct cpu_workqueue_struct *cwq;
struct workqueue_struct *wq;
- action &= ~CPU_TASKS_FROZEN;
+ if (cpu_unusable(cpu))
+ return NOTIFY_OK;
+ action &= ~CPU_TASKS_FROZEN;
+
switch (action) {
case CPU_UP_PREPARE:
@@ -869,7 +885,7 @@ static int __devinit workqueue_cpu_callback(struct notifier_block *nfb,
void __init init_workqueues(void)
{
- cpu_populated_map = cpu_online_map;
+ cpus_andnot(cpu_populated_map, cpu_online_map, cpu_isolated_map);
singlethread_cpu = first_cpu(cpu_possible_map);
cpu_singlethread_map = cpumask_of_cpu(singlethread_cpu);
hotcpu_notifier(workqueue_cpu_callback, 0);
--
1.5.3.7
[ You really ought to CC people :-) ]
On Sun, 2008-01-27 at 20:09 -0800, [email protected] wrote:
> Following patch series extends CPU isolation support. Yes, most people want to virtuallize
> CPUs these days and I want to isolate them :).
> The primary idea here is to be able to use some CPU cores as dedicated engines for running
> user-space code with minimal kernel overhead/intervention, think of it as an SPE in the
> Cell processor.
>
> We've had scheduler support for CPU isolation ever since O(1) scheduler went it.
> I'd like to extend it further to avoid kernel activity on those CPUs as much as possible.
> In fact that the primary distinction that I'm making between say "CPU sets" and
> "CPU isolation". "CPU sets" let you manage user-space load while "CPU isolation" provides
> a way to isolate a CPU as much as possible (including kernel activities).
Ok, so you're aware of CPU sets, miss a feature, but instead of
extending it to cover your needs you build something new entirely?
> I'm personally using this for hard realtime purposes. With CPU isolation it's very easy to
> achieve single digit usec worst case and around 200 nsec average response times on off-the-shelf
> multi- processor/core systems under exteme system load. I'm working with legal folks on releasing
> hard RT user-space framework for that.
> I can also see other application like simulators and stuff that can benefit from this.
have you been using just this, or in combination with the -rt effort?
Thanks for the CC, Peter.
Ingo - see question at end of message.
Max wrote:
> We've had scheduler support for CPU isolation ever since O(1) scheduler went it.
> I'd like to extend it further to avoid kernel activity on those CPUs as much as possible.
I recently added the per-cpuset flag 'sched_load_balance' for some
other realtime folks, so that they can disable the kernel scheduler
load balancing on isolated CPUs. It essentially allows for dynamic
control of which CPUs are isolated by the scheduler, using the cpuset
hierarchy, rather than enhancing the 'isolated_cpus' mask. That
'isolated_cpus' mask remained a minimal kernel boottime parameter.
I believe this went to Linus's tree about Oct 2007.
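(A rough sketch of that interface, assuming the 2.6.24-era cpuset filesystem layout; the
mount point, cpuset name and CPU/node numbers below are only examples:

    mkdir -p /dev/cpuset
    mount -t cpuset cpuset /dev/cpuset
    echo 0 > /dev/cpuset/sched_load_balance      # stop balancing across the top cpuset
    mkdir /dev/cpuset/rt
    echo 3 > /dev/cpuset/rt/cpus                 # dedicate CPU 3 to the realtime work
    echo 0 > /dev/cpuset/rt/mems
    echo 0 > /dev/cpuset/rt/sched_load_balance   # don't balance within it either
    echo $$ > /dev/cpuset/rt/tasks               # move this shell into that cpuset

Provided no other cpuset with load balancing enabled covers it, CPU 3 then ends up in no
sched domain.)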
It looks like you have three additional tweaks for realtime in this
patch set, with your patches:
[PATCH] [CPUISOL] Do not route IRQs to the CPUs isolated at boot
[PATCH] [CPUISOL] Support for workqueue isolation
[PATCH] [CPUISOL] Isolated CPUs should be ignored by the "stop machine"
It would be interesting to see a patchset with the above three realtime
tweaks, layered on this new cpuset 'sched_load_balance' apparatus, rather
than layered on changes to make 'isolated_cpus' more dynamic. Some of us
run realtime and cpuset-intensive loads on the same system, so like to
have those two capabilities co-operate with each other.
Ingo - what's your sense of the value of the above three realtime tweaks
(the last three patches in Max's patch set)?
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214
On Mon, Jan 28, 2008 at 08:59:10AM -0600, Paul Jackson wrote:
> Thanks for the CC, Peter.
Thanks from me too.
> Max wrote:
> > We've had scheduler support for CPU isolation ever since O(1) scheduler went it.
> > I'd like to extend it further to avoid kernel activity on those CPUs as much as possible.
>
> I recently added the per-cpuset flag 'sched_load_balance' for some
> other realtime folks, so that they can disable the kernel scheduler
> load balancing on isolated CPUs. It essentially allows for dynamic
> control of which CPUs are isolated by the scheduler, using the cpuset
> hierarchy, rather than enhancing the 'isolated_cpus' mask. That
> 'isolated_cpus' mask remained a minimal kernel boottime parameter.
> I believe this went to Linus's tree about Oct 2007.
>
> It looks like you have three additional tweaks for realtime in this
> patch set, with your patches:
>
> [PATCH] [CPUISOL] Do not route IRQs to the CPUs isolated at boot
I didn't know we still routed IRQs to isolated CPUs. I guess I need to
look deeper into the code on this one. But I agree that isolated CPUs
should not have IRQs routed to them.
> [PATCH] [CPUISOL] Support for workqueue isolation
The thing about workqueues is that they should only be woken on a CPU if
something on that CPU accessed them. IOW, the workqueue on a CPU handles
work that was called by something on that CPU. Which means that
something the high prio task did triggered a workqueue to do some work.
But this can also be triggered by interrupts, so by keeping interrupts
off the CPU no workqueue should be activated.
> [PATCH] [CPUISOL] Isolated CPUs should be ignored by the "stop machine"
This I find very dangerous. We are making an assumption that tasks on an
isolated CPU won't be doing things that stopmachine requires. What stops
a task on an isolated CPU from calling something into the kernel that
stop_machine requires to halt?
-- Steve
>
> It would be interesting to see a patchset with the above three realtime
> tweaks, layered on this new cpuset 'sched_load_balance' apparatus, rather
> than layered on changes to make 'isolated_cpus' more dynamic. Some of us
> run realtime and cpuset-intensive loads on the same system, so like to
> have those two capabilities co-operate with each other.
>
> Ingo - what's your sense of the value of the above three realtime tweaks
> (the last three patches in Max's patch set)?
>
On Mon, 2008-01-28 at 11:34 -0500, Steven Rostedt wrote:
> On Mon, Jan 28, 2008 at 08:59:10AM -0600, Paul Jackson wrote:
> > Thanks for the CC, Peter.
>
> Thanks from me too.
>
> > Max wrote:
> > > We've had scheduler support for CPU isolation ever since O(1) scheduler went it.
> > > I'd like to extend it further to avoid kernel activity on those CPUs as much as possible.
> >
> > I recently added the per-cpuset flag 'sched_load_balance' for some
> > other realtime folks, so that they can disable the kernel scheduler
> > load balancing on isolated CPUs. It essentially allows for dynamic
> > control of which CPUs are isolated by the scheduler, using the cpuset
> > hierarchy, rather than enhancing the 'isolated_cpus' mask. That
> > 'isolated_cpus' mask remained a minimal kernel boottime parameter.
> > I believe this went to Linus's tree about Oct 2007.
> >
> > It looks like you have three additional tweaks for realtime in this
> > patch set, with your patches:
> >
> > [PATCH] [CPUISOL] Do not route IRQs to the CPUs isolated at boot
>
> I didn't know we still routed IRQs to isolated CPUs. I guess I need to
> look deeper into the code on this one. But I agree that isolated CPUs
> should not have IRQs routed to them.
While I agree with this in principle, I'm not sure flat out denying all
IRQs to these cpus is a good option. What about the case where we want
to service just this one specific IRQ on this CPU and no others?
Can't this be done by userspace irq routing as used by irqbalanced?
> > [PATCH] [CPUISOL] Support for workqueue isolation
>
> The thing about workqueues is that they should only be woken on a CPU if
> something on that CPU accessed them. IOW, the workqueue on a CPU handles
> work that was called by something on that CPU. Which means that
> something that high prio task did triggered a workqueue to do some work.
> But this can also be triggered by interrupts, so by keeping interrupts
> off the CPU no workqueue should be activated.
Quite so, if nobody uses it, there is no harm in having them around. If
they are used, it's by someone already allowed on the cpu.
> > [PATCH] [CPUISOL] Isolated CPUs should be ignored by the "stop machine"
>
> This I find very dangerous. We are making an assumption that tasks on an
> isolated CPU wont be doing things that stopmachine requires. What stops
> a task on an isolated CPU from calling something into the kernel that
> stop_machine requires to halt?
Very dangerous indeed!
Hi Peter,
Peter Zijlstra wrote:
> [ You really ought to CC people :-) ]
I was not sure who, though :)
Do we have a mailing list for scheduler development, btw?
Or is it just the folks that you included in CC?
Some of the latest scheduler patches break things that I'm doing and I'd like to make
them configurable (RT watchdog, etc).
> On Sun, 2008-01-27 at 20:09 -0800, [email protected] wrote:
>> Following patch series extends CPU isolation support. Yes, most people want to virtuallize
>> CPUs these days and I want to isolate them :).
>> The primary idea here is to be able to use some CPU cores as dedicated engines for running
>> user-space code with minimal kernel overhead/intervention, think of it as an SPE in the
>> Cell processor.
>>
>> We've had scheduler support for CPU isolation ever since O(1) scheduler went it.
>> I'd like to extend it further to avoid kernel activity on those CPUs as much as possible.
>> In fact that the primary distinction that I'm making between say "CPU sets" and
>> "CPU isolation". "CPU sets" let you manage user-space load while "CPU isolation" provides
>> a way to isolate a CPU as much as possible (including kernel activities).
>
> Ok, so you're aware of CPU sets, miss a feature, but instead of
> extending it to cover your needs you build something new entirely?
It's not really new. The CPU isolation bits just have not been exported before, that's all.
Also "CPU sets" seem to mostly deal with the scheduler domains. I'll reply to Paul's
proposal to use that instead.
>> I'm personally using this for hard realtime purposes. With CPU isolation it's very easy to
>> achieve single digit usec worst case and around 200 nsec average response times on off-the-shelf
>> multi- processor/core systems under exteme system load. I'm working with legal folks on releasing
>> hard RT user-space framework for that.
>> I can also see other application like simulators and stuff that can benefit from this.
>
> have you been using just this, or in combination with the -rt effort?
Just these patches. The RT patches cannot achieve what I needed. Even RTAI/Xenomai can't do that.
For example I have separate tasks with hard deadlines that must be enforced in the 50 usec kind
of range and basically no idle time whatsoever. Just to give more background, it's a wireless
basestation with a SW MAC/scheduler. Another requirement is for the SW to know precise timing,
because of the SW MAC. For example there is no way we can do predictable 1-2 usec sleeps.
So I wrote a user-space engine that does all this. It requires full control of the CPU, i.e. minimal
overhead from the kernel, just IPIs for memory management, and that's basically it. When my legal
department lets me I'll do a presentation on this stuff at a Linux RT conference or something.
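(Just to make the deployment side concrete: once a CPU is isolated the engine is simply pinned
onto it with the usual affinity machinery. A rough sketch, assuming util-linux taskset/chrt and
a made-up binary name:

    # pin to isolated CPU 1 and run as SCHED_FIFO priority 50
    taskset -c 1 chrt -f 50 ./rt_engine

The same thing can of course be done from inside the program with sched_setaffinity() and
sched_setscheduler().)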
Max
Paul Jackson wrote:
> Thanks for the CC, Peter.
>
> Ingo - see question at end of message.
>
> Max wrote:
>> We've had scheduler support for CPU isolation ever since O(1) scheduler went it.
>> I'd like to extend it further to avoid kernel activity on those CPUs as much as possible.
>
> I recently added the per-cpuset flag 'sched_load_balance' for some
> other realtime folks, so that they can disable the kernel scheduler
> load balancing on isolated CPUs. It essentially allows for dynamic
> control of which CPUs are isolated by the scheduler, using the cpuset
> hierarchy, rather than enhancing the 'isolated_cpus' mask. That
> 'isolated_cpus' mask remained a minimal kernel boottime parameter.
> I believe this went to Linus's tree about Oct 2007.
>
> It looks like you have three additional tweaks for realtime in this
> patch set, with your patches:
>
> [PATCH] [CPUISOL] Do not route IRQs to the CPUs isolated at boot
> [PATCH] [CPUISOL] Support for workqueue isolation
> [PATCH] [CPUISOL] Isolated CPUs should be ignored by the "stop machine"
>
> It would be interesting to see a patchset with the above three realtime
> tweaks, layered on this new cpuset 'sched_load_balance' apparatus, rather
> than layered on changes to make 'isolated_cpus' more dynamic. Some of us
> run realtime and cpuset-intensive loads on the same system, so like to
> have those two capabilities co-operate with each other.
I'll definitely take a look. So far it seems that extending cpu_isolated_map
is a more natural way of propagating this notion to the rest of the kernel,
since it's very similar to the cpu_online_map concept and it's easy to integrate
with the code that already uses it.
Anyway, I'll take a look at the cpuset flag that you mentioned and report back.
Thanx
Max
Steven Rostedt wrote:
> On Mon, Jan 28, 2008 at 08:59:10AM -0600, Paul Jackson wrote:
>> Thanks for the CC, Peter.
>
> Thanks from me too.
>
>> Max wrote:
>>> We've had scheduler support for CPU isolation ever since O(1) scheduler went it.
>>> I'd like to extend it further to avoid kernel activity on those CPUs as much as possible.
>> I recently added the per-cpuset flag 'sched_load_balance' for some
>> other realtime folks, so that they can disable the kernel scheduler
>> load balancing on isolated CPUs. It essentially allows for dynamic
>> control of which CPUs are isolated by the scheduler, using the cpuset
>> hierarchy, rather than enhancing the 'isolated_cpus' mask. That
>> 'isolated_cpus' mask remained a minimal kernel boottime parameter.
>> I believe this went to Linus's tree about Oct 2007.
>>
>> It looks like you have three additional tweaks for realtime in this
>> patch set, with your patches:
>>
>> [PATCH] [CPUISOL] Do not route IRQs to the CPUs isolated at boot
>
> I didn't know we still routed IRQs to isolated CPUs. I guess I need to
> look deeper into the code on this one. But I agree that isolated CPUs
> should not have IRQs routed to them.
Also note that it's just a convenience feature. In other words it's not that with this patch
we'll never route IRQs to those CPUs. They can still be explicitly routed by writing to
/proc/irq/N/smp_affinity.
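For example, to route a particular device's interrupt to isolated CPU 1 only (the IRQ number
here is made up; smp_affinity takes a hex CPU bitmask):

    echo 2 > /proc/irq/42/smp_affinity    # bit 1 set, i.e. CPU 1 only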
>> [PATCH] [CPUISOL] Support for workqueue isolation
>
> The thing about workqueues is that they should only be woken on a CPU if
> something on that CPU accessed them. IOW, the workqueue on a CPU handles
> work that was called by something on that CPU. Which means that
> something that high prio task did triggered a workqueue to do some work.
> But this can also be triggered by interrupts, so by keeping interrupts
> off the CPU no workqueue should be activated.
No no no. That's what I thought too ;-). The problem is that things like NFS and friends
expect _all_ their workqueue threads to report back when they do certain things like
flushing buffers and stuff. The reason I added this is because my machines were getting
stuck because CPU0 was waiting for CPU1 to run NFS workqueue threads even though no IRQs
or other things were running on it.
>> [PATCH] [CPUISOL] Isolated CPUs should be ignored by the "stop machine"
>
> This I find very dangerous. We are making an assumption that tasks on an
> isolated CPU wont be doing things that stopmachine requires. What stops
> a task on an isolated CPU from calling something into the kernel that
> stop_machine requires to halt?
I agree in general. The thing is, though, that stop machine just kills any kind of latency
guarantees. Without the patch the machine just hangs waiting for the stop-machine to run
when a module is inserted/removed. And running without dynamic module loading is not very
practical on general purpose machines. So I'd rather have an option with a big red warning
than no option at all :).
Thanx
Max
Peter Zijlstra wrote:
> On Mon, 2008-01-28 at 11:34 -0500, Steven Rostedt wrote:
>> On Mon, Jan 28, 2008 at 08:59:10AM -0600, Paul Jackson wrote:
>>> Thanks for the CC, Peter.
>> Thanks from me too.
>>
>>> Max wrote:
>>>> We've had scheduler support for CPU isolation ever since O(1) scheduler went it.
>>>> I'd like to extend it further to avoid kernel activity on those CPUs as much as possible.
>>> I recently added the per-cpuset flag 'sched_load_balance' for some
>>> other realtime folks, so that they can disable the kernel scheduler
>>> load balancing on isolated CPUs. It essentially allows for dynamic
>>> control of which CPUs are isolated by the scheduler, using the cpuset
>>> hierarchy, rather than enhancing the 'isolated_cpus' mask. That
>>> 'isolated_cpus' mask remained a minimal kernel boottime parameter.
>>> I believe this went to Linus's tree about Oct 2007.
>>>
>>> It looks like you have three additional tweaks for realtime in this
>>> patch set, with your patches:
>>>
>>> [PATCH] [CPUISOL] Do not route IRQs to the CPUs isolated at boot
>> I didn't know we still routed IRQs to isolated CPUs. I guess I need to
>> look deeper into the code on this one. But I agree that isolated CPUs
>> should not have IRQs routed to them.
>
> While I agree with this in principle, I'm not sure flat out denying all
> IRQs to these cpus is a good option. What about the case where we want
> to service just this one specific IRQ on this CPU and no others?
>
> Can't this be done by userspace irq routing as used by irqbalanced?
Peter, I think you missed the point of this patch. It's just a convenience feature.
It simply excludes isolated CPUs from the IRQ smp affinity masks. That's all. What did you
mean by "flat out denying all IRQs to these cpus"? IRQs can still be routed to them
by writing to /proc/irq/N/smp_affinity.
Also, this happens naturally when we bring a CPU off-line and then bring it back online,
i.e. when a CPU comes back online it's excluded from the IRQ smp_affinity masks even without
my patch.
>>> [PATCH] [CPUISOL] Support for workqueue isolation
>> The thing about workqueues is that they should only be woken on a CPU if
>> something on that CPU accessed them. IOW, the workqueue on a CPU handles
>> work that was called by something on that CPU. Which means that
>> something that high prio task did triggered a workqueue to do some work.
>> But this can also be triggered by interrupts, so by keeping interrupts
>> off the CPU no workqueue should be activated.
>
> Quite so, if nobody uses it, there is no harm in having them around. If
> they are used, its by someone already allowed on the cpu.
No no no. I just replied to Steven about that. The problem is that things like NFS and
friends expect _all_ their workqueue threads to report back when they do certain things
like flushing buffers and stuff. The reason I added this is because my machines were
getting stuck because CPU0 was waiting for CPU1 to run NFS workqueue threads even though
no IRQs, softirqs or other things were running on it.
>>> [PATCH] [CPUISOL] Isolated CPUs should be ignored by the "stop machine"
>> This I find very dangerous. We are making an assumption that tasks on an
>> isolated CPU wont be doing things that stopmachine requires. What stops
>> a task on an isolated CPU from calling something into the kernel that
>> stop_machine requires to halt?
>
> Very dangerous indeed!
Please see my reply to Steven. I agree it's somewhat dangerous. What we could do is make it
configurable with a big fat warning. In other words I'd rather have an option than just say
"do not use dynamic module loading" on those systems.
Max
On Mon, 28 Jan 2008, Max Krasnyanskiy wrote:
> >> [PATCH] [CPUISOL] Support for workqueue isolation
> >
> > The thing about workqueues is that they should only be woken on a CPU if
> > something on that CPU accessed them. IOW, the workqueue on a CPU handles
> > work that was called by something on that CPU. Which means that
> > something that high prio task did triggered a workqueue to do some work.
> > But this can also be triggered by interrupts, so by keeping interrupts
> > off the CPU no workqueue should be activated.
> No no no. That's what I though too ;-). The problem is that things like NFS and friends
> expect _all_ their workqueue threads to report back when they do certain things like
> flushing buffers and stuff. The reason I added this is because my machines were getting
> stuck because CPU0 was waiting for CPU1 to run NFS work queue threads even though no IRQs
> or other things are running on it.
This sounds more like we should fix NFS than add this for all workqueues.
Again, we want workqueues to run on behalf of whatever is running on
that CPU, including those tasks that are running on an isolcpu.
>
> >> [PATCH] [CPUISOL] Isolated CPUs should be ignored by the "stop machine"
> >
> > This I find very dangerous. We are making an assumption that tasks on an
> > isolated CPU wont be doing things that stopmachine requires. What stops
> > a task on an isolated CPU from calling something into the kernel that
> > stop_machine requires to halt?
> I agree in general. The thing is though that stop machine just kills any kind of latency
> guaranties. Without the patch the machine just hangs waiting for the stop-machine to run
> when module is inserted/removed. And running without dynamic module loading is not very
> practical on general purpose machines. So I'd rather have an option with a big red warning
> than no option at all :).
Well, that's something one of the greater powers (Linus, Andrew, Ingo)
must decide. ;-)
-- Steve
Max wrote:
> So far it seems that extending cpu_isolated_map
> is more natural way of propagating this notion to the rest of the kernel.
> Since it's very similar to the cpu_online_map concept and it's easy to integrated
> with the code that already uses it.
If it were just realtime support, then I suspect I'd agree that
extending cpu_isolated_map makes more sense.
But some people use realtime on systems that are also heavily
managed using cpusets. The two have to work together. I have
customers with systems running realtime on a few CPUs, at the
same time that they have a large batch scheduler (which is layered
on top of cpusets) managing jobs on a few hundred other CPUs.
Hence with the cpuset 'sched_load_balance' flag I think I've already
done what I think is one part of what your patches achieve by extending
the cpu_isolated_map.
This is a common situation with "resource management" mechanisms such
as cpusets (and more recently cgroups and the subsystem modules it
supports.) They cut across existing core kernel code that manages such
key resources as CPUs and memory. As best we can, they have to work
with each other.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214
Max wrote:
> Also "CPU sets" seem to mostly deal with the scheduler domains.
True - though "cpusets" (no space ;) sched_load_balance flag can
be used to see that some CPUs are not in any scheduler domain,
which is equivalent to not having the scheduler run on them.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214
On Mon, 2008-01-28 at 14:00 -0500, Steven Rostedt wrote:
>
> On Mon, 28 Jan 2008, Max Krasnyanskiy wrote:
> > >> [PATCH] [CPUISOL] Support for workqueue isolation
> > >
> > > The thing about workqueues is that they should only be woken on a CPU if
> > > something on that CPU accessed them. IOW, the workqueue on a CPU handles
> > > work that was called by something on that CPU. Which means that
> > > something that high prio task did triggered a workqueue to do some work.
> > > But this can also be triggered by interrupts, so by keeping interrupts
> > > off the CPU no workqueue should be activated.
>
> > No no no. That's what I though too ;-). The problem is that things like NFS and friends
> > expect _all_ their workqueue threads to report back when they do certain things like
> > flushing buffers and stuff. The reason I added this is because my machines were getting
> > stuck because CPU0 was waiting for CPU1 to run NFS work queue threads even though no IRQs
> > or other things are running on it.
>
> This sounds more like we should fix NFS than add this for all workqueues.
> Again, we want workqueues to run on the behalf of whatever is running on
> that CPU, including those tasks that are running on an isolcpu.
agreed, by looking at my top output (and not the nfs code) it looks like
it just spawns a configurable number of active kernel threads which are
not cpu bound in any way. I think just removing the isolated cpus
from their runnable mask should take care of them.
>
> >
> > >> [PATCH] [CPUISOL] Isolated CPUs should be ignored by the "stop machine"
> > >
> > > This I find very dangerous. We are making an assumption that tasks on an
> > > isolated CPU wont be doing things that stopmachine requires. What stops
> > > a task on an isolated CPU from calling something into the kernel that
> > > stop_machine requires to halt?
>
> > I agree in general. The thing is though that stop machine just kills any kind of latency
> > guaranties. Without the patch the machine just hangs waiting for the stop-machine to run
> > when module is inserted/removed. And running without dynamic module loading is not very
> > practical on general purpose machines. So I'd rather have an option with a big red warning
> > than no option at all :).
>
> Well, that's something one of the greater powers (Linus, Andrew, Ingo)
> must decide. ;-)
I'm in favour of a better engineered method; that is, we really should try
to solve these problems in a proper way. Hacks like this might be fine
for custom kernels, but I think we should have a higher standard when it
comes to upstream - we all have to live many years with whatever we put
in there, so we'd better think well about it.
Peter Zijlstra wrote:
> On Mon, 2008-01-28 at 14:00 -0500, Steven Rostedt wrote:
>> On Mon, 28 Jan 2008, Max Krasnyanskiy wrote:
>>>>> [PATCH] [CPUISOL] Support for workqueue isolation
>>>> The thing about workqueues is that they should only be woken on a CPU if
>>>> something on that CPU accessed them. IOW, the workqueue on a CPU handles
>>>> work that was called by something on that CPU. Which means that
>>>> something that high prio task did triggered a workqueue to do some work.
>>>> But this can also be triggered by interrupts, so by keeping interrupts
>>>> off the CPU no workqueue should be activated.
>>> No no no. That's what I though too ;-). The problem is that things like NFS and friends
>>> expect _all_ their workqueue threads to report back when they do certain things like
>>> flushing buffers and stuff. The reason I added this is because my machines were getting
>>> stuck because CPU0 was waiting for CPU1 to run NFS work queue threads even though no IRQs
>>> or other things are running on it.
>> This sounds more like we should fix NFS than add this for all workqueues.
>> Again, we want workqueues to run on the behalf of whatever is running on
>> that CPU, including those tasks that are running on an isolcpu.
>
> agreed, by looking at my top output (and not the nfs code) it looks like
> it just spawns a configurable number of active kernel threads which are
> not cpu bound by in any way. I think just removing the isolated cpus
> from their runnable mask should take care of them.
Actually NFS was just one example. I cannot remember off the top of my head what else was
there, but there are definitely other users of workqueues that expect all their threads to
run at some point in time.
Also, if you think about it, the patch does _exactly_ what you propose. It removes workqueue
threads from isolated CPUs. But instead of doing it just for NFS and/or other subsystems
separately, it does it in a generic way by simply not starting those threads in the first
place.
>>>>> [PATCH] [CPUISOL] Isolated CPUs should be ignored by the "stop machine"
>>>> This I find very dangerous. We are making an assumption that tasks on an
>>>> isolated CPU wont be doing things that stopmachine requires. What stops
>>>> a task on an isolated CPU from calling something into the kernel that
>>>> stop_machine requires to halt?
>>> I agree in general. The thing is though that stop machine just kills any kind of latency
>>> guaranties. Without the patch the machine just hangs waiting for the stop-machine to run
>>> when module is inserted/removed. And running without dynamic module loading is not very
>>> practical on general purpose machines. So I'd rather have an option with a big red warning
>>> than no option at all :).
>> Well, that's something one of the greater powers (Linus, Andrew, Ingo)
>> must decide. ;-)
>
> I'm in favour of better engineered method, that is, we really should try
> to solve these problems in a proper way. Hacks like this might be fine
> for custom kernels, but I think we should have a higher standard when it
> comes to upstream - we all have to live many years with whatever we put
> in there, we'd better think well about it.
100% agree. That's why I mentioned that this patch is controversial in the first place.
Right now, short of rewriting module loading to not use stop machine, there is no other
option. I'll think some more about it. If you guys have other ideas please drop me a note.
Thanx
Max
Paul Jackson wrote:
> Max wrote:
>> So far it seems that extending cpu_isolated_map
>> is more natural way of propagating this notion to the rest of the kernel.
>> Since it's very similar to the cpu_online_map concept and it's easy to integrated
>> with the code that already uses it.
>
> If it were just realtime support, then I suspect I'd agree that
> extending cpu_isolated_map makes more sense.
>
> But some people use realtime on systems that are also heavily
> managed using cpusets. The two have to work together. I have
> customers with systems running realtime on a few CPUs, at the
> same time that they have a large batch scheduler (which is layered
> on top of cpusets) managing jobs on a few hundred other CPUs.
> Hence with the cpuset 'sched_load_balance' flag I think I've already
> done what I think is one part of what your patches achieve by extending
> the cpu_isolated_map.
>
> This is a common situation with "resource management" mechanisms such
> as cpusets (and more recently cgroups and the subsystem modules it
> supports.) They cut across existing core kernel code that manages such
> key resources as CPUs and memory. As best we can, they have to work
> with each other.
Thanks for the info Paul. I'll definitely look into using this flag instead
and reply with pros and cons (if any).
Max
On Mon, 2008-01-28 at 10:32 -0800, Max Krasnyanskiy wrote:
> Just this patches. RT patches cannot achieve what I needed. Even RTAI/Xenomai can't do that.
> For example I have separate tasks with hard deadlines that must be enforced in 50usec kind
> of range and basically no idle time whatsoever. Just to give more background it's a wireless
> basestation with SW MAC/Scheduler. Another requirement is for the SW to know precise timing
> because SW. For example there is no way we can do predictable 1-2 usec sleeps.
> So I wrote a user-space engine that does all this, it requires full control of the CPU ie minimal
> overhead from the kernel, just IPIs for memory management and that's basically it. When my legal
> department lets me I'll do a presentation on this stuff at Linux RT conference or something.
What kind of hardware are you doing this on? Also I should note there is
HRT (high resolution timers) which provides microsecond level
granularity ..
Daniel
Daniel Walker wrote:
> On Mon, 2008-01-28 at 10:32 -0800, Max Krasnyanskiy wrote:
>> Just this patches. RT patches cannot achieve what I needed. Even RTAI/Xenomai can't do that.
>> For example I have separate tasks with hard deadlines that must be enforced in 50usec kind
>> of range and basically no idle time whatsoever. Just to give more background it's a wireless
>> basestation with SW MAC/Scheduler. Another requirement is for the SW to know precise timing
>> because SW. For example there is no way we can do predictable 1-2 usec sleeps.
>> So I wrote a user-space engine that does all this, it requires full control of the CPU ie minimal
>> overhead from the kernel, just IPIs for memory management and that's basically it. When my legal
>> department lets me I'll do a presentation on this stuff at Linux RT conference or something.
>
> What kind of hardware are you doing this on?
All kinds of HW. I mentioned it in the intro email.
Here are the highlights
HP XW9300 (Dual Opteron NUMA box) and XW9400 (Dual Core Opteron)
HP DL145 G2 (Dual Opteron) and G3 (Dual Core Opteron)
Dell Precision workstations (Core2 Duo and Quad)
Various Core2 Duo based systems uTCA boards
Mercury AXA110 (1.5Ghz)
Concurrent Tech AM110 (2.1Ghz)
This scheme should work on anything that lets you disable SMI on the isolated core(s).
> Also I should note there is HRT (High resolution timers) which provided microsecond level
> granularity ..
Not accurate enough and way too much overhead for what I need. I know at this point it probably
sounds like I'm talking BS :). I wish I had released the engine and examples by now. Anyway let
me just say that the SW MAC has crazy tight deadlines with lots of small tasks. Using nanosleep() &
gettimeofday() is simply not practical. So it's all TSC based with clever time sync logic between
the HW and SW.
Max
On Mon, 2008-01-28 at 16:12 -0800, Max Krasnyanskiy wrote:
> Not accurate enough and way too much overhead for what I need. I know at this point it probably
> sounds like I'm talking BS :). I wish I've released the engine and examples by now. Anyway let
> me just say that SW MAC has crazy tight deadlines with lots of small tasks. Using nanosleep() &
> gettimeofday() is simply not practical. So it's all TSC based with clever time sync logic between
> HW and SW.
I don't know if it's BS or not, you clearly fixed your own problem which
is good .. Although when you say "RT patches cannot achieve what I
needed. Even RTAI/Xenomai can't do that.", and HRT is "Not accurate
enough and way too much overhead" .. Given the hardware you're using,
that's all difficult to believe .. You also said this code has been
running on production systems for two years, which means it's at least
two years old .. There's been some good sized leaps in real time linux
in the past two years ..
Daniel
[email protected] wrote:
> Following patch series extends CPU isolation support. Yes, most people want to virtuallize
> CPUs these days and I want to isolate them :).
> The primary idea here is to be able to use some CPU cores as dedicated engines for running
> user-space code with minimal kernel overhead/intervention, think of it as an SPE in the
> Cell processor.
>
> We've had scheduler support for CPU isolation ever since O(1) scheduler went it.
> I'd like to extend it further to avoid kernel activity on those CPUs as much as possible.
> In fact that the primary distinction that I'm making between say "CPU sets" and
> "CPU isolation". "CPU sets" let you manage user-space load while "CPU isolation" provides
> a way to isolate a CPU as much as possible (including kernel activities).
>
> I'm personally using this for hard realtime purposes. With CPU isolation it's very easy to
> achieve single digit usec worst case and around 200 nsec average response times on off-the-shelf
> multi- processor/core systems under exteme system load. I'm working with legal folks on releasing
> hard RT user-space framework for that.
> I can also see other application like simulators and stuff that can benefit from this.
>
> I've been maintaining this stuff since around 2.6.18 and it's been running in production
> environment for a couple of years now. It's been tested on all kinds of machines, from NUMA
> boxes like HP xw9300/9400 to tiny uTCA boards like Mercury AXA110.
> The messiest part used to be SLAB garbage collector changes. With the new SLUB all that mess
> goes away (ie no changes necessary). Also CFS seems to handle CPU hotplug much better than O(1)
> did (ie domains are recomputed dynamically) so that isolation can be done at any time (via sysfs).
> So this seems like a good time to merge.
>
> Anyway. The patchset consist of 5 patches. First three are very simple and non-controversial.
> They simply make "CPU isolation" a configurable feature, export cpu_isolated_map and provide
> some helper functions to access it (just like cpu_online() and friends).
> Last two patches add support for isolating CPUs from running workqueus and stop machine.
> More details in the individual patch descriptions.
>
> Ideally I'd like all of this to go in during this merge window. If people think it's acceptable
> Linus or Andrew (or whoever is more appropriate Ingo maybe) can pull this patch set from
> git://git.kernel.org/pub/scm/linux/kernel/git/maxk/cpuisol-2.6.git
>
It's good to hear from someone else who thinks a multi-processor
box _should_ be able to run a CPU intensive (100%) RT app on one of the
processors without adversely affecting or being affected by the others.
I have had issues that were _traced_ back to the fact that I am doing
just that. All I got was "you can't do that" or "we don't support that
kind of thing in the Linux kernel".
One example: Andrew Morton's feedback to the LKML thread "floppy.c soft
lockup".
Good luck with this. I hope this gets someone's attention.
BTW, I have tried your patches against a vanilla 2.6.24 kernel but was
not successful.
# echo '1' > /sys/devices/system/cpu/cpu1/isolated
bash: echo: write error: Device or resource busy
The cpuisol=1 cmdline option yields:
harley:# cat /sys/devices/system/cpu/cpu1/isolated
0
harley:# cat /proc/cmdline
root=/dev/sda3 vga=normal apm=off selinux=0 noresume splash=silent
kmalloc=192M cpuisol=1
Regards
Mark
Hi Mark,
> [email protected] wrote:
>> Following patch series extends CPU isolation support. Yes, most people want to virtuallize
>> CPUs these days and I want to isolate them :).
>> The primary idea here is to be able to use some CPU cores as dedicated engines for running
>> user-space code with minimal kernel overhead/intervention, think of it as an SPE in the
>> Cell processor.
>>
>> We've had scheduler support for CPU isolation ever since O(1) scheduler went it.
>> I'd like to extend it further to avoid kernel activity on those CPUs as much as possible.
>> In fact that the primary distinction that I'm making between say "CPU sets" and
>> "CPU isolation". "CPU sets" let you manage user-space load while "CPU isolation" provides
>> a way to isolate a CPU as much as possible (including kernel activities).
>>
>> I'm personally using this for hard realtime purposes. With CPU isolation it's very easy to
>> achieve single digit usec worst case and around 200 nsec average response times on off-the-shelf
>> multi- processor/core systems under exteme system load. I'm working with legal folks on releasing
>> hard RT user-space framework for that.
>> I can also see other application like simulators and stuff that can benefit from this.
>>
>> I've been maintaining this stuff since around 2.6.18 and it's been running in production
>> environment for a couple of years now. It's been tested on all kinds of machines, from NUMA
>> boxes like HP xw9300/9400 to tiny uTCA boards like Mercury AXA110.
>> The messiest part used to be SLAB garbage collector changes. With the new SLUB all that mess
>> goes away (ie no changes necessary). Also CFS seems to handle CPU hotplug much better than O(1)
>> did (ie domains are recomputed dynamically) so that isolation can be done at any time (via sysfs).
>> So this seems like a good time to merge.
>>
>> Anyway. The patchset consist of 5 patches. First three are very simple and non-controversial.
>> They simply make "CPU isolation" a configurable feature, export cpu_isolated_map and provide
>> some helper functions to access it (just like cpu_online() and friends).
>> Last two patches add support for isolating CPUs from running workqueus and stop machine.
>> More details in the individual patch descriptions.
>>
>> Ideally I'd like all of this to go in during this merge window. If people think it's acceptable
>> Linus or Andrew (or whoever is more appropriate Ingo maybe) can pull this patch set from
>> git://git.kernel.org/pub/scm/linux/kernel/git/maxk/cpuisol-2.6.git
>>
>
> It's good to see hear from someone else that thinks a multi-processor
> box _should_ be able to run a CPU intensive (%100) RT app on one of the
> processors without adversely affecting or being affected by the others.
> I have had issues that were _traced_ back to the fact that I am doing
> just that. All I got was, you can't do that or we don't support that
> kind of thing in the Linux kernel.
>
> One example, Andrew Mortons feedback to the LKML thread "floppy.c soft lockup"
>
> Good luck with this. I hope this gets someones attention.
Thanks for the support. I do the best I can, because just like you I believe that it's
a perfectly valid workload and there are a lot of interesting applications that will benefit
from mainline support.
> BTW, I have tried your patches against a vanilla 2.6.24 kernel but am
> not successful.
>
> # echo '1' > /sys/devices/system/cpu/cpu1/isolated
> bash: echo: write error: Device or resource busy
You have to bring it offline first.
In other words:
echo 0 > /sys/devices/system/cpu/cpu1/online
echo 1 > /sys/devices/system/cpu/cpu1/isolated
echo 1 > /sys/devices/system/cpu/cpu1/online
> The cpuisol=1 cmdline option yields:
>
> harley:# cat /sys/devices/system/cpu/cpu1/isolated
> 0
>
> harley:# cat /proc/cmdline
> root=/dev/sda3 vga=normal apm=off selinux=0 noresume splash=silent
> kmalloc=192M cpuisol=1
Sorry, my bad. I had a typo in the patch description; the option is "isolcpus=N".
We've had that option for a while now; it's not even part of my patch.
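With your setup the command line would then look something like:

    root=/dev/sda3 vga=normal apm=off selinux=0 noresume splash=silent
    kmalloc=192M isolcpus=1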
Thanx
Max
Paul Jackson wrote:
> Max wrote:
>> So far it seems that extending cpu_isolated_map
>> is the more natural way of propagating this notion to the rest of the kernel,
>> since it's very similar to the cpu_online_map concept and it's easy to integrate
>> with the code that already uses it.
>
> If it were just realtime support, then I suspect I'd agree that
> extending cpu_isolated_map makes more sense.
>
> But some people use realtime on systems that are also heavily
> managed using cpusets. The two have to work together. I have
> customers with systems running realtime on a few CPUs, at the
> same time that they have a large batch scheduler (which is layered
> on top of cpusets) managing jobs on a few hundred other CPUs.
> Hence with the cpuset 'sched_load_balance' flag I think I've already
> done what I think is one part of what your patches achieve by extending
> the cpu_isolated_map.
>
> This is a common situation with "resource management" mechanisms such
> as cpusets (and more recently cgroups and the subsystem modules it
> supports.) They cut across existing core kernel code that manages such
> key resources as CPUs and memory. As best we can, they have to work
> with each other.
Hi Paul,
I thought some more about your proposal to use sched_load_balance flag in cpusets instead
of extending cpu_isolated_map. I looked at the cpusets, cgroups, latest thread started by
Peter (about sched domains and stuff) and here are my thoughts on this.
Here is the list of issues with the sched_load_balance flag from the CPU isolation
perspective:
--
(1) Boot-time isolation is not possible. There is currently no way to set up a cpuset at
boot time. For example, we won't be able to isolate cpus from irqs and workqueues at boot.
Not a major issue, but still an inconvenience.
--
(2) There is currently no easy way to figure out what cpuset a cpu belongs to in order
to query its sched_load_balance flag. In order to do that we need a method that iterates over
all active cpusets and checks their cpus_allowed masks. This implies holding the cgroup and
cpuset mutexes. It's not clear whether it's ok to do that from the contexts CPU
isolation happens in (apic, sched, workqueue). It seems that the cgroup/cpuset API is designed
for top-down access, i.e. adding a cpu to a set and then recomputing domains, which makes
perfect sense for the common cpuset use case but is not what cpu isolation needs.
In other words I think it's much simpler and cleaner to use the cpu_isolated_map for isolation
purposes.
--
(3) cpusets are a bit too dynamic :). What I mean by this is that the sched_load_balance flag
can be changed at any time without bringing a CPU offline. That means we'd
need some notifier mechanism for killing and restarting workqueue threads when that flag
changes. We'd also need some logic that makes sure a user does not disable load balancing
on all cpus, because that would effectively kill workqueues on all the cpus.
This particular case is already handled very nicely in my patches. The isolated bit can be set
only when a cpu is offline, and it cannot be set on the first online cpu. Workqueues and other
subsystems already handle cpu hotplug events nicely and can easily ignore isolated cpus when
they come online.
-----
#1 is probably unfixable. #2 and #3 can be fixed but at the expense of extra complexity across
the board. I seriously doubt that I'll be able to push that through the reviews ;-).
Also personally I still think cpusets and cpu isolation attack two different problems. cpusets
is about partitioning cpus and memory nodes, and managing tasks. Most of the cgroups/cpuset APIs
are designed to deal with tasks. CPU isolation is much simpler and is at the lower layer. It deals
with IRQs, kernel per cpu threads, etc. The only intersection I see is that both features affect
scheduling domains (cpu isolation is again simple here it just puts cpus into null domains and
that's an existing logic in sched.c nothing new here).
So here are some proposals on how we can make them play nicely with each other.
--
(A) Make cpusets aware of isolated cpus.
All we have to do here is to change
guarantee_online_cpus()
common_cpu_mem_hotplug_unplug()
to exclude cpu_isolated_map from cpu_online_map before using it.
And we'd need to change
update_cpumasks()
to simply ignore isolated cpus.
That way if a cpu is isolated it'll be ignored by the cpusets logic, which I believe would be
the correct behavior.
We're talking about a trivial ~5-line patch which will be a no-op if cpu isolation is disabled (see the sketch below).
(B) Ignore the isolated map in cpusets. That's the current state of affairs with my patches applied.
It looks like your customers are happy with what they have now, so they will probably not enable
cpu isolation anyway :).
(C) Introduce cpu_usable_map. That map would be recomputed on hotplug events. Essentially it'd be
cpu_online_map AND ~cpu_isolated_map. Convert things like cpusets to use that map instead of
the online map.
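To make (A) concrete, the ~5-line change I have in mind is basically this (just a sketch of the idea, not the actual patch; it assumes cpu_isolated_map gets exported as in my series, and the helper name is made up):

/* sketch: the mask cpusets would use instead of the raw cpu_online_map */
static inline void cpuset_usable_cpus(cpumask_t *mask)
{
        /* drop isolated cpus before the cpusets code looks at the online map */
        cpus_andnot(*mask, cpu_online_map, cpu_isolated_map);
}

guarantee_online_cpus() and friends would then use that instead of reading cpu_online_map directly.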
We can probably come up with other options. My preference would be option (A).
I can cook up a patch for this and re-send the patch series.
What do you think?
btw, my impression is that we're talking about very different use cases here. You're talking about big
machines with lots of cpus, and I'm guessing you're probably talking soft RT here, probably RT
networking services or something like that.
The use case I'm talking about is a dedicated machine for a certain task, like a HW simulator, a wireless
base station with a SW MAC, etc. For this, in any foreseeable future the most common configuration will
be 2-8 cores. cpusets are probably overkill here because apps will want to manage thread affinities
themselves anyway (for example, right now we bind soft-RT threads to CPU0 and the hard-RT thread to CPU1).
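In case it helps, the binding itself is nothing fancy; roughly this (a sketch only, and the CPU numbers above are just our current convention):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* pin the calling thread to a single cpu, e.g. the hard-RT thread to CPU1 */
static int bind_current_thread(int cpu)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}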
Sorry for the typos :)
Max
Max wrote:
> Here is the list of issues with the sched_load_balance flag from the CPU isolation
> perspective:
A separate thread happened to start up on lkml.org, shortly after
yours, that went into this in considerable detail.
For example, the interaction of cpusets, sched_load_balance,
sched_domains and real time scheduling is examined in some detail on
this thread. Everyone participating on that thread learned something
(we all came into it with less than a full picture of what's there.)
I would encourage you to read it closely. For example, the scheduler
code should not be trying to access per-cpuset attributes such as
the sched_load_balance flag (you are correct that this would be
difficult to do because of the locking; however by design, that is
not to be done.)
This thread begins at:
scheduler scalability - cgroups, cpusets and load-balancing
http://lkml.org/lkml/2008/1/29/60
Too bad we didn't think to include you in the CC list of that
thread from the beginning.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214
Paul Jackson wrote:
> Max wrote:
>> Here is the list of issues with the sched_load_balance flag from the CPU isolation
>> perspective:
>
> A separate thread happened to start up on lkml.org, shortly after
> yours, that went into this in considerable detail.
>
> For example, the interaction of cpusets, sched_load_balance,
> sched_domains and real time scheduling is examined in some detail on
> this thread. Everyone participating on that thread learned something
> (we all came into it with less than a full picture of what's there.)
>
> I would encourage you to read it closely. For example, the scheduler
> code should not be trying to access per-cpuset attributes such as
> the sched_load_balance flag (you are correct that this would be
> difficult to do because of the locking; however by design, that is
> not to be done.)
>
> This thread begins at:
>
> scheduler scalability - cgroups, cpusets and load-balancing
> http://lkml.org/lkml/2008/1/29/60
>
> Too bad we didn't think to include you in the CC list of that
> thread from the beginning.
Paul, I actually mentioned at the beginning of my email that I did read that thread
started by Peter. I did learn quite a bit from it :)
You guys did not discuss isolation stuff though. The thread was only about scheduling
and my cpu isolation extension patches deal with other aspects.
Sounds like at this point we're in agreement that sched_load_balance is not suitable
for what I'd like to achieve. But how about making cpusets aware of the cpu_isolated_map ?
Even without my patches it's somewhat of an issue right now. I mean, if you use the isolcpus=
boot option to put cpus into the null domain, cpusets will not be aware of it. The result may be
a bit confusing if an isolated cpu is added to some cpuset.
Max
Max wrote:
> Paul, I actually mentioned at the beginning of my email that I did read that thread
> started by Peter. I did learn quite a bit from it :)
Ah - sorry - I missed that part. However, I'm still getting the feeling
that there were some key points in that thread that we have not managed
to communicate successfully.
> Sounds like at this point we're in agreement that sched_load_balance is not suitable
> for what I'd like to achieve.
I don't think we're in agreement; I think we're in confusion ;)
Yes, sched_load_balance does not *directly* have anything to do with
this.
But indirectly it is a critical element in what I think you'd like to
achieve. It affects how the cpuset code sets up sched_domains, and
if I understand correctly, you require either (1) some sched_domains to
only contain RT tasks, or (2) some CPUs to be in no sched_domain at all.
Proper configuration of the cpuset hierarchy, including the setting of
the per-cpuset sched_load_balance flag, can provide either of these
sched_domain partitions, as desired.
> But how about making cpusets aware of the cpu_isolated_map ?
No. That's confusing cpusets and the scheduler again.
The cpu_isolated_map is a file static variable known only within
the kernel/sched.c file; this should not change.
Presently, the boot parameter isolcpus= is just used to initialize
what CPUs are isolated at boot, and then the sched_domain partitioning,
as done in kernel/sched.c:partition_sched_domains() (the hook into
the sched code that cpusets uses) determines which CPUs are isolated
from that point forward. I doubt that this should change either.
In that thread referenced above, did you see the part where RT is
achieved not by isolating CPUs from any scheduler, but rather by
polymorphically having several schedulers available to operate on each
sched_domain, and having RT threads self-select the RT scheduler?
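By "self-select" I just mean the thread switching itself to an RT policy; a minimal sketch, with the priority value picked arbitrarily:

#include <sched.h>

/* switch the calling thread to the SCHED_FIFO real-time class */
static int become_rt(void)
{
        struct sched_param sp = { .sched_priority = 50 };

        return sched_setscheduler(0, SCHED_FIFO, &sp);
}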
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214
Paul Jackson wrote:
> Max wrote:
>> Paul, I actually mentioned at the beginning of my email that I did read that thread
>> started by Peter. I did learn quite a bit from it :)
>
> Ah - sorry - I missed that part. However, I'm still getting the feeling
> that there were some key points in that thread that we have not managed
> to communicate successfully.
I think you are assuming that I only need to deal with the RT scheduler and scheduler
domains, which is not correct. See below.
>> Sounds like at this point we're in agreement that sched_load_balance is not suitable
>> for what I'd like to achieve.
>
> I don't think we're in agreement; I think we're in confusion ;)
Yeah. I don't believe I'm the confused side though ;-)
> Yes, sched_load_balance does not *directly* have anything to do with this.
>
> But indirectly it is a critical element in what I think you'd like to
> achieve. It affects how the cpuset code sets up sched_domains, and
> if I understand correctly, you require either (1) some sched_domains to
> only contain RT tasks, or (2) some CPUs to be in no sched_domain at all.
>
> Proper configuration of the cpuset hierarchy, including the setting of
> the per-cpuset sched_load_balance flag, can provide either of these
> sched_domain partitions, as desired.
Again you're assuming that scheduling domain partitioning satisfies my requirements
or addresses my use case. It does not. See below for more details.
>> But how about making cpusets aware of the cpu_isolated_map ?
>
> No. That's confusing cpusets and the scheduler again.
>
> The cpu_isolated_map is a file static variable known only within
> the kernel/sched.c file; this should not change.
I completely disagree. In fact I think all the cpu_xxx_map (online, present, isolated)
variables do not belong in the scheduler code. I'm thinking of submitting a patch that
factors them out into kernel/cpumask.c. We already have cpumask.h.
> Presently, the boot parameter isolcpus= is just used to initialize
> what CPUs are isolated at boot, and then the sched_domain partitioning,
> as done in kernel/sched.c:partition_sched_domains() (the hook into
> the sched code that cpusets uses) determines which CPUs are isolated
> from that point forward. I doubt that this should change either.
Sure, I did not even touch that part. I just proposed to extend the meaning of the
'isolated' bit.
> In that thread referenced above, did you see the part where RT is
> achieved not by isolating CPUs from any scheduler, but rather by
> polymorphically having several schedulers available to operate on each
> sched_domain, and having RT threads self-select the RT scheduler?
Absolutely, yes, I saw that part. But it has nothing to do with my use case.
Looks like I failed to explain what I'm trying to achieve. So let me try again.
I'd like to be able to run a CPU intensive (100%) RT task on one of the processors without
adversely affecting or being affected by the other system activities. System activities
here include _kernel_ activities as well. Hence the proposal to extend the current CPU
isolation feature.
The new definition of the CPU isolation would be:
---
1. Isolated CPU(s) must not be subject to scheduler load balancing.
   Users must explicitly bind threads in order to run on those CPU(s).
2. By default, interrupts must not be routed to the isolated CPU(s).
   Users must route interrupts (if any) explicitly.
3. In general, kernel subsystems must avoid activity on the isolated CPU(s) as much as possible.
   This includes workqueues, per-CPU threads, etc.
This feature is configurable and is disabled by default.
---
#1 affects the scheduler and scheduler domains. It's already supported, either by using the isolcpus= boot
option or by setting "sched_load_balance" in cpusets. I'm totally happy with the current behavior,
and my original patch did not mess with this functionality in any way.
#2 and #3 have _nothing_ to do with the scheduler or scheduler domains. I've been trying to explain
that for a few days now ;-). When you saw my patches for #2 and #3 you told me that you'd be interested
to see them implemented on top of the "sched_load_balance" flag. Here is your original reply
http://marc.info/?l=linux-kernel&m=120153260217699&w=2
So I looked into that and provided an explanation of why it would not work, or would work but would add
lots of complexity (access to internal cpuset structures, locking, etc).
My email on that is here:
http://marc.info/?l=linux-kernel&m=120180692331461&w=2
Now, I felt from the beginning that cpusets are not the right mechanism to address #2 and #3.
The best mechanism IMO is to simply provide access to the cpu_isolated_map to the rest of the kernel.
Again, the fact that cpu_isolated_map currently lives in the scheduler code does not change anything
here, because as I explained I'm proposing to extend the meaning of "CPU isolation". I provided
dynamic access to the "isolated" bit only for convenience; it does _not_ change the existing scheduler/sched
domain/cpuset logic in any way.
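Concretely, the kind of access I'm talking about is nothing more than an accessor next to cpu_online() and friends. A sketch (the CONFIG_CPUISOL name is an assumption here, whatever the Kconfig option ends up being called):

extern cpumask_t cpu_isolated_map;

#ifdef CONFIG_CPUISOL
#define cpu_isolated(cpu)	cpu_isset((cpu), cpu_isolated_map)
#else
#define cpu_isolated(cpu)	0
#endif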
Hopefully we're on the same page with regards to the "CPU isolation" now.
If not please let me know what I missed from the earlier discussions or other scheduler related threads.
---
If you think that making cpusets aware of isolated cpus is not the right thing to do that's perfectly
fine by me. I think it'd be better if they were but we can keep things the way they are right now.
Max
Hi Daniel,
Sorry for not replying right away.
Daniel Walker wrote:
> On Mon, 2008-01-28 at 16:12 -0800, Max Krasnyanskiy wrote:
>
>> Not accurate enough and way too much overhead for what I need. I know at this point it probably
>> sounds like I'm talking BS :). I wish I've released the engine and examples by now. Anyway let
>> me just say that SW MAC has crazy tight deadlines with lots of small tasks. Using nanosleep() &
>> gettimeofday() is simply not practical. So it's all TSC based with clever time sync logic between
>> HW and SW.
>
> I don't know if it's BS or not, you clearly fixed your own problem which
> is good .. Although when you say "RT patches cannot achieve what I
> needed. Even RTAI/Xenomai can't do that." , and HRT is "Not accurate
> enough and way too much overhead" .. Given the hardware you're using,
> that's all difficult to believe.. You also said this code has been
> running on production systems for two year, which means it's at least
> two years old .. There's been some good sized leaps in real time linux
> in the past two years ..
I've actually been tracking the RT patches fairly closely. I can't say I've tried all of them, but I do try
them from time to time. I just got the latest 2.6.24-rt1 running on an HP xw9300. It looks like it does not handle
CPU hotplug very well; I managed to kill it by bringing cpu 1 off-line. So I cannot run any tests right
now; I will run some tomorrow.
For now let me mention that I have a simple test that sleeps for a millisecond, then does some bitbanging
for 200 usec. It measures jitter caused by the periodic scheduler tick, IPIs and other kernel activities.
With high-res timers disabled, on most of the machines I mentioned before it shows around 1-1.2 usec worst case.
With high-res timers enabled it shows 5-6 usec. This is with 2.6.24 running on an isolated CPU. Forget about
using a user-space timer (nanosleep(), etc); even the scheduler tick itself is fairly heavy.
A gettimeofday() call on that machine takes 2-3 usec on average (not a vsyscall), and the SW MAC is all about precise
timing. That's why I said it's not practical for me to use that stuff. I do not see anything in the -rt kernel
that would improve this.
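To give an idea of what the test does, here is a stripped-down sketch (not the actual code; the TSC frequency is hardcoded here purely for illustration and would be taken from the real machine):

#include <stdio.h>
#include <stdint.h>
#include <time.h>

static inline uint64_t rdtsc(void)
{
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
        const uint64_t tsc_khz = 2400000;       /* assumed 2.4 GHz part, adjust */
        const struct timespec ms = { 0, 1000000 };
        uint64_t worst = 0;
        int i;

        for (i = 0; i < 10000; i++) {
                uint64_t start, prev, now;

                nanosleep(&ms, NULL);           /* sleep ~1 msec */
                start = prev = rdtsc();
                /* spin for ~200 usec, record the largest gap between TSC reads */
                while ((now = rdtsc()) - start < 200 * tsc_khz / 1000) {
                        if (now - prev > worst)
                                worst = now - prev;
                        prev = now;
                }
        }
        printf("worst gap: %llu usec\n",
               (unsigned long long)(worst * 1000 / tsc_khz));
        return 0;
}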
This is btw not to say that -rt kernel is not useful for my app in general. We have a bunch of soft-RT threads
that talk to the MAC thread. Those would definitely benefit. I think cpu isolation + -rt would work beautifully
for wireless basestations.
Max
Max wrote:
> Looks like I failed to explain what I'm trying to achieve. So let me try again.
Well done. I read through that, expecting to disagree or at least
to not understand at some point, and got all the way through nodding
my head in agreement. Good.
Whether the earlier confusions were lack of clarity in the presentation,
or lack of competence in my brain ... well, I guess I don't want to ask that
question ;).
Well ... just one minor point:
Max wrote in reply to pj:
> > The cpu_isolated_map is a file static variable known only within
> > the kernel/sched.c file; this should not change.
> I completely disagree. In fact I think all the cpu_xxx_map (online, present, isolated)
> variables do not belong in the scheduler code. I'm thinking of submitting a patch that
> factors them out into kernel/cpumask.c We already have cpumask.h.
Huh? Why would you want to do that?
For one thing, the map being discussed here, cpu_isolated_map,
is only used in sched.c, so why publish it wider?
And for another thing, we already declare externs in cpumask.h for
the other, more widely used, cpu_*_map variables cpu_possible_map,
cpu_online_map, and cpu_present_map.
Other than that detail, we seem to be communicating and in agreement on
your first item, isolating CPU scheduler load balancing. Good.
On your other two items, irq and workqueue isolation, which I had
suggested doing via cpuset sched_load_balance, I now agree that that
wasn't a good idea.
I am still a little surprised at using isolation extensions to
disable irqs on select CPUs; but others have thought far more about
irqs than I have, so I'll be quiet.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214
Paul Jackson wrote:
> Max wrote:
>> Looks like I failed to explain what I'm trying to achieve. So let me try again.
>
> Well done. I read through that, expecting to disagree or at least
> to not understand at some point, and got all the way through nodding
> my head in agreement. Good.
>
> Whether the earlier confusions were lack of clarity in the presentation,
> or lack of competence in my brain ... well guess I don't want to ask that
> question ;).
:)
> Well ... just one minor point:
>
> Max wrote in reply to pj:
>>> The cpu_isolated_map is a file static variable known only within
>>> the kernel/sched.c file; this should not change.
>> I completely disagree. In fact I think all the cpu_xxx_map (online, present, isolated)
>> variables do not belong in the scheduler code. I'm thinking of submitting a patch that
>> factors them out into kernel/cpumask.c We already have cpumask.h.
>
> Huh? Why would you want to do that?
>
> For one thing, the map being discussed here, cpu_isolated_map,
> is only used in sched.c, so why publish it wider?
>
> And for another thing, we already declare externs in cpumask.h for
> the other, more widely used, cpu_*_map variables cpu_possible_map,
> cpu_online_map, and cpu_present_map.
Well, to address #2 and #3 the isolated map will need to be exported as well.
Those other maps do not really have much to do with the scheduler code.
That's why I think either kernel/cpumask.c or kernel/cpu.c is a better place for them.
> Other than that detail, we seem to be communicating and in agreement on
> your first item, isolating CPU scheduler load balancing. Good.
>
> On your other two items, irq and workqueue isolation, which I had
> suggested doing via cpuset sched_load_balance, I now agree that that
> wasn't a good idea.
>
> I am still a little surprised at using isolation extensions to
> disable irqs on select CPUs; but others have thought far more about
> irqs than I have, so I'll be quiet.
Please note that we're not talking about completely disabling IRQs. We're talking about
not routing them to the isolated CPUs by default. It's still possible to explicitly reroute an IRQ
to the isolated CPU.
Why is this needed? It is actually very easy to explain. IRQs are a major source of latency
and overhead. IRQ handlers themselves are mostly ok, but they typically schedule softirqs, work
queues and timers on the same CPU where the IRQ is handled. In other words, if an isolated CPU is
receiving IRQs it's not really isolated, because it's running a whole bunch of different kernel
code (i.e. we're talking latencies, cache usage, etc).
Of course some folks may want to explicitly route certain IRQs to the isolated CPUs. For example,
if an app depends on the network stack it may make sense to route the IRQ from the NIC to the same
CPU the app is running on.
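The explicit rerouting is just the usual /proc/irq/<irq>/smp_affinity write; for example (a sketch only, the IRQ number is made up):

#include <stdio.h>

/* route a single IRQ to a single cpu via /proc/irq/<irq>/smp_affinity */
static int route_irq_to_cpu(int irq, int cpu)
{
        char path[64];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
        f = fopen(path, "w");
        if (!f)
                return -1;
        fprintf(f, "%x\n", 1u << cpu);          /* hex cpumask, one bit per cpu */
        return fclose(f);
}

int main(void)
{
        return route_irq_to_cpu(19, 1) ? 1 : 0; /* e.g. NIC irq -> isolated CPU1 */
}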
Max
Peter Zijlstra wrote:
> On Mon, 2008-01-28 at 14:00 -0500, Steven Rostedt wrote:
>> On Mon, 28 Jan 2008, Max Krasnyanskiy wrote:
>>>>> [PATCH] [CPUISOL] Support for workqueue isolation
>>>> The thing about workqueues is that they should only be woken on a CPU if
>>>> something on that CPU accessed them. IOW, the workqueue on a CPU handles
>>>> work that was called by something on that CPU. Which means that
>>>> something that high prio task did triggered a workqueue to do some work.
>>>> But this can also be triggered by interrupts, so by keeping interrupts
>>>> off the CPU no workqueue should be activated.
>>> No no no. That's what I thought too ;-). The problem is that things like NFS and friends
>>> expect _all_ their workqueue threads to report back when they do certain things like
>>> flushing buffers and stuff. The reason I added this is because my machines were getting
>>> stuck because CPU0 was waiting for CPU1 to run NFS work queue threads even though no IRQs
>>> or other things are running on it.
>> This sounds more like we should fix NFS than add this for all workqueues.
>> Again, we want workqueues to run on the behalf of whatever is running on
>> that CPU, including those tasks that are running on an isolcpu.
>
> agreed, by looking at my top output (and not the nfs code) it looks like
> it just spawns a configurable number of active kernel threads which are
> not cpu bound in any way. I think just removing the isolated cpus
> from their runnable mask should take care of them.
Peter, Steven,
I think I convinced you guys last time but I did not have a convincing example. So here is some
more info on why workqueues need to be aware of isolated cpus.
Here is how a work queue gets flushed.
static int flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
{
        int active;

        if (cwq->thread == current) {
                /*
                 * Probably keventd trying to flush its own queue. So simply run
                 * it by hand rather than deadlocking.
                 */
                run_workqueue(cwq);
                active = 1;
        } else {
                struct wq_barrier barr;

                active = 0;
                spin_lock_irq(&cwq->lock);
                if (!list_empty(&cwq->worklist) || cwq->current_work != NULL) {
                        insert_wq_barrier(cwq, &barr, 1);
                        active = 1;
                }
                spin_unlock_irq(&cwq->lock);

                if (active)
                        wait_for_completion(&barr.done);
        }

        return active;
}

void fastcall flush_workqueue(struct workqueue_struct *wq)
{
        const cpumask_t *cpu_map = wq_cpu_map(wq);
        int cpu;

        might_sleep();
        lock_acquire(&wq->lockdep_map, 0, 0, 0, 2, _THIS_IP_);
        lock_release(&wq->lockdep_map, 1, _THIS_IP_);
        for_each_cpu_mask(cpu, *cpu_map)
                flush_cpu_workqueue(per_cpu_ptr(wq->cpu_wq, cpu));
}
In other words it schedules some work on each cpu and expects the workqueue thread to run and
trigger the completion. This is what I meant by _all_ threads being expected to report
back even if there is nothing running on that CPU.
So my patch simply makes sure that isolated CPUs are ignored (if workqueue isolation is enabled),
i.e. workqueue threads are not started on the CPUs that are isolated.
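The shape of the change is roughly this (a rough sketch of the idea, not the actual patch; cpu_isolated_map is assumed to be exported by this series, CONFIG_CPUISOL is an assumed option name):

/* sketch: the cpu map workqueues would use for thread creation and flushing */
static void wq_usable_cpus(cpumask_t *mask, const cpumask_t *cpu_map)
{
        *mask = *cpu_map;
#ifdef CONFIG_CPUISOL
        /* drop isolated cpus so flush_workqueue() never waits on them */
        cpus_andnot(*mask, *mask, cpu_isolated_map);
#endif
}

With that, the for_each_cpu_mask() loop in flush_workqueue() above simply never visits the isolated cpus.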
Max
Max K wrote:
> > And for another thing, we already declare externs in cpumask.h for
> > the other, more widely used, cpu_*_map variables cpu_possible_map,
> > cpu_online_map, and cpu_present_map.
> Well, to address #2 and #3 isolated map will need to be exported as well.
> Those other maps do not really have much to do with the scheduler code.
> That's why I think either kernel/cpumask.c or kernel/cpu.c is a better place for them.
Well, if you need it to be exported for #2 or #3, then that's ok
by me - export it.
I'm unaware of any kernel/cpumask.c. If you meant lib/cpumask.c, then
I'd prefer you not put it there, as lib/cpumask.c just contains the
implementation details of the abstract data type cpumask_t, not any of
its uses. If you mean kernel/cpuset.c, then that's not a good choice
either, as that just contains the implementation details of the cpuset
subsystem. You should usually define such things in one of the files
using it, and unless there is clearly a -better- place to move the
definition, it's usually better to just leave it where it is.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214
Paul Jackson wrote:
> Max K wrote:
>>> And for another thing, we already declare externs in cpumask.h for
>>> the other, more widely used, cpu_*_map variables cpu_possible_map,
>>> cpu_online_map, and cpu_present_map.
>> Well, to address #2 and #3 isolated map will need to be exported as well.
>> Those other maps do not really have much to do with the scheduler code.
>> That's why I think either kernel/cpumask.c or kernel/cpu.c is a better place for them.
>
> Well, if you need it to be exported for #2 or #3, then that's ok
> by me - export it.
>
> I'm unaware of any kernel/cpumask.c. If you meant lib/cpumask.c, then
> I'd prefer you not put it there, as lib/cpumask.c just contains the
> implementation details of the abstract data type cpumask_t, not any of
> its uses. If you mean kernel/cpuset.c, then that's not a good choice
> either, as that just contains the implementation details of the cpuset
> subsystem. You should usually define such things in one of the files
> using it, and unless there is clearly a -better- place to move the
> definition, it's usually better to just leave it where it is.
I was thinking of creating a new file, kernel/cpumask.c. But it probably does not make sense
just for the masks. I'm now thinking kernel/cpu.c is the best place for it. It contains all
the cpu hotplug logic that deals with those maps, and at the very top it has stuff like:
/* Serializes the updates to cpu_online_map, cpu_present_map */
static DEFINE_MUTEX(cpu_add_remove_lock);
So it seems to make sense to keep the maps in there.
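In other words something like this would simply move in there next to the other maps (a sketch; the exact attributes are a guess):

/* cpus which are isolated from general kernel activity */
cpumask_t cpu_isolated_map __read_mostly = CPU_MASK_NONE;
EXPORT_SYMBOL(cpu_isolated_map);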
Max