(Aside to the RealTime folks -- is there a 'realtime'
email list which I should include in this discussion?)
The kernel has a "isolcpus=" kernel boot time parameter. This
parameter isolates CPUs from scheduler load balancing, minimizing the
impact of scheduler latencies on realtime tasks running on those CPUs.
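(For reference, the parameter takes a comma-separated CPU list on the kernel
command line; recent kernels also accept ranges, via the syntax extension
mentioned below. A purely illustrative boot loader entry, with made-up paths:
kernel /boot/vmlinuz ro root=/dev/sda1 isolcpus=2,3
or, with the range form, isolcpus=2-3.)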
Questions:
==========
Do you, or someone you know, use "isolcpus="?
Can we remove it?
Should we remove it?
Should we first deprecate it somehow, for a while, before
removing it?
Background:
===========
In July 2004, Dimitri Sivanich <[email protected]> proposed
"isolcpus=" for realtime isolation of CPUs from the scheduler
(http://lkml.org/lkml/2004/7/22/97).
Ingo said of it "looks good", and Nick said "Cool."
It appeared in 2.6.9 Linux kernels.
It made Item #6 of Zack Brown's Kernel Traffic #272, dated Sept
5, 2004.
It also made LWN.net Weekly Edition for October 28, 2004, at
http://lwn.net/Articles/107490/bigpage.
Dimitri's fifteen minutes of fame had begun ;).
In April 2005, Dinakar Guniguntala <[email protected]> proposed
dynamic scheduler domains (http://lkml.org/lkml/2005/4/18/187).
It was immediately recognized by Nick that this new work was a
"complete superset of the isolcpus= functionality."
Dinakar concurred, responding that he "was hoping that by the
time we are done with this, we would be able to completely get
rid of the isolcpus= option."
To which I (pj) replied "I won't miss it. Though, since it's
in the main line kernel, do you need to mark it deprecated for
a while first?"
Since then, dynamic scheduler domains and cpusets have seen much
work. See for example http://lkml.org/lkml/2007/9/30/29, which
added the sched_load_balance flag to cpusets.
However nothing much has changed with regard to the "isolcpus=" kernel
boot time parameter. This parameter is still there. In October of
2006, Derek Fults <[email protected]> did extend the syntax of the CPU
list argument to the "isolcpus=" parameter to handle CPU ranges.
Some of us (perhaps not including Ingo) tend to agree "isolcpus="
should go, but I am still recommending that we "deprecate" it in some
fashion for a while first, as I am usually opposed to suddenly
removing visible kernel features, because that breaks things.
Recently:
=========
Recently, Peter Zijlstra <[email protected]> and Max Krasnyansky
<[email protected]> have been advocating removing the "isolcpus="
option. See for example Peter's http://lkml.org/lkml/2008/2/27/378
or Max's http://lkml.org/lkml/2008/5/27/356. I've been resisting,
still advocating that we deprecate it first, before removing it,
if, that is, we even agree to remove it.
Next Step:
==========
This message begins the next steps, which are:
1) Survey the current usage of "isolcpus=". If we find evidence
of usage, then this should delay, or even argue against, the
removal of this feature.
2) Alert potential users of the change being considered here,
so that they can plan their work to adapt if we decided to
deprecate or remove the "isolcpus=" kernel boot parameter.
My recommendation (which may change with feedback from this inquiry)
would be to add a kernel printk, issued once at boot at KERN_WARNING level,
saying that isolcpus is deprecated, if isolcpus was specified. Then in some
future release, remove isolcpus (and the warning).
One possible reason for keeping "isolcpus=" could be that it is
available even when cpusets are not configured into the kernel. I don't
know whether that case is valuable to anyone or not.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214
Paul,
On Sun, Jun 01, 2008 at 09:30:19PM -0500, Paul Jackson wrote:
>
> Do you, or someone you know, use "isolcpus="?
We use it.
> Can we remove it?
We use isolcpus to ensure that boot-time initialization, specifically timer initialization, happens on a specific set of cpus that we won't be using for lower latency purposes. Some of these timers will repeatedly restart themselves on the same cpu and a few do add latency (although admittedly I haven't checked timer latency recently).
Looking at tracebacks in 2.6.26-rc3 from hrtimer_init() and internal_add_timer() things still appear to be working this way, with the timer starting on the originating cpu. If I isolate all but, say one, cpu, timers all seem to start on the unisolated cpu.
Attempts have been made to add an interface to ward timers off of specific cpus, but these have always been rejected.
>
> Should we remove it?
Why?
> Should we first deprecate it somehow, for a while, before
> removing it?
A better idea than just removing it.
Dimitri Sivanich wrote:
> Paul,
>
> On Sun, Jun 01, 2008 at 09:30:19PM -0500, Paul Jackson wrote:
>> Do you, or someone you know, use "isolcpus="?
>
> We use it.
>
>> Can we remove it?
>
> We use isolcpus to ensure that boot-time initialization, specifically timer
> initialization, happens on a specific set of cpus that we won't be using for
> lower latency purposes. Some of these timers will repeatedly restart
> themselves on the same cpu and a few do add latency (although admittedly I
> haven't checked timer latency recently).
>
> Looking at tracebacks in 2.6.26-rc3 from hrtimer_init() and
> internal_add_timer() things still appear to be working this way, with the
> timer starting on the originating cpu. If I isolate all but, say one, cpu,
> timers all seem to start on the unisolated cpu.
>
> Attempts have been made to add an interface to ward timers off of specific
> cpus, but these have always been rejected.
Ah, I know exactly what you're talking about. However this is a non-issue these
days. In order to clear cpuN of all the timers and other things, all you need
to do is bring that cpu off-line
echo 0 > /sys/devices/system/cpu/cpuN/online
and then bring it back online
echo 1 > /sys/devices/system/cpu/cpuN/online
There are currently a couple of issues with scheduler domains and hotplug
event handling. I do have the fix for them, and Paul had already acked it.
btw Disabling the scheduler load balancer is not enough. Some timers are started
from the hard- and soft-irq handlers, which means that you also have to
ensure that those CPUs do not handle any irqs (at least during
initialization). See my latest "default IRQ affinity" patch.
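For example, a rough sketch that keeps hardware irqs on cpus 0-1 (mask 0x3)
and away from the isolated cpus (the cpu numbers here are just placeholders,
and some irqs, e.g. per-cpu timer interrupts, cannot be moved this way at all):
for irq in /proc/irq/[0-9]*; do
    # writes fail harmlessly for irqs that cannot be migrated
    echo 3 > $irq/smp_affinity 2>/dev/null
done
With the "default IRQ affinity" patch applied, the default mask for newly
registered irqs can be restricted the same way, via the
/proc/irq/default_smp_affinity file that the patch adds.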
>> Should we remove it?
>
> Why?
Because the same functionality is available via a more flexible mechanism that
is actively supported. isolcpus= is a static mechanism that requires reboots.
cpusets and cpu hotplug let you dynamically repartition the system at any time.
Also isolcpus= conflicts with the scheduler domains created by the cpusets.
>
>> Should we first deprecate it somehow, for a while, before
>> removing it?
>
> A better idea than just removing it.
I'd either nuke it or expose it when cpusets are disabled.
In other words
- if cpusets are enabled people should use cpusets to configure cpu resources.
- if cpusets are disabled then we could provide a sysctl (sched_balancer_mask
for example) that lets us control which cpus are balanced and which aren't.
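(Purely for illustration - that sysctl is hypothetical and does not exist
today; usage might then look like
echo 3 > /proc/sys/kernel/sched_balancer_mask
to balance only cpus 0-1 and leave the rest alone.)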
Max
On Mon, Jun 02, 2008 at 11:39:34AM -0700, Max Krasnyansky wrote:
> Ah, I know exactly what you're talking about. However this is a non-issue these
> days. In order to clear cpuN of all the timers and other things, all you need
> to do is bring that cpu off-line
> echo 0 > /sys/devices/system/cpu/cpuN/online
> and then bring it back online
> echo 1 > /sys/devices/system/cpu/cpuN/online
Although it seemed like something of a hack, we experimented with this
previously and found that it didn't work reliably. I'm sure things
have gotten better, but will need to revisit.
>
> There are currently a couple of issues with scheduler domains and hotplug
> event handling. I do have the fix for them, and Paul had already acked it.
Until a proven reliable method for doing this is firmly in place (as
firmly as anything is, anyway), I don't think we should be removing
the alternative.
> initialization). See my latest "default IRQ affinity" patch.
Nice idea.
> Also isolcpus= conflicts with the scheduler domains created by the cpusets.
What sort of conflict are we talking about? I assume that once you've begun setting up cpusets that include those cpus, your intention is to change the original behavior.
Dimitri Sivanich wrote:
> On Mon, Jun 02, 2008 at 11:39:34AM -0700, Max Krasnyansky wrote:
>> Ah, I know exactly what you're talking about. However this is a non-issue these
>> days. In order to clear cpuN of all the timers and other things, all you need
>> to do is bring that cpu off-line
>> echo 0 > /sys/devices/system/cpu/cpuN/online
>> and then bring it back online
>> echo 1 > /sys/devices/system/cpu/cpuN/online
>
> Although it seemed like something of a hack, we experimented with this
> previously and found that it didn't work reliably. I'm sure things
> have gotten better, but will need to revisit.
Yes it used to be somewhat unstable. These days it's solid. I'm using it on a
wide range of systems: uTCA Core2Duo, NUMA dual-Opteron, 8way Core2, etc. And
things work as expected.
I forgot to mention that it's not just timers. There are also work queues and
delayed work that have similar side effects (ie they stick to the CPU they
were originally scheduled on). Hotplug cleans all that stuff very nicely.
btw I would not call it (ie using cpu hotplug for isolation purposes) a hack.
By definition hotplug must be able to migrate _everything_ running on cpuN
when it goes off-line, otherwise it simply won't work. And that's exactly what
we need for isolation too (migrate everything running on cpuN to other
cpus).
>> There are currently a couple of issues with scheduler domains and hotplug
>> event handling. I do have the fix for them, and Paul had already acked it.
>
> Until a proven reliable method for doing this is firmly in place (as
> firmly as anything is, anyway), I don't think we should be removing
> the alternative.
Agree. That's why I submitted the removal patch along with those fixes ;-).
>> initialization). See my latest "default IRQ affinity" patch.
> Nice idea.
Thanx.
>> Also isolcpus= conflicts with the scheduler domains created by the cpusets.
>
> What sort of conflict are we talking about? I assume that once you've begun
> setting up cpusets that include those cpus, your intention is to change
> the original behavior.
That's exactly where the conflict is. Let's say you boot with isolcpus=2 (ie cpu2
is not load balanced), then you add cpu2 along with cpu3 to cpuset N and
enable load balancing in cpuset N. In that case cpu2 will still remain
unbalanced, which is definitely wrong behaviour.
Max
Hi Paul,
in short: NAK!
On Monday 02 June 2008, Paul Jackson wrote:
> (Aside to the RealTime folks -- is there a 'realtime'
> email list which I should include in this discussion?)
>
> The kernel has a "isolcpus=" kernel boot time parameter. This
> parameter isolates CPUs from scheduler load balancing, minimizing the
> impact of scheduler latencies on realtime tasks running on those CPUs.
I used it to mask out a defective CPU on an 8-CPU node of an
HPC cluster at a customer site, until $BIG_VENDOR
sent a replacement. And to prove to $BIG_VENDOR that we actually
have a problem on THAT CPU.
So I would really like to keep this fault isolation capability.
I made my customer happy with that.
I wish Linux had more such "mask out bad hardware" features
to facilitate fault isolation at boot and runtime.
Best Regards
Ingo Oeser
On Tue, 2008-06-03 at 00:35 +0200, Ingo Oeser wrote:
> Hi Paul,
>
> in short: NAK!
>
> On Monday 02 June 2008, Paul Jackson wrote:
> > (Aside to the RealTime folks -- is there a 'realtime'
> > email list which I should include in this discussion?)
> >
> > The kernel has a "isolcpus=" kernel boot time parameter. This
> > parameter isolates CPUs from scheduler load balancing, minimizing the
> > impact of scheduler latencies on realtime tasks running on those CPUs.
>
> I used it to mask out a defective CPU on an 8-CPU node of an
> HPC cluster at a customer site, until $BIG_VENDOR
> sent a replacement. And to prove to $BIG_VENDOR that we actually
> have a problem on THAT CPU.
>
> So I would really like to keep this fault isolation capability.
> I made my customer happy with that.
>
> I wish Linux had more such "mask out bad hardware" features
> to facilitate fault isolation at boot and runtime.
Yeah - except that it's not meant to be used as such - it will still
bring the cpu up, and it is still usable for the OS.
So sorry, your abuse doesn't make for a case to keep this abomination.
Peter Zijlstra wrote:
> On Tue, 2008-06-03 at 00:35 +0200, Ingo Oeser wrote:
>> Hi Paul,
>>
>> in short: NAK!
>>
>> On Monday 02 June 2008, Paul Jackson wrote:
>>> (Aside to the RealTime folks -- is there a 'realtime'
>>> email list which I should include in this discussion?)
>>>
>>> The kernel has a "isolcpus=" kernel boot time parameter. This
>>> parameter isolates CPUs from scheduler load balancing, minimizing the
>>> impact of scheduler latencies on realtime tasks running on those CPUs.
>> I used it to mask out a defective CPU on an 8-CPU node of an
>> HPC cluster at a customer site, until $BIG_VENDOR
>> sent a replacement. And to prove to $BIG_VENDOR that we actually
>> have a problem on THAT CPU.
>>
>> So I would really like to keep this fault isolation capability.
>> I made my customer happy with that.
>>
>> I wish Linux had more such "mask out bad hardware" features
>> to facilitate fault isolation at boot and runtime.
>
> Yeah - except that it's not meant to be used as such - it will still
> bring the cpu up, and it is still usable for the OS.
>
> So sorry, your abuse doesn't make for a case to keep this abomination.
Ingo, I just wanted to elaborate on what Peter is saying. That CPU will still
have to be _booted_ properly. It may be used for hard- and soft- interrupt
processing, workqueues (internal kernel queuing mechanism) and kernel timers.
In your particular case you're much much much better off with doing
echo 0 > /sys/devices/system/cpu/cpuN/online
either during initrd stage or as a first init script.
That way bad cpu will be _completely_ disabled.
Max
Hi Max,
Hi Peter,
On Tuesday 03 June 2008, Max Krasnyansky wrote:
> Ingo, I just wanted to elaborate on what Peter is saying. That CPU will still
> have to be _booted_ properly. It may be used for hard- and soft- interrupt
> processing, workqueues (internal kernel queuing mechanism) and kernel timers.
Oh! Didn't know that user process scheduling is so much
> In your particular case you're much much much better off with doing
> echo 0 > /sys/devices/system/cpu/cpuN/online
> either during initrd stage or as a first init script.
> That way bad cpu will be _completely_ disabled.
The initrd is from the distribution. I have no sane way to change it
quickly and permanently. Can I change the initrd and still have a certified
RHEL or SLES? Are there initrd hooks which survive package installation?
I would really appreciate some way to keep the kernel from using
a CPU at all to do fault isolation. If possible not even booting it.
Bootparameters survived all distro fiddling so far. I love them!
Try to convince a hardware vendor, that you don't have a software bug.
Try to convince him that you didn't break the hardware by swapping it around.
So I'll ACK removing isolcpus, if we get a better replacement boot option.
Best Regards
Ingo Oeser
Ingo Oeser wrote:
> Hi Max,
> Hi Peter,
>
> On Tuesday 03 June 2008, Max Krasnyansky wrote:
>> Ingo, I just wanted to elaborate on what Peter is saying. That CPU will still
>> have to be _booted_ properly. It may be used for hard- and soft- interrupt
>> processing, workqueues (internal kernel queuing mechanism) and kernel timers.
>
> Oh! Didn't know that user process scheduling is so much
Not sure what you meant here. Stuff that I listed has nothing to do with user
process scheduling.
>> In your particular case you're much much much better off with doing
>> echo 0 > /sys/devices/system/cpu/cpuN/online
>> either during initrd stage or as a first init script.
>> That way bad cpu will be _completely_ disabled.
>
> The initrd is from the distribution. I have no sane way to change it
> quickly and permanently. Can I change the initrd and still have a certified
> RHEL or SLES? Are there initrd hooks which survive package installation?
That's why I mentioned a "first init" script. You can create a simple
init.d-compliant script that runs with priority 0 (see /etc/init.d/network for
an example). That should be early enough.
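A minimal sketch of such a script (the script name, the chkconfig priorities
and the cpu number are just placeholders):
#!/bin/sh
# /etc/init.d/offline-bad-cpu
# chkconfig: 12345 0 99
# description: take the faulty cpu offline before anything else starts
case "$1" in
start)
    echo 0 > /sys/devices/system/cpu/cpu3/online
    ;;
esac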
> I would really appreciate some way to keep the kernel from using
> a CPU at all to do fault isolation. If possible not even booting it.
How does the isolcpu= boot option help in this case?
I suppose the closest option is maxcpus=. We can probably add ignorecpus= or
something to handle your use case, but it has nothing to do with isolcpus=.
> Bootparameters survived all distro fiddling so far. I love them!
So do custom init.d scripts.
>
> Try to convince a hardware vendor, that you don't have a software bug.
> Try to convince him that you didn't break the hardware by swapping it around.
>
> So I'll ACK removing isolcpus, if we get a better replacement boot option.
I think you're missing the point here. It's like saying
"Let's not switch to electric cars because I use gasoline to kill weeds".
As I mentioned before, cpus listed in the isolcpus= boot option will still
handle hard-/soft- irqs, kernel work, kernel timers. You are much better off
using cpu hotplug (ie putting bad cpu offline). Feel free to propose
ignorecpus= option in a separate thread.
Max
On Tuesday 03 June 2008 08:45, Peter Zijlstra wrote:
> On Tue, 2008-06-03 at 00:35 +0200, Ingo Oeser wrote:
> > Hi Paul,
> >
> > in short: NAK!
> >
> > On Monday 02 June 2008, Paul Jackson wrote:
> > > (Aside to the RealTime folks -- is there a 'realtime'
> > > email list which I should include in this discussion?)
> > >
> > > The kernel has a "isolcpus=" kernel boot time parameter. This
> > > parameter isolates CPUs from scheduler load balancing, minimizing the
> > > impact of scheduler latencies on realtime tasks running on those CPUs.
> >
> > I used it to mask out a defective CPU on an 8-CPU node of an
> > HPC cluster at a customer site, until $BIG_VENDOR
> > sent a replacement. And to prove to $BIG_VENDOR that we actually
> > have a problem on THAT CPU.
> >
> > So I would really like to keep this fault isolation capability.
> > I made my customer happy with that.
> >
> > I wish Linux had more such "mask out bad hardware" features
> > to facilitate fault isolation at boot and runtime.
>
> Yeah - except that it's not meant to be used as such - it will still
> bring the cpu up, and it is still usable for the OS.
>
> So sorry, your abuse doesn't make for a case to keep this abomination.
How come it is an abomination? It is an easy way to do what it does,
and it's actually not a bad thing for some uses not to have to use
cpusets.
Given that it's all __init code anyway, is there a real reason _to_
remove it?
On Mon, Jun 02, 2008 at 02:59:34PM -0700, Max Krasnyansky wrote:
> Yes it used to be somewhat unstable. These days it's solid. I'm using it on a
> wide range of systems: uTCA Core2Duo, NUMA dual-Opteron, 8way Core2, etc. And
> things work as expected.
Max,
I tried the following scenario on an ia64 Altix running 2.6.26-rc4 with cpusets compiled in but cpuset fs unmounted. Do your patches already address this?
$ taskset -cp 3 $$ (attach to cpu 3)
pid 4591's current affinity list: 0-3
pid 4591's new affinity list: 3
$ echo 0 > /sys/devices/system/cpu/cpu2/online (down cpu 2)
(above command hangs)
Backtrace of pid 4591 (bash)
Call Trace:
[<a00000010078e990>] schedule+0x1210/0x13c0
sp=e0000060b6dffc90 bsp=e0000060b6df11e0
[<a00000010078ef60>] schedule_timeout+0x40/0x180
sp=e0000060b6dffce0 bsp=e0000060b6df11b0
[<a00000010078d3e0>] wait_for_common+0x240/0x3c0
sp=e0000060b6dffd10 bsp=e0000060b6df1180
[<a00000010078d760>] wait_for_completion+0x40/0x60
sp=e0000060b6dffd40 bsp=e0000060b6df1160
[<a000000100114ee0>] __stop_machine_run+0x120/0x160
sp=e0000060b6dffd40 bsp=e0000060b6df1120
[<a000000100765ae0>] _cpu_down+0x2a0/0x600
sp=e0000060b6dffd80 bsp=e0000060b6df10c8
[<a000000100765ea0>] cpu_down+0x60/0xa0
sp=e0000060b6dffe20 bsp=e0000060b6df10a0
[<a000000100768090>] store_online+0x50/0xe0
sp=e0000060b6dffe20 bsp=e0000060b6df1070
[<a0000001004f8800>] sysdev_store+0x60/0xa0
sp=e0000060b6dffe20 bsp=e0000060b6df1038
[<a00000010022e370>] sysfs_write_file+0x250/0x300
sp=e0000060b6dffe20 bsp=e0000060b6df0fe0
[<a00000010018a750>] vfs_write+0x1b0/0x300
sp=e0000060b6dffe20 bsp=e0000060b6df0f90
[<a00000010018b350>] sys_write+0x70/0xe0
sp=e0000060b6dffe20 bsp=e0000060b6df0f18
[<a00000010000af80>] ia64_ret_from_syscall+0x0/0x20
sp=e0000060b6dffe30 bsp=e0000060b6df0f18
[<a000000000010720>] ia64_ivt+0xffffffff00010720/0x400
sp=e0000060b6e00000 bsp=e0000060b6df0f18
I also got this to hang after doing this:
- taskset -cp 3 $$ (attach to cpu 3)
- echo 0 > /sys/devices/system/cpu/cpu2/online (down cpu 2, successful this time)
- echo 1 > /sys/devices/system/cpu/cpu2/online (up cpu 2, successful)
- taskset -p $$ (read cpumask, this command hangs)
Traceback here was:
Backtrace of pid 4653 (bash)
Call Trace:
[<a00000010078e990>] schedule+0x1210/0x13c0
sp=e0000060b78afab0 bsp=e0000060b78a1320
[<a00000010078ef60>] schedule_timeout+0x40/0x180
sp=e0000060b78afb00 bsp=e0000060b78a12f0
[<a00000010078d3e0>] wait_for_common+0x240/0x3c0
sp=e0000060b78afb30 bsp=e0000060b78a12c0
[<a00000010078d760>] wait_for_completion+0x40/0x60
sp=e0000060b78afb60 bsp=e0000060b78a12a0
[<a0000001000a63b0>] set_cpus_allowed_ptr+0x210/0x2a0
sp=e0000060b78afb60 bsp=e0000060b78a1270
[<a000000100786930>] cache_add_dev+0x970/0xbc0
sp=e0000060b78afbb0 bsp=e0000060b78a11d0
[<a000000100786c20>] cache_cpu_callback+0xa0/0x1e0
sp=e0000060b78afe10 bsp=e0000060b78a1190
[<a0000001000e96b0>] notifier_call_chain+0x50/0xe0
sp=e0000060b78afe10 bsp=e0000060b78a1148
[<a0000001000e9900>] __raw_notifier_call_chain+0x40/0x60
sp=e0000060b78afe10 bsp=e0000060b78a1108
[<a0000001000e9960>] raw_notifier_call_chain+0x40/0x60
sp=e0000060b78afe10 bsp=e0000060b78a10d8
[<a00000010078ba70>] cpu_up+0x2d0/0x380
sp=e0000060b78afe10 bsp=e0000060b78a10a0
[<a0000001007680b0>] store_online+0x70/0xe0
sp=e0000060b78afe20 bsp=e0000060b78a1070
[<a0000001004f8800>] sysdev_store+0x60/0xa0
sp=e0000060b78afe20 bsp=e0000060b78a1038
[<a00000010022e370>] sysfs_write_file+0x250/0x300
sp=e0000060b78afe20 bsp=e0000060b78a0fe0
[<a00000010018a750>] vfs_write+0x1b0/0x300
sp=e0000060b78afe20 bsp=e0000060b78a0f90
[<a00000010018b350>] sys_write+0x70/0xe0
sp=e0000060b78afe20 bsp=e0000060b78a0f18
[<a00000010000af80>] ia64_ret_from_syscall+0x0/0x20
sp=e0000060b78afe30 bsp=e0000060b78a0f18
[<a000000000010720>] ia64_ivt+0xffffffff00010720/0x400
sp=e0000060b78b0000 bsp=e0000060b78a0f18
Dimitri Sivanich wrote:
> On Mon, Jun 02, 2008 at 02:59:34PM -0700, Max Krasnyansky wrote:
>> Yes it used to be somewhat unstable. These days it's solid. I'm using it on a
>> wide range of systems: uTCA Core2Duo, NUMA dual-Opteron, 8way Core2, etc. And
>> things work as expected.
>
> Max,
>
> I tried the following scenario on an ia64 Altix running 2.6.26-rc4 with
> cpusets compiled in but cpuset fs unmounted. Do your patches already address this?
Nope. My patch was a trivial fix for not destroying scheduler domains on
hotplug events. The problem you're seeing is different.
I'm not an expert in the cpu hotplug internal machinery, especially on ia64. Recent
kernels (.22 and up) I've tried on x86 and x86-64 have no issues with cpu
hotplug. You probably want to submit a bug report (in a separate thread); maybe
it's a regression in the latest .26-rc series.
Max
>> I would really appreciate some way to keep the kernel from using
>> a CPU at all to do fault isolation. If possible not even booting it.
> How does the isolcpu= boot option help in this case?
> I suppose the closest option is maxcpus=. We can probably add ignorecpus= or
> something to handle your use case, but it has nothing to do with isolcpus=.
btw Ingo, I just realized that maxcpu= option is exactly what you need.
Here is how you can use it.
Boot your system with maxcpus=1. That way the kernel will only bring up
processor 0. I'm assuming cpu0 is "good" otherwise your system is totally
busted :). Other cpus will stay off-line and will not be initialized.
Then once the system boots you can selectively bring "good" processors online
by doing
echo 1 > /sys/devices/system/cpu/cpuN/online
This actually solves the case you're talking about (ie ignoring bad
processors) instead of partially covering it with isolcpus=.
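For example (the cpu numbers are placeholders, assuming cpu2 is the bad one
on a 4-way box), boot with maxcpus=1 and then:
for cpu in 1 3; do
    echo 1 > /sys/devices/system/cpu/cpu$cpu/online
done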
Dimitri, you can probably use that too. ie Boot the thing with most CPUs
offline and then bring them online. That way you'll know for sure that no
timers, works, hard-/soft-irqs, etc are running on them.
So I expect two ACKs for isolcpu= removal from both of you, in bold please :)
Max
Max wrote:
> So I expect two ACKs for isolcpu= removal from both of you, in bold please :)
Not from me, anyway.
I've seen enough replies (thanks!) from users of isolcpus=
to be quite certain that we should not just remove it outright.
I will NAQ such proposals.
And until, and unless, someone comes up with a persuasive answer
to Nick's question:
> is there a real reason _to_ remove it?
I'll probably NAQ proposals to deprecate it as well.
Max ... I think one place where you and I disagree is on whether
it is a good idea to have multiple ways to accomplish the same
thing.
As Ingo Oeser pointed out:
> The initrd is from the distribution. I have no sane way to change it
Even if you do find a way that seems sane to you, that's not the point,
in my view. Further, given the constraints on producing a product that
will fit in with multiple distributions, I doubt that the alternatives
you suggest to Ingo Oeser would work well for him anyway.
A key reason that Linux has succeeded is that it actively seeks to work
for a variety of people, purposes and products. One operating system is
now a strong player in the embedded market, the real time market, and
the High Performance Computing market, as well as being an important
player in a variety of other markets. That's a rather stunning success.
If you went to your local grocery store with your (if you have one)
young child, and found that they had no Lucky Charms breakfast cereal
(your child's favorite), you would not be pleased if the store manager
tried to sell you Fruit Loops instead ... just as much sugar and food
coloring.
If we have features that seem to duplicate functionality, in a
different way, and that aren't causing us substantial grief to
maintain, and that aren't significantly hurting our performance or
robustness or security or seriously getting in the way of further
development, then we usually leave those features in.
Please understand, Max, that for every kernel hacker working in this
corner of the Linux kernel, there are a hundred or a thousand users
depending on what we do, and who will have to adapt to any incompatible
changes we make. If we save ourselves an hour by removing "unnecessary"
features, we can cost a hundred others each some time adapting to this
change. A few of those others may get hit for substantial effort, if
the change catches them unawares at the wrong time and place.
As good citizens of the universe, we should seek to optimize the
aggregate effort we spend to obtain a particular level of quality and
functionality.
Saving yourself an hour while you cost a hundred others ten minutes
each is not a net gain. Sometimes this means not enforcing a "one way,
and one way only, to do any given task." I wouldn't go as far as Perl
does in this regard, but we do run a more polyglot product than say
Python.
Try thinking a little more like a WalMart product manager than a
Ferrari designer. If it is currently selling to our customers, and if
we can fit it into our supply and distribution chain, and if we can
continue to make an adequate profit per foot of shelf space, then
continue to buy it, stock it, ship it, and sell it.
Thanks.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214
On Wednesday 04 June 2008 09:47, Max Krasnyanskiy wrote:
> >> I would really appreciate some way to keep the kernel from using
> >> a CPU at all to do fault isolation. If possible not even booting it.
> >
> > How does the isolcpu= boot option help in this case?
> > I suppose the closest option is maxcpus=. We can probably add ignorecpus=
> > or something to handle your use case, but it has nothing to do with
> > isolcpus=.
>
> btw Ingo, I just realized that maxcpu= option is exactly what you need.
> Here is how you can use it.
> Boot your system with maxcpus=1. That way the kernel will only bring up
> processor 0. I'm assuming cpu0 is "good" otherwise your system is totally
> busted :). Other cpus will stay off-line and will not be initialized.
> Then once the system boots you can selectively bring "good" processors
> online by doing
> echo 1 > /sys/devices/system/cpu/cpuN/online
>
> This actually solves the case you're talking about (ie ignoring bad
> processors) instead of partially covering it with isolcpus=.
For your case, that's probably the best way to solve it, yes.
> Dimitri, you can probably use that too. ie Boot the thing with most CPUs
> offline and then bring them online. That way you'll know for sure that no
> timers, works, hard-/soft-irqs, etc are running on them.
When you bring a CPU online, in theory the sched domains should get
set up for it, so you should start seeing processes get migrated
onto it, and with them timers, work queues, etc.
If you have irqbalanced running, it probably migrates irqs onto them
as well if it needs to.
Nick Piggin wrote:
> On Wednesday 04 June 2008 09:47, Max Krasnyanskiy wrote:
>> Dimitri, you can probably use that too. ie Boot the thing with most CPUs
>> offline and then bring them online. That way you'll know for sure that no
>> timers, works, hard-/soft-irqs, etc are running on them.
>
> When you bring a CPU online, in theory the sched domains should get
> set up for it, so you should start seeing processes get migrated
> onto it, and with them timers, work queues, etc.
>
> If you have irqbalanced running, it probably migrates irqs onto them
> as well if it needs to.
Sure. My suggestion assumes that system wide balancing is disabled (via top
level cpuset) and that IRQ affinity masks are properly setup. I mentioned that
in my earlier emails.
Max
Paul Jackson wrote:
> Max wrote:
>> So I expect two ACKs for isolcpu= removal from both of you, in bold please :)
>
> Not from me, anyway.
>
> I've seen enough replies (thanks!) from users of isolcpus=
> to be quite certain that we should not just remove it outright.
We've seen exactly two replies with usage examples. Dimitri's case is legit
but can be handled much better (as in it not only avoids timers, but any other
kernel work) with cpu hotplug and cpusets. Ingo's case is bogus because it
does not actually do what he needs. There is a much better way to do exactly
what he needs which involves only cpu hotplug and has nothing to do with the
scheduler and such.
> I will NAQ such proposals.
>
> And until, and unless, someone comes up with a persuasive answer
> to Nick's question:
>
>> is there a real reason _to_ remove it?
>
> I'll probably NAQ proposals to deprecate it as well.
>
> Max ... I think one place where you and I disagree is on whether
> it is a good idea to have multiple ways to accomplish the same
> thing.
Not really. I thought that the two ways that we have are conflicting.
I just looked at the partition_sched_domains() code again and realized that
there is no conflict (cpuset settings override isolcpus=). I was wrong on that.
So I guess there is no reason to nuke it other than
"oh, but it was a hack" :)
> As Ingo Oeser pointed out:
>> The initrd is from the distribution. I have no sane way to change it
>
> Even if you do find a way that seems sane to you, that's not the point,
> in my view. Further, given the constraints on producing a product that
> will fit in with multiple distributions, I doubt that the alternatives
> you suggest to Ingo Oeser would work well for him anyway.
> A key reason that Linux has succeeded is that it actively seeks to work
> for a variety of people, purposes and products. One operating system is
> now a strong player in the embedded market, the real time market, and
> the High Performance Computing market, as well as being an important
> player in a variety of other markets. That's a rather stunning success.
>
> If you went to your local grocery store with your (if you have one)
> young child, and found that they had no Lucky Charms breakfast cereal
> (your child's favorite), you would not be pleased if the store manager
> tried to sell you Fruit Loops instead ... just as much sugar and food
> coloring.
>
> If we have features that seem to duplicate functionality, in a
> different way, and that aren't causing us substantial grief to
> maintain, and that aren't significantly hurting our performance or
> robustness or security or seriously getting in the way of further
> development, then we usually leave those features in.
>
> Please understand, Max, that for every kernel hacker working in this
> corner of the Linux kernel, there are a hundred or a thousand users
> depending on what we do, and who will have to adapt to any incompatible
> changes we make. If we save ourselves an hour by removing "unnecessary"
> features, we can cost a hundred others each some time adapting to this
> change. A few of those others may get hit for substantial effort, if
> the change catches them unawares at the wrong time and place.
>
> As good citizens of the universe, we should seek to optimize the
> aggregate effort we spend to obtain a particular level of quality and
> functionality.
>
> Saving yourself an hour while you cost a hundred others ten minutes
> each is not a net gain. Sometimes this means not enforcing a "one way,
> and one way only, to do any given task." I wouldn't go as far as Perl
> does in this regard, but we do run a more polyglot product than say
> Python.
>
> Try thinking a little more like a WalMart product manager than a
> Ferrari designer. If it is currently selling to our customers, and if
> we can fit it into our supply and distribution chain, and if we can
> continue to make an adequate profit per foot of shelf space, then
> continue to buy it, stock it, ship it, and sell it.
Ingo's case is a bad example. If you reread his use case more carefully you'll
see that he was not actually getting what he expected out of the boot param in
question.
btw Impressive write up. I do like to think of myself as a Ferrari designer,
actually these days I'm more into http://www.teslamotors.com/ rather than
Ferrari :).
So I agree in general of course. As I mentioned my reasoning was 1) I thought
it conflicts with cpusets and 2) it's considered a hack by the scheduler folks
and is not supported (ie my attempts to extend it were rejected). Given that
there is a better mechanism available it seemed to make sense to nuke it.
Peter Z and Ingo M were of a similar opinion, or so it seemed.
Anyway, I do not mind us keeping the isolcpus= boot option even though the use
cases mentioned so far are not very convincing.
Max
Max wrote:
> Ingo's case is a bad example.
Could be ... I wasn't paying close attention to the details.
If so, a good product marketing manager would first upsell the customer
to the better product, and then let falling sales guide the removal
of the old product.
That is, if you can guide most of the users of "isolcpus=" to a better
solution, in -their- view, so that they voluntary choose to migrate
to the other solution, then you get to deprecate and then remove the
old mechanism.
To the extent that you can show that the old mechanism is costing us
(maintenance, reliability, performance, impeding progress, ...) then
you get to accelerate the deprecation period, even to the point of
an immediate removal of the old feature, if it's of sufficiently little
use and great pain.
We do have one problem with letting "falling sales" guide feature
removal. Unlike Walmart, where they know what has sold where before
the customer has even left the store, we can't easily track usage of
kernel features. Occasionally, we can stir the pot and get some
feedback, as I've done on this thread, if we have a narrow target
audience that we have good reason to believe is especially interested. But that
only works occasionally.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214
Nick Piggin wrote:
> On Tuesday 03 June 2008 08:45, Peter Zijlstra wrote:
>> On Tue, 2008-06-03 at 00:35 +0200, Ingo Oeser wrote:
>>> Hi Paul,
>>>
>>> in short: NAK!
>>>
>>> On Monday 02 June 2008, Paul Jackson wrote:
>>>> (Aside to the RealTime folks -- is there a 'realtime'
>>>> email list which I should include in this discussion?)
>>>>
>>>> The kernel has a "isolcpus=" kernel boot time parameter. This
>>>> parameter isolates CPUs from scheduler load balancing, minimizing the
>>>> impact of scheduler latencies on realtime tasks running on those CPUs.
>>> I used it to mask out a defective CPU on an 8-CPU node of an
>>> HPC cluster at a customer site, until $BIG_VENDOR
>>> sent a replacement. And to prove to $BIG_VENDOR that we actually
>>> have a problem on THAT CPU.
>>>
>>> So I would really like to keep this fault isolation capability.
>>> I made my customer happy with that.
>>>
>>> I wish Linux had more such "mask out bad hardware" features
>>> to facilitate fault isolation at boot and runtime.
>> Yeah - except that it's not meant to be used as such - it will still
>> bring the cpu up, and it is still usable for the OS.
>>
>> So sorry, your abuse doesn't make for a case to keep this abomination.
>
> How come it is an abomination? It is an easy way to do what it does,
> and it's actually not a bad thing for some uses not to have to use
> cpusets.
>
> Given that it's all __init code anyway, is there a real reason _to_
> remove it?
IMHO,
What is an abomination is that cpusets are required for this type of
isolation to begin with, even on a 2-processor machine.
I would like the option to stay and be extended like Max originally
proposed. If cpusets/hotplug are configured isolation would be obtained
using them. If not then isolcpus could be used to get the same isolation.
From a user land point of view, I just want an easy way to fully
isolate a particular cpu. Even a new syscall or extension to
sched_setaffinity would make me happy. Cpusets and hotplug don't.
Again this is just MHO.
Regards
Mark
Max Krasnyansky <[email protected]> writes:
> We've seen exactly two replies with usage examples. Dimitri's case is legit
> but can be handled much better (as in it not only avoids timers, but any other
> kernel work) with cpu hotplug and cpusets. Ingo's case is bogus because it
> does not actually do what he needs. There is a much better way to do exactly
> what he needs which involves only cpu hotplug and has nothing to do with the
> scheduler and such.
One example I've seen in the past is that someone wanted to isolate a node
completely from any memory traffic to avoid performance disturbance
for memory intensive workloads.
Right now the system boot could put pages from some daemon in there before any
cpusets are set up and there's no easy way to get them away again
(short of migratepages for all running pids, but that's pretty ugly and won't
cover kernel level allocations and also can mess up locality).
Granted, the use case really wants more of an "isolnodes", but given that there
tends to be enough free memory at boot, "isolcpus" tended to work.
-Andi
On Tue, Jun 03, 2008 at 09:40:10AM -0500, Dimitri Sivanich wrote:
> I tried the following scenario on an ia64 Altix running 2.6.26-rc4 with cpusets compiled in but cpuset fs unmounted. Do your patches already address this?
>
> $ taskset -cp 3 $$ (attach to cpu 3)
> pid 4591's current affinity list: 0-3
> pid 4591's new affinity list: 3
> $ echo 0 > /sys/devices/system/cpu/cpu2/online (down cpu 2)
> (above command hangs)
>
> Backtrace of pid 4591 (bash)
>
> Call Trace:
> [<a00000010078e990>] schedule+0x1210/0x13c0
> sp=e0000060b6dffc90 bsp=e0000060b6df11e0
> [<a00000010078ef60>] schedule_timeout+0x40/0x180
> sp=e0000060b6dffce0 bsp=e0000060b6df11b0
> [<a00000010078d3e0>] wait_for_common+0x240/0x3c0
> sp=e0000060b6dffd10 bsp=e0000060b6df1180
> [<a00000010078d760>] wait_for_completion+0x40/0x60
> sp=e0000060b6dffd40 bsp=e0000060b6df1160
> [<a000000100114ee0>] __stop_machine_run+0x120/0x160
> sp=e0000060b6dffd40 bsp=e0000060b6df1120
> [<a000000100765ae0>] _cpu_down+0x2a0/0x600
> sp=e0000060b6dffd80 bsp=e0000060b6df10c8
> [<a000000100765ea0>] cpu_down+0x60/0xa0
> sp=e0000060b6dffe20 bsp=e0000060b6df10a0
> [<a000000100768090>] store_online+0x50/0xe0
> sp=e0000060b6dffe20 bsp=e0000060b6df1070
> [<a0000001004f8800>] sysdev_store+0x60/0xa0
> sp=e0000060b6dffe20 bsp=e0000060b6df1038
> [<a00000010022e370>] sysfs_write_file+0x250/0x300
> sp=e0000060b6dffe20 bsp=e0000060b6df0fe0
> [<a00000010018a750>] vfs_write+0x1b0/0x300
> sp=e0000060b6dffe20 bsp=e0000060b6df0f90
> [<a00000010018b350>] sys_write+0x70/0xe0
> sp=e0000060b6dffe20 bsp=e0000060b6df0f18
> [<a00000010000af80>] ia64_ret_from_syscall+0x0/0x20
> sp=e0000060b6dffe30 bsp=e0000060b6df0f18
> [<a000000000010720>] ia64_ivt+0xffffffff00010720/0x400
> sp=e0000060b6e00000 bsp=e0000060b6df0f18
The following workaround alleviates the symptom and hopefully is a hint as to the solution:
echo -1 > /proc/sys/kernel/sched_rt_runtime_us
Hi Max,
On Wednesday 04 June 2008, Max Krasnyanskiy wrote:
> btw Ingo, I just realized that maxcpu= option is exactly what you need.
> Here is how you can use it.
> Boot your system with maxcpus=1. That way the kernel will only bring up
> processor 0. I'm assuming cpu0 is "good" otherwise your system is totally
> busted :). Other cpus will stay off-line and will not be initialized.
> Then once the system boots you can selectively bring "good" processors online
> by doing
> echo 1 > /sys/devices/system/cpu/cpuN/online
>
I just tested it on the Ubuntu Hardy standard kernel on an AMD DualCore.
The /sys/devices/system/cpu/cpu1 entry doesn't show up.
I can send you the dmesg/config in private, if you want.
Did you test your suggestion?
After our discussion, I tried to find the right spot to implement
"disablecpus=" myself, but couldn't find the right position to hook it up.
If your idea works out, I will ACK it. Even in bold :-)
Best Regards
Ingo Oeser
Mark wrote:
> What is an abomination is that cpusets are required for this type of
> isolation to begin with, even on a 2-processor machine.
Just to be sure I'm following you here, you're stating that you
want to be able to manipulate the isolated cpu map at runtime,
not just with the boot option isolcpus, right? Where this
isolated cpu map works just fine even on systems which do
not have cpusets configured, right?
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214
Andi wrote:
> Right now the system boot could put pages from some daemon in there before any
> cpusets are set up and there's no easy way to get them away again
We (SGI) routinely handle that need with a custom init program,
invoked with the init= parameter to the booting kernel, which
sets up cpusets and then invokes the normal (real) init program
in a cpuset configured to exclude those CPUs and nodes which we
want to remain unloaded. For example, on a 256 CPU, 64 node
system, we might have init running on a single node of 4 CPUs,
and leave the remaining 63 nodes and 252 CPUs isolated from all
the usual user level daemons started by init.
There is no need for additional kernel changes to accomplish this.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214
Ingo Oeser wrote:
> Hi Max,
>
> On Wednesday 04 June 2008, Max Krasnyanskiy wrote:
>> btw Ingo, I just realized that maxcpu= option is exactly what you need.
>> Here is how you can use it.
>> Boot your system with maxcpus=1. That way the kernel will only bring up
>> processor 0. I'm assuming cpu0 is "good" otherwise your system is totally
>> busted :). Other cpus will stay off-line and will not be initialized.
>> Then once the system boots you can selectively bring "good" processors online
>> by doing
>> echo 1 > /sys/devices/system/cpu/cpuN/online
>>
>
> I just tested it on the Ubuntu Hardy standard kernel on an AMD DualCore.
> The /sys/devices/system/cpu/cpu1 entry doesn't show up.
>
> I can send you the dmesg/config in private, if you want.
>
> Did you test your suggestion?
No, I just give random suggestions without verifying them first ;-).
Of course I tried it. Sounds like your Ubuntu kernel does not have "CPU
hotplug / suspend on SMP" enabled. I thought that all distributions enable it
by default these days.
Max
Peter, Ingo,
Take a look at the report below (came up during isolcpu= remove discussions).
It looks like stop_machine threads are getting forcefully preempted because
they exceed their RT quanta. It's strange because rt period is pretty long.
But given that disabling rt period logic solves the issue the machine was not
really stuck.
Max
Dimitri Sivanich wrote:
> On Tue, Jun 03, 2008 at 09:40:10AM -0500, Dimitri Sivanich wrote:
>> I tried the following scenario on an ia64 Altix running 2.6.26-rc4 with cpusets compiled in but cpuset fs unmounted. Do your patches already address this?
>>
>> $ taskset -cp 3 $$ (attach to cpu 3)
>> pid 4591's current affinity list: 0-3
>> pid 4591's new affinity list: 3
>> $ echo 0 > /sys/devices/system/cpu/cpu2/online (down cpu 2)
>> (above command hangs)
>>
>> Backtrace of pid 4591 (bash)
>>
>> Call Trace:
>> [<a00000010078e990>] schedule+0x1210/0x13c0
>> sp=e0000060b6dffc90 bsp=e0000060b6df11e0
>> [<a00000010078ef60>] schedule_timeout+0x40/0x180
>> sp=e0000060b6dffce0 bsp=e0000060b6df11b0
>> [<a00000010078d3e0>] wait_for_common+0x240/0x3c0
>> sp=e0000060b6dffd10 bsp=e0000060b6df1180
>> [<a00000010078d760>] wait_for_completion+0x40/0x60
>> sp=e0000060b6dffd40 bsp=e0000060b6df1160
>> [<a000000100114ee0>] __stop_machine_run+0x120/0x160
>> sp=e0000060b6dffd40 bsp=e0000060b6df1120
>> [<a000000100765ae0>] _cpu_down+0x2a0/0x600
>> sp=e0000060b6dffd80 bsp=e0000060b6df10c8
>> [<a000000100765ea0>] cpu_down+0x60/0xa0
>> sp=e0000060b6dffe20 bsp=e0000060b6df10a0
>> [<a000000100768090>] store_online+0x50/0xe0
>> sp=e0000060b6dffe20 bsp=e0000060b6df1070
>> [<a0000001004f8800>] sysdev_store+0x60/0xa0
>> sp=e0000060b6dffe20 bsp=e0000060b6df1038
>> [<a00000010022e370>] sysfs_write_file+0x250/0x300
>> sp=e0000060b6dffe20 bsp=e0000060b6df0fe0
>> [<a00000010018a750>] vfs_write+0x1b0/0x300
>> sp=e0000060b6dffe20 bsp=e0000060b6df0f90
>> [<a00000010018b350>] sys_write+0x70/0xe0
>> sp=e0000060b6dffe20 bsp=e0000060b6df0f18
>> [<a00000010000af80>] ia64_ret_from_syscall+0x0/0x20
>> sp=e0000060b6dffe30 bsp=e0000060b6df0f18
>> [<a000000000010720>] ia64_ivt+0xffffffff00010720/0x400
>> sp=e0000060b6e00000 bsp=e0000060b6df0f18
>
> The following workaround alleviates the symptom and hopefully is a hint as to the solution:
> echo -1 > /proc/sys/kernel/sched_rt_runtime_us
On Wed, 2008-06-04 at 11:07 -0700, Max Krasnyansky wrote:
> Peter, Ingo,
>
> Take a look at the report below (came up during isolcpu= remove discussions).
>
> It looks like stop_machine threads are getting forcefully preempted because
> they exceed their RT quanta. It's strange because rt period is pretty long.
> But given that disabling rt period logic solves the issue the machine was not
> really stuck.
Yeah, I know, I'm already looking at this
Peter Zijlstra wrote:
> On Wed, 2008-06-04 at 11:07 -0700, Max Krasnyansky wrote:
>> Peter, Ingo,
>>
>> Take a look at the report below (came up during isolcpu= remove discussions).
>>
>> It looks like stop_machine threads are getting forcefully preempted because
>> they exceed their RT quanta. It's strange because rt period is pretty long.
>> But given that disabling rt period logic solves the issue the machine was not
>> really stuck.
>
> Yeah, I know, I'm already looking at this
I see. Does it look like a bug in the rt period logic ?
Or did the stop_machine thread really run for a long time (in the report that
you got that is) ?
Max
Paul Jackson wrote:
> Andi wrote:
>> Right now the system boot could put pages from some daemon in there before any
>> cpusets are set up and there's no easy way to get them away again
>
> We (SGI) routinely handle that need with a custom init program,
> invoked with the init= parameter to the booting kernel, which
> sets up cpusets and then invokes the normal (real) init program
> in a cpuset configured to exclude those CPUs and nodes which we
> want to remain unloaded. For example, on a 256 CPU, 64 node
> system, we might have init running on a single node of 4 CPUs,
> and leave the remaining 63 nodes and 252 CPUs isolated from all
> the usual user level daemons started by init.
>
> There is no need for additional kernel changes to accomplish this.
You do not even need to replace /sbin/init for this, no ?
Simply installing custom
/etc/init.d/create_cpusets
with priority 0
# chkconfig: 12345 0 99
will do the job.
That script will move init itself into the appropriate cpuset and from then on
everything will inherit it.
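A rough sketch of what I mean (the cpuset name, cpu/mem numbers and mount
point are placeholders, and the exact file names vary a bit between kernel
versions):
#!/bin/sh
# /etc/init.d/create_cpusets
# chkconfig: 12345 0 99
mkdir -p /dev/cpuset
mount -t cgroup -o cpuset cpuset /dev/cpuset
# confine init (and everything it spawns from now on) to cpus 0-3, node 0
mkdir /dev/cpuset/boot
echo 0-3 > /dev/cpuset/boot/cpuset.cpus
echo 0 > /dev/cpuset/boot/cpuset.mems
echo 1 > /dev/cpuset/boot/tasks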
Max
On Wed, 2008-06-04 at 11:24 -0700, Max Krasnyansky wrote:
>
> Peter Zijlstra wrote:
> > On Wed, 2008-06-04 at 11:07 -0700, Max Krasnyansky wrote:
> >> Peter, Ingo,
> >>
> >> Take a look at the report below (came up during isolcpu= remove discussions).
> >>
> >> It looks like stop_machine threads are getting forcefully preempted because
> >> they exceed their RT quanta. It's strange because rt period is pretty long.
> >> But given that disabling rt period logic solves the issue the machine was not
> >> really stuck.
> >
> > Yeah, I know, I'm already looking at this
>
> I see. Does it look like a bug in the rt period logic ?
> Or did the stop_machine thread really run for a long time (in the report that
> you got that is) ?
looks like a fun race between refreshing the period and updating
cpu_online_map.
On Wed, 2008-06-04 at 11:29 -0700, Max Krasnyansky wrote:
>
> Paul Jackson wrote:
> > Andi wrote:
> >> Right now the system boot could put pages from some daemon in there before any
> >> cpusets are set up and there's no easy way to get them away again
> >
> > We (SGI) routinely handle that need with a custom init program,
> > invoked with the init= parameter to the booting kernel, which
> > sets up cpusets and then invokes the normal (real) init program
> > in a cpuset configured to exclude those CPUs and nodes which we
> > want to remain unloaded. For example, on a 256 CPU, 64 node
> > system, we might have init running on a single node of 4 CPUs,
> > and leave the remaining 63 nodes and 252 CPUs isolated from all
> > the usual user level daemons started by init.
> >
> > There is no need for additional kernel changes to accomplish this.
>
> You do not even need to replace /sbin/init for this, no ?
> Simply installing custom
> /etc/init.d/create_cpusets
> with priority 0
> # chkconfig: 12345 0 99
> will do the job.
>
> That script will move init itself into the appropriate cpuset and from then on
> everything will inherit it.
The advantage of using a replacement /sbin/init is that you execute
before the rest of userspace, unlike what you propose.
Max wrote:
> You do not even need to replace /sbin/init for this, no ?
> Simply installing custom
> /etc/init.d/create_cpusets
That can ensure that the daemons that init starts later
on are placed, but it doesn't ensure that the glibc pages that
init (and the shell it spawned to run 'create_cpusets') pulled in
are placed.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214
Mark Hounschell wrote:
> IMHO,
>
> What is an abomination is that cpusets are required for this type of
> isolation to begin with, even on a 2-processor machine.
>
> I would like the option to stay and be extended like Max originally
> proposed. If cpusets/hotplug are configured isolation would be obtained
> using them. If not then isolcpus could be used to get the same isolation.
>
> From a user land point of view, I just want an easy way to fully isolate
> a particular cpu. Even a new syscall or extension to sched_setaffinity
> would make me happy. Cpusets and hotplug don't.
>
> Again this is just MHO.
Mark, I used to be the same way and I'm a convert now. It does seem like
overkill for a 2-cpu machine to have cpusets and cpu hotplug. But both options
cost around 50KB worth of text and maybe another 10KB of data. That's on an
x86-64 box. Let's say it's 100KB. Not a terribly huge overhead.
Now if you think about it, in order to be able to dynamically isolate a cpu we
have to do the exact same thing that CPU hotplug does, which is to clear all
timers, kernel threads, etc from that CPU. It does not make sense to implement
separate logic for that. You could argue that you do not need dynamic
isolation, but static isolation is too inflexible in general: even on 2-way
machines it's a waste not to be able to use the second cpu for general load
when the RT app is not running. Given that CPU hotplug is necessary for many
things, including suspend on multi-cpu machines, it's practically guaranteed
to be very stable and well supported. In other words we have a perfect synergy
here :).
Now, about the cpusets. You do not really have to do anything fancy with them.
If all you want to do is disable system-wide load balancing:
mount -t cgroup -o cpuset cpuset /dev/cpuset
echo 0 > /dev/cpuset/cpuset.sched_load_balance
That's it. You get _exactly_ the same effect as with isolcpus=. And you can
change that dynamically, and when you switch to quad- and eight-core machines
then you'll be able to do that with groups of cpus, not just system wide.
Just to complete the example above, let's say you want to isolate cpu2
(assuming that cpusets are already mounted).
# Bring cpu2 offline
echo 0 > /sys/devices/system/cpu/cpu2/online
# Disable system wide load balancing
echo 0 > /dev/cpuset/cpuset.sched_load_balance
# Bring cpu2 back online
echo 1 > /sys/devices/system/cpu/cpu2/online
Now if you want to un-isolate cpu2 you do
# Re-enable system wide load balancing
echo 1 > /dev/cpuset/cpuset.sched_load_balance
Of course this is not complete isolation. There are also irqs (see my
"default irq affinity" patch), workqueues and the stop machine. I'm working on
those too and will release a .25-based cpuisol tree when I'm done.
Max
Paul Jackson wrote:
> Max wrote:
>> You do not even need to replace /sbin/init for this, no ?
>> Simply installing custom
>> /etc/init.d/create_cpusets
>
> That can ensure that the daemons that init starts later
> on are placed, but it doesn't ensure that the glibc pages that
> init (and the shell it spawned to run 'create_cpusets') pulled in
> are placed.
Ah, I missed the memory placement part. As far as cpu placement goes the result
would be equivalent but not for memory.
btw Are you guys using some kind of castrated, statically linked shell to run
that script ? Otherwise regular shell and friends will suck in the same glibc
pages as the regular init would.
Max
Peter Zijlstra wrote:
> On Wed, 2008-06-04 at 11:29 -0700, Max Krasnyansky wrote:
>> Paul Jackson wrote:
>>> Andi wrote:
>>>> Right now the system boot could put pages from some daemon in there before any
>>>> cpusets are set up and there's no easy way to get them away again
>>> We (SGI) routinely handle that need with a custom init program,
>>> invoked with the init= parameter to the booting kernel, which
>>> sets up cpusets and then invokes the normal (real) init program
>>> in a cpuset configured to exclude those CPUs and nodes which we
>>> want to remain unloaded. For example, on a 256 CPU, 64 node
>>> system, we might have init running on a single node of 4 CPUs,
>>> and leave the remaining 63 nodes and 252 CPUs isolated from all
>>> the usual user level daemons started by init.
>>>
>>> There is no need for additional kernel changes to accomplish this.
>> You do not even need to replace /sbin/init for this, no ?
>> Simply installing custom
>> /etc/init.d/create_cpusets
>> with priority 0
>> # chkconfig: 12345 0 99
>> will do the job.
>>
>> That script will move init itself into the appropriate cpuset and from then on
>> everything will inherit it.
>
> The advantage of using a replacement /sbin/init is that you execute
> before the rest of userspace, unlike what you propose.
That does not matter for the cpu placement (ie the end result is the same) but
does matter for memory placement as PaulJ pointed out.
Thanx
Max
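(For reference, a minimal sketch of what such a create_cpusets boot script
could look like -- the cpuset name and the cpu/mem numbers are only examples,
and as discussed above it only fixes CPU placement, not pages that init has
already faulted in:)
#!/bin/sh
# Confine init (pid 1), and therefore everything it starts later,
# to cpus 0-3 and memory node 0.
mkdir -p /dev/cpuset
mount -t cgroup -o cpuset cpuset /dev/cpuset 2>/dev/null
mkdir -p /dev/cpuset/boot
echo 0-3 > /dev/cpuset/boot/cpuset.cpus
echo 0   > /dev/cpuset/boot/cpuset.mems
echo 1   > /dev/cpuset/boot/tasks    # move init itself
echo $$  > /dev/cpuset/boot/tasks    # and this script's shell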
> btw Are you guys using some kind of castrated, statically linked shell to run
> that script ?
It (that init= program) is not a script. It is its own
castrated, statically linked special purpose binary.
Once it has done its duty setting up cpusets, it then
exec's the normal init, confined to the cpuset configured
for it.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214
Paul Jackson wrote:
>> btw Are you guys using some kind of castrated, statically linked shell to run
>> that script ?
>
> It (that init= program) is not a script. It is its own
> castrated, statically linked special purpose binary.
> Once it has done its duty setting up cpusets, it then
> exec's the normal init, confined to the cpuset configured
> for it.
Got it.
Max
> We (SGI) routinely handle that need with a custom init program,
> invoked with the init= parameter to the booting kernel, which
> sets up cpusets and then invokes the normal (real) init program
> in a cpuset configured to exclude those CPUs and nodes which we
> want to remain unloaded. For example, on a 256 CPU, 64 node
> system, we might have init running on a single node of 4 CPUs,
> and leave the remaining 63 nodes and 252 CPUs isolated from all
> the usual user level daemons started by init.
>
> There is no need for additional kernel changes to accomplish this.
There are no additional changes needed, but you must admit that isolcpus
is a much more elegant solution for this problem than hijacking init.
-Andi
> You do not even need to replace /sbin/init for this, no ?
> Simply installing custom
> /etc/init.d/create_cpusets
> with priority 0
> # chkconfig: 12345 0 99
> will do the job.
>
> That script will move init itself into the appropriate cpuset and from then on
> everything will inherit it.
It won't be the parent of the other init scripts.
-Andi
Peter Zijlstra wrote:
> On Wed, 2008-06-04 at 11:24 -0700, Max Krasnyansky wrote:
>> Peter Zijlstra wrote:
>>> On Wed, 2008-06-04 at 11:07 -0700, Max Krasnyansky wrote:
>>>> Peter, Ingo,
>>>>
>>>> Take a look at the report below (came up during isolcpu= remove discussions).
>>>>
>>>> It looks like stop_machine threads are getting forcefully preempted because
>>>> they exceed their RT quanta. It's strange because the rt period is pretty long.
>>>> But given that disabling the rt period logic solves the issue, the machine was
>>>> not really stuck.
>>> Yeah, I know, I'm already looking at this
>> I see. Does it look like a bug in the rt period logic?
>> Or did the stop_machine thread really run for a long time (in the report that
>> you got, that is)?
>
> looks like a fun race between refreshing the period and updating
> cpu_online_map.
Oh, I did not realize that the rt period is a timer that iterates over online
CPUs. I assumed that you did it in the scheduler tick or something.
Max
pj wrote:
> We (SGI) routinely handle that need with a custom init program,
> invoked with the init= parameter to the booting kernel, ...
Andi replied:
> There are no additional changes needed, but you must admit that isolcpus
> is a much more elegant solution for this problem than hijacking init.
While I cannot claim that hijacking init is elegant, our gentle
readers are at risk of losing the context here.
I was responding to a need you noticed to isolate memory nodes (such as
from stray glibc pages placed by init or the shell running early
scripts), not to the need to isolate CPUs:
Andi had written earlier:
> One example I've seen in the past is that someone wanted to isolate a node
> completely from any memory traffic to avoid performance disturbance
> for memory intensive workloads.
Granted, this might be a distinction without a difference, because on
the very lightly loaded system seen at boot, local node memory placement
will pretty much guarantee that the memory is placed on the nodes next
to the CPUs on which init or its inelegant replacements are run.
You noted this yourself, when you wrote:
> The use case really wants more of an "isolnodes", but given that there
> tends to be enough free memory at boot, "isolcpus" tended to work.
So perhaps it boils down to a question of which is easiest to do,
the answer to which will vary depending on where you are in the food
chain of distributions. Here "easy" means least likely to break
something else. All these mechanisms are relatively trivial, until
one has to deal with conflicting software packages, configurations and
distributions, changing out from under oneself.
That is, it can be desirable to have multiple mechanisms, so that the
various folks independently needing to manipulate such placement can
minimize stepping on each other's feet. By using the rarely hacked init=
mechanism for SGI software add-ons, we don't interfere with those who
are using the more common isolcpus= mechanism for such purposes as
offlining a bad CPU.
In sum, I suspect we agree that we have enough mechanisms, and don't
need an isolnodes as well.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214
Andi, replying to Max:
> > That script will move init itself into the appropriate cpuset and from then on
> > everything will inherit it.
>
> It won't be the parent of the other init scripts.
True, but init will be the parent of other init scripts,
and init itself was moved into the appropriate cpuset.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214
On Wed, 2008-06-04 at 12:26 -0700, Max Krasnyansky wrote:
> Mark Hounschell wrote:
> > IMHO,
> >
> > What is an abomination is that cpusets are required for this type of
> > isolation to begin with, even on a 2-processor machine.
> >
> > I would like the option to stay and be extended like Max originally
> > proposed. If cpusets/hotplug are configured isolation would be obtained
> > using them. If not then isolcpus could be used to get the same isolation.
> >
> > From a user land point of view, I just want an easy way to fully isolate
> > a particular cpu. Even a new syscall or extension to sched_setaffinity
> > would make me happy. Cpusets and hotplug don't.
> >
> > Again this is just MHO.
>
> Mark, I used to be the same way and I'm a convert now. It does seem like
> overkill for a 2-CPU machine to have cpusets and CPU hotplug. But both options
> cost around 50KB worth of text and maybe another 10KB of data. That's on an
> x86-64 box. Let's say it's 100KB. Not a terribly huge overhead.
>
> Now if you think about it, in order to be able to dynamically isolate a CPU we
> have to do the exact same thing that CPU hotplug does, which is to clear all
> timers, kernel threads, etc. from that CPU. It does not make sense to
> implement separate logic for that. You could argue that you do not need
> dynamic isolation, but that is too inflexible in general: even on 2-way machines
> it's a waste not to be able to use the second CPU for general load while the RT
> app is not running. Given that CPU hotplug is necessary for many things, including
> suspend on multi-CPU machines, it's practically guaranteed to be very stable
> and well supported. In other words we have a perfect synergy here :).
>
> Now, about the cpusets. You do not really have to do anything fancy with them.
> If all you want to do is to disable system-wide load balancing:
> mount -t cgroup -o cpuset cpuset /dev/cpuset
> echo 0 > /dev/cpuset/cpuset.sched_load_balance
>
> That's it. You get _exactly_ the same effect as with isolcpus=. And you can
> change that dynamically, and when you switch to quad- and eight-core machines
> you'll be able to do that with groups of cpus, not just system wide.
>
> Just to complete the example above, let's say you want to isolate cpu2
> (assuming that cpusets are already mounted).
>
> # Bring cpu2 offline
> echo 0 > /sys/devices/system/cpu/cpu2/online
>
> # Disable system wide load balancing
> echo 0 > /dev/cpuset/cpuset.sched_load_balance
>
> # Bring cpu2 back online
> echo 1 > /sys/devices/system/cpu/cpu2/online
>
> Now if you want to un-isolate cpu2 you do
>
> # Re-enable system wide load balancing
> echo 1 > /dev/cpuset/cpuset.sched_load_balance
>
> Of course this is not a complete isolation. There are also irqs (see my
> "default irq affinity" patch), workqueues and the stop machine. I'm working on
> those too and will release a .25-based cpuisol tree when I'm done.
Furthermore, cpusets allow for isolated but load-balanced RT domains. We
now have a reasonably strong RT balancer, and I'm looking at
implementing a full partitioned EDF scheduler somewhere in the future.
This could never be done using isolcpus.
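(A minimal sketch of such a partition, assuming an 8-way box with cpusets
mounted at /dev/cpuset as in Max's example; the directory names and cpu
ranges are purely illustrative:)
# Drop the single system-wide sched domain ...
echo 0 > /dev/cpuset/cpuset.sched_load_balance
# ... and carve out two independently balanced domains.
mkdir /dev/cpuset/sys /dev/cpuset/rt
echo 0-3 > /dev/cpuset/sys/cpuset.cpus
echo 0   > /dev/cpuset/sys/cpuset.mems
echo 4-7 > /dev/cpuset/rt/cpuset.cpus
echo 0   > /dev/cpuset/rt/cpuset.mems
# RT tasks attached to /dev/cpuset/rt are now balanced across cpus 4-7
# only, completely separate from the general load on cpus 0-3.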
> I was responding to a need you noticed to isolate memory nodes (such as
> from stray glibc pages placed by init or the shell running early
> scripts), not to the need to isolate CPUs:
Yes, but in practice (enough memory for bootup) isolating CPUs
is equivalent to isolating nodes. So isolcpus=... tended to work.
I occasionally recommended it to people because it was much easier
to explain than replacing init.
The perfect solution would probably be to just fix it in init(8)
and make it parse some command line option that then sets up
the right cpusets.
But you asked for isolcpus=... use cases and I just wanted to describe
one.
> So perhaps it boils down to a question of which is easiest to do,
> the answer to which will vary depending on where you are in the food
> chain of distributions. Here "easy" means least likely to break
> something else. All these mechanisms are relatively trivial, until
> one has to deal with conflicting software packages, configurations and
> distributions, changing out from under oneself.
One solution would be to move isolcpus=/isolnodes= into init(8) and make
sure it's always statically linked. But failing that, keeping it in the
kernel is not too bad. It's certainly not a lot of code.
On the other hand, if the kernel implemented an isolnodes=... it would
be possible to exclude those nodes from the interleaving the kernel
does at boot, which might also be beneficial and give slightly
more isolation.
-Andi
Andi wrote:
> I occasionally recommended it to people because it was much easier
> to explain than replacing init.
Definitely ... I'd do the same in such cases.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214
Mark wrote:
> Yes to both questions. However after reading Max and Peter's response, I
> guess there is another, probably better or _only_, way to get what I really
> need anyway so please don't consider my intrusion into this thread as a NAK.
>
> I do not rely on this option as it is implemented.
Thank you for your clearly stated conclusions, and thanks for stopping by.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214
Paul Jackson wrote:
> Mark wrote:
>> What is an abomination is that cpusets are required for this type of
>> isolation to begin with, even on a 2-processor machine.
>
> Just to be sure I'm following you here: you're stating that you
> want to be able to manipulate the isolated cpu map at runtime,
> not just with the boot option isolcpus, right? And that this
> isolated cpu map should work just fine even on systems which do
> not have cpusets configured, right?
>
Yes to both questions. However after reading Max and Peter's response, I
guess there is another, probably better or _only_, way to get what I really
need anyway so please don't consider my intrusion into this thread as a NAK.
I do not rely on this option as it is implemented.
Regards
Mark
Paul Jackson wrote:
> Andi wrote:
>> I occasionally recommended it to people because it was much easier
>> to explain than replacing init.
>
> Definitely ... I'd do the same in such cases.
>
btw I'm putting together a set of scripts that I call "syspart" (for "system
partitioning") which creates cpusets, sets up IRQ affinity, etc. at boot. We
can make your init replacement a part of that package. You could then tell
people to
1. Install the syspart rpm
2. Change boot opts to "init=/sbin/syspart par0_cpus=0-3 par0_mems=0-2 par0_init"
The thing will then create /dev/cpuset/par0 with cpus 0-3 and mems 0-2 and put
/sbin/init into that cpuset.
With the exception of #1, it's as easy to explain as "isolcpus=".
Paul, I believe you mentioned a while ago that the tools you guys use for this
aren't open source. Has that changed? If not, I'll write my own. I have all
the scripts ready to go, but as you pointed out it has to be a standalone,
statically linked binary.
Thanx
Max
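(Roughly, the steps such a wrapper would go through -- written out as shell
here for readability, although as Max notes it really needs to be a standalone
statically linked binary; the par0_* parameter names come from his example and
everything else is illustrative:)
#!/bin/sh
# Parse par0_cpus=/par0_mems= off the kernel command line (this assumes
# /proc is already mounted), create the cpuset, confine pid 1 to it,
# then hand control to the real init.
for arg in $(cat /proc/cmdline); do
    case "$arg" in
        par0_cpus=*) cpus=${arg#par0_cpus=} ;;
        par0_mems=*) mems=${arg#par0_mems=} ;;
    esac
done
mkdir -p /dev/cpuset
mount -t cgroup -o cpuset cpuset /dev/cpuset
mkdir /dev/cpuset/par0
echo $cpus > /dev/cpuset/par0/cpuset.cpus
echo $mems > /dev/cpuset/par0/cpuset.mems
echo 1 > /dev/cpuset/par0/tasks    # we are pid 1 at this point
exec /sbin/init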
Max wrote:
> Paul, I believe you mentioned a while ago that the tools you guys use for this
> aren't open source. Has that changed?
No change.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214
Paul Jackson wrote:
> Max wrote:
>> Paul, I believe you mentioned a while ago that the tools you guys use for this
>> aren't open source. Has that changed?
>
> No change.
Ack.
What do you think of that idea btw? I.e. a generally available "syspart" thing.
Max
> What do you think of that idea btw? I.e. a generally available "syspart" thing.
I have no particular thoughts one way or the other.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214
[ sorry if this is going OT ]
Hi,
>
> Furthermore, cpusets allow for isolated but load-balanced RT domains. We
> now have a reasonably strong RT balancer, and I'm looking at implementing
> a full partitioned EDF scheduler somewhere in the future.
>
I'm working on a partitioned EDF scheduler right now, and I have to
face several issues, starting from the interface to use to expose the
EDF scheduler to userspace, and the integration with the existing
sched_rt policy.
For now I'm experimenting with an additional sched_class that implements
a SCHED_EDF policy, extending the POSIX struct sched_param with the
EDF parameters of the task; do you see any better way to do that?
Could that approach be reasonable?
Michael
On Wed, 2008-06-04 at 21:44 +0000, Michael Trimarchi wrote:
> [ sorry if this is going OT ]
>
> Hi,
>
> >
> > Furthermore, cpusets allow for isolated but load-balanced RT domains. We
> > now have a reasonably strong RT balancer, and I'm looking at implementing
> > a full partitioned EDF scheduler somewhere in the future.
> >
>
> I'm working on a partitioned EDF scheduler right now, and I have to
> face several issues, starting from the interface to use to expose the
> EDF scheduler to userspace, and the integration with the existing
> sched_rt policy.
I would add a sched_class above sched_rt and let sched_rt run in all
time left unclaimed by sched_edf.
Have you looked at deadline inheritance to replace PI? I think it can be
done reasonably simply by replacing the plist with an RB tree.
> For now I'm experimenting with an additional sched_class that implements
> a SCHED_EDF policy, extending the POSIX struct sched_param with the
> EDF parameters of the task; do you see any better way to do that?
> Could that approach be reasonable?
Yes, that is the way I'm leaning.
Hi,
> > I'm working on a partitioned EDF scheduler right now, and I have to
> > face several issues, starting from the interface to use to expose the
> > EDF scheduler to userspace, and the integration with the existing
> > sched_rt policy.
>
> I would add a sched_class above sched_rt and let sched_rt run in all
> time left unclaimed by sched_edf.
>
I added this class before sched_rt, so the next pointer of sched_edf
points to the sched_rt class.
> Have you looked at deadline inheritance to replace PI? I think it can be
> done reasonably simply by replacing the plist with an RB tree.
I think it can be done with an RB tree. The only tricky part would be
mixing tasks coming from the sched_edf and the sched_rt classes, but it
should not be a problem.
> > For now I'm experimenting with an additional sched_class that implements
> > a SCHED_EDF policy, extending the POSIX struct sched_param with the
> > EDF parameters of the task; do you see any better way to do that?
> > Could that approach be reasonable?
>
> Yes, that is the way I'm leaning.
Right now I'm facing some problems. It is still not clear to me what
parameters a task forked from a sched_edf task should get, as that would
involve some form of admission control, nor how to deal with tasks that
run longer than their nominal execution time (i.e., should we use some
server mechanism to limit the amount of CPU they're using, or handle that
in some other way?)
Michael
Peter Zijlstra wrote:
> On Wed, 2008-06-04 at 12:26 -0700, Max Krasnyansky wrote:
>> Mark Hounschell wrote:
>>> IMHO,
>>>
>>> What is an abomination is that cpusets are required for this type of
>>> isolation to begin with, even on a 2-processor machine.
>>>
>>> I would like the option to stay and be extended like Max originally
>>> proposed. If cpusets/hotplug are configured isolation would be obtained
>>> using them. If not then isolcpus could be used to get the same isolation.
>>>
>>> From a user land point of view, I just want an easy way to fully isolate
>>> a particular cpu. Even a new syscall or extension to sched_setaffinity
>>> would make me happy. Cpusets and hotplug don't.
>>>
>>> Again this is just MHO.
>> Mark, I used to be the same way and I'm a convert now. It does seem like
>> overkill for a 2-CPU machine to have cpusets and CPU hotplug. But both options
>> cost around 50KB worth of text and maybe another 10KB of data. That's on an
>> x86-64 box. Let's say it's 100KB. Not a terribly huge overhead.
>>
>> Now if you think about it, in order to be able to dynamically isolate a CPU we
>> have to do the exact same thing that CPU hotplug does, which is to clear all
>> timers, kernel threads, etc. from that CPU. It does not make sense to
>> implement separate logic for that. You could argue that you do not need
>> dynamic isolation, but that is too inflexible in general: even on 2-way machines
>> it's a waste not to be able to use the second CPU for general load while the RT
>> app is not running. Given that CPU hotplug is necessary for many things, including
>> suspend on multi-CPU machines, it's practically guaranteed to be very stable
>> and well supported. In other words we have a perfect synergy here :).
>>
>> Now, about the cpusets. You do not really have to do anything fancy with them.
>> If all you want to do is to disable system-wide load balancing:
>> mount -t cgroup -o cpuset cpuset /dev/cpuset
>> echo 0 > /dev/cpuset/cpuset.sched_load_balance
>>
>> That's it. You get _exactly_ the same effect as with isolcpus=. And you can
>> change that dynamically, and when you switch to quad- and eight-core machines
>> you'll be able to do that with groups of cpus, not just system wide.
>>
>> Just to complete the example above, let's say you want to isolate cpu2
>> (assuming that cpusets are already mounted).
>>
>> # Bring cpu2 offline
>> echo 0 > /sys/devices/system/cpu/cpu2/online
>>
>> # Disable system wide load balancing
>> echo 0 > /dev/cpuset/cpuset.sched_load_balance
>>
>> # Bring cpu2 back online
>> echo 1 > /sys/devices/system/cpu/cpu2/online
>>
>> Now if you want to un-isolate cpu2 you do
>>
>> # Re-enable system wide load balancing
>> echo 1 > /dev/cpuset/cpuset.sched_load_balance
>>
>> Of course this is not a complete isolation. There are also irqs (see my
>> "default irq affinity" patch), workqueues and the stop machine. I'm working on
>> those too and will release a .25-based cpuisol tree when I'm done.
>
Thanks for the detailed tutorial Max. I'm personally still very
skeptical. I really don't believe you'll ever be able to run multiple
_demanding_ RT environments on the same machine, no matter how many
processors you've got. But even though I might be wrong there, that's
actually OK with me. I, and I'm sure most, don't have a problem with
dedicating a machine to a single RT env.
You've got to hold your tongue just right, look at the right spot on the
wall, and be running the RT-patched kernel, all at the same time, to run
just one successfully. I just want to stop using my tongue and staring
at the wall. I personally feel that a single easy method of completely
isolating a single processor from the rest of the machine _might_
benefit the RT community more than all this fancy stuff coming down the
pipe. Something like your originally proposed isolcpus or even a simple
SCHED_ISOLATE arg to the setscheduler call.
> Furthermore, cpusets allow for isolated but load-balanced RT domains. We
> now have a reasonably strong RT balancer, and I'm looking at
> implementing a full partitioned EDF scheduler somewhere in the future.
>
> This could never be done using isolcpus.
I'm sure my thoughts reflect a gross underestimate of what really has
to happen. I will hope for the best and wait.
Regards
Mark
On Thu, 2008-06-05 at 11:16 +0000, Michael Trimarchi wrote:
> Hi,
>
> > > I'm working on a partitioned EDF scheduler right now, and I have to
> > > face several issues, starting from the interface to use to expose the
> > > EDF scheduler to userspace, and the integration with the existing
> > > sched_rt policy.
> >
> > I would add a sched_class above sched_rt and let sched_rt run in all
> > time left unclaimed by sched_edf.
> >
> I added this class before sched_rt, so the next pointer of sched_edf
> points to the sched_rt class.
Exactly.
> > Have you looked at deadline inheritance to replace PI? I think it can be
> > done reasonably simply by replacing the plist with an RB tree.
> I think it can be done with an RB tree. The only tricky part would be
> mixing tasks coming from the sched_edf and the sched_rt classes, but it
> should not be a problem.
Mapping them onto U64_MAX - prio or something like that ought to do it.
Handling wraparound of the timeline might get a little involved though -
then again, it takes realtime 584 years to wrap a 64-bit ns counter.
> > > For now I'm experimenting with an additional sched_class that implements
> > > a SCHED_EDF policy, extending the POSIX struct sched_param with the
> > > EDF parameters of the task; do you see any better way to do that?
> > > Could that approach be reasonable?
> >
> > Yes, that is the way I'm leaning.
> Right now I'm facing some problems. It is still not clear to me what
> parameters a task forked from a sched_edf task should get, as that would
> involve some form of admission control,
I'd start with something like:
u64 sched_param::edf_period [ns]
u64 sched_param::edf_runtime [ns]
so that deadline = time_of_schedule + edf_period, and its allowance
within that period is edf_runtime.
fork would inherit the parent's settings, and we'd need to do admission
control on all tasks entering SCHED_EDF, either through setscheduler()
or fork(). We could fail with -ENOTIME or somesuch error.
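(For a single CPU the admission test would presumably be the classic EDF
utilisation bound -- textbook EDF theory, not something settled in this
thread:)
    sum over all SCHED_EDF tasks of  edf_runtime / edf_period  <=  1
e.g. two tasks with 2ms/10ms and 5ms/20ms use 0.2 + 0.25 = 0.45 of the CPU,
so a later setscheduler() asking for 15ms/20ms (0.75) is the point where we
would return the error.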
> and how to deal with tasks that run longer than their nominal
> execution time (i.e., should we use some server mechanism to limit the
> amount of cpu they're using, or handle that in some other way?)
Yeah - we already account the amount of runtime, we can send them
SIGXCPU and stop running them. Look at the rt_bandwidth code upstream -
it basically stops rt task groups from running once their quota is
depleted - waking them up once it gets refreshed due to the period
expiring.
For single tasks it's easier: just account their time and dequeue them
once it exceeds the quota, and enqueue them on a refresh timer thingy to
start them again once the period rolls over.
The only tricky bit here is PI :-) it would need to keep running despite
being over quota.
> From: Peter Zijlstra <[email protected]>
> Date: Thu, Jun 05, 2008 02:07:40PM +0200
>
> On Thu, 2008-06-05 at 11:16 +0000, Michael Trimarchi wrote:
...
> > Right now I'm facing some problems. It is still not clear to me what
> > parameters a task forked from a sched_edf task should get, as that would
> > involve some form of admission control,
>
> I'd start with something like:
>
> u64 sched_param::edf_period [ns]
> u64 sched_param::edf_runtime [ns]
>
> so that deadline = time_of_schedule + edf_period, and his allowance
> within that period is edf_runtime.
>
This is what I'm doing right now (apart from using timespec structs
instead of u64 values, to stay aligned with the sched_param struct
specified by POSIX on systems with SCHED_SPORADIC support).
I'll clean up the code and post it here in the next few days.
> fork would inherit the parent's settings, and we'd need to do admission
> control on all tasks entering SCHED_EDF, either through setscheduler()
> or fork(). We could fail with -ENOTIME or somesuch error.
>
> > and how to deal with tasks that run longer than their nominal
> > execution time (i.e., should we use some server mechanism to limit the
> > amount of cpu they're using, or handle that in some other way?)
>
> Yeah - we already account the amount of runtime, we can send them
> SIGXCPU and stop running them. Look at the rt_bandwidth code upstream -
> it basically stops rt task groups from running once their quota is
> depleted - waking them up once it gets refreshed due to the period
> expiring.
>
> For single tasks its easier, just account their time and dequeue them
> once it exceeds the quota, and enqueue them on a refresh timer thingy to
> start them again once the period rolls over.
>
Ok, using the same mechanism even for SCHED_EDF tasks seems the
right way to go.
> The only tricky bit here is PI :-) it would need to keep running despite
> being over quota.
>
There is some work in this area, and there are some protocols
handling that, but that simple solution will be a good starting
point.
Mark Hounschell wrote:
>
> Thanks for the detailed tutorial Max. I'm personally still very
> skeptical. I really don't believe you'll ever be able to run multiple
> _demanding_ RT environments on the same machine, no matter how many
> processors you've got. But even though I might be wrong there, that's
> actually OK with me. I, and I'm sure most, don't have a problem with
> dedicating a machine to a single RT env.
>
> You've got to hold your tongue just right, look at the right spot on the
> wall, and be running the RT patched kernel, all at the same time, to run
> just one successfully. I just want to stop using my tongue and staring
> at the wall.
I understand your scepticism, but it's quite easy to do these days. Yes, there
are certain restrictions on how RT applications have to be designed, but it's
definitely not rocket science. It can be summed up in a few words:
"cpu isolation, lock-free communication and memory management,
and direct HW access"
In other words, you want to talk using lock-free queues and mempools between
soft- and hard-RT components and use something like libe1000.sf.net to talk
to the outside world.
There are other approaches of course, those involve RT kernels, Xenomai, etc.
As I mentioned a while ago, we (here at Qualcomm) actually implemented a
full-blown UMB (one of the 4G broadband technologies) basestation that runs
the entire MAC and part of the PHY layer in user space using CPU isolation
techniques. Vanilla 2.6.17 to .24 kernels + cpuisol and off-the-shelf
dual-Opteron and Core2Duo based machines. We have very, very tight deadlines
and yet everything works just fine. And no, we don't have to do any special
tongue holding or other rituals :) for it to work. In fact quite the opposite:
I can do full SW (kernel, etc.) builds and do just about anything else while
our basestation application is running. Worst case latency in the RT thread
running on the isolated CPU is about ~1.5usec.
Now I have switched to 8-way Core2Quad machines. I can run 7 RT engines on 7
isolated CPUs and load cpu0. Latencies are a bit higher, 5-6 usec (I'm guessing
due to shared caches and stuff), but otherwise it works fine. This is with the
2.6.25.4-cpuisol2 tree and syspart (syspart is a set of scripts for setting up
system partitions). I'll release both either later today or early next week.
So I think you're underestimating the power of Linux and CPU isolation ;-).
> I personally feel that a single easy method of completely
> isolating a single processor from the rest of the machine _might_
> benefit the RT community more than all this fancy stuff coming down the
> pipe. Something like your originally proposed isolcpus or even a simple
> SCHED_ISOLATE arg to the setscheduler call.
Yes, it may seem that way. But as I explained in the previous email, in order
to actually implement something like that we'd need to reimplement parts of
cpusets and CPU hotplug. I'm not sure if you noticed or not, but my
original patch actually relied on CPU hotplug anyway, simply because it
makes no sense not to use the awesome powers of hotplug, which can migrate
_everything_ running on one CPU to another CPU.
And the cpuset.sched_load_balance flag provides equivalent functionality for
controlling scheduler domains and the load balancer.
Other stuff like workqueues has to be dealt with in either case. So what I'm
getting at is that you get equivalent functionality.
Max