2008-02-22 02:39:00

by Max Krasnyansky

Subject: [PATCH sched-devel 0/7] CPU isolation extensions

Ingo,

As you suggested I'm sending CPU isolation patches for review/inclusion into
sched-devel tree. They are against 2.6.25-rc2.
You can also pull them from my GIT tree at
git://git.kernel.org/pub/scm/linux/kernel/git/maxk/cpuisol-2.6.git master

Diffstat:
b/Documentation/ABI/testing/sysfs-devices-system-cpu | 41 ++++++
b/Documentation/cpu-isolation.txt | 114 ++++++++++++++++++-
b/arch/x86/Kconfig | 1
b/arch/x86/kernel/genapic_flat_64.c | 5
b/drivers/base/cpu.c | 48 ++++++++
b/include/linux/cpumask.h | 3
b/kernel/Kconfig.cpuisol | 15 ++
b/kernel/Makefile | 4
b/kernel/cpu.c | 49 ++++++++
b/kernel/sched.c | 37 ------
b/kernel/stop_machine.c | 9 +
b/kernel/workqueue.c | 31 +++--
kernel/Kconfig.cpuisol | 56 ++++++---
kernel/cpu.c | 16 +-
14 files changed, 356 insertions(+), 73 deletions(-)

List of commits
cpuisol: Make cpu isolation configrable and export isolated map
cpuisol: Do not route IRQs to the CPUs isolated at boot
cpuisol: Do not schedule workqueues on the isolated CPUs
cpuisol: Move on-stack array used for boot cmd parsing into __initdata
cpuisol: Documentation updates
cpuisol: Minor updates to the Kconfig options
cpuisol: Do not halt isolated CPUs with Stop Machine

This patch series extends CPU isolation support.
The primary idea here is to be able to use some CPU cores as dedicated engines for running
user-space code with minimal kernel overhead/intervention; think of it as an SPE in the
Cell processor. I'd like to be able to run a CPU-intensive (100%) RT task on one of the
processors without adversely affecting or being affected by the other system activities.
System activities here include _kernel_ activities as well.

I'm personally using this for hard realtime purposes. With CPU isolation it's very easy to
achieve single-digit usec worst case and around 200 nsec average response times on off-the-shelf
multi-processor/core systems (vanilla kernel plus these patches) even under extreme system load.
I'm working with legal folks on releasing a hard RT user-space framework for that.
I believe that with the current multi-core CPU trend we will see more and more applications that
exploit this capability: RT gaming engines, simulators, hard RT apps, etc.

Hence the proposal is to extend the current CPU isolation feature.
The new definition of CPU isolation would be:
---
1. Isolated CPU(s) must not be subject to scheduler load balancing
Users must explicitly bind threads in order to run on those CPU(s).

2. By default interrupts must not be routed to the isolated CPU(s)
User must route interrupts (if any) to those CPUs explicitly.

3. In general kernel subsystems must avoid activity on the isolated CPU(s) as much as possible
Includes workqueues, per CPU threads, etc.
This feature is configurable and is disabled by default.
---

I've been maintaining this stuff since around 2.6.18 and it's been running in production
environments for a couple of years now. It's been tested on all kinds of machines, from NUMA
boxes like HP xw9300/9400 to tiny uTCA boards like Mercury AXA110.
The messiest part used to be the SLAB garbage collector changes. With the new SLUB all that mess
goes away (ie no changes necessary). Also CFS seems to handle CPU hotplug much better than O(1)
did (ie domains are recomputed dynamically), so isolation can be done at any time (via sysfs).
So this seems like a good time to merge.

We've had scheduler support for CPU isolation ever since the O(1) scheduler went in. In other words
#1 is already supported. These patches do not change/affect that functionality in any way.
#2 is a trivial one-liner change to the IRQ init code.
#3 is addressed by a couple of separate patches. The main problem here is that an RT thread can prevent
kernel threads from running, and the machine gets stuck because other CPUs are waiting for those threads
to run and report back.
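
To illustrate, here is a rough sketch (not the literal patch) of what exporting the
isolated map looks like to the rest of the kernel: a cpumask in the style of
cpu_online_map plus a cpu_isolated() test. The pick_housekeeping_cpu() helper below
is made up purely for the example.

extern cpumask_t cpu_isolated_map;      /* CPUs isolated at boot or via sysfs */

#define cpu_isolated(cpu)       cpu_isset((cpu), cpu_isolated_map)

/* Example consumer (illustrative only): keep background work off the
 * isolated CPUs by picking a housekeeping CPU instead. */
static int pick_housekeeping_cpu(void)
{
        int cpu;

        for_each_online_cpu(cpu)
                if (!cpu_isolated(cpu))
                        return cpu;

        return smp_processor_id();      /* nothing left, fall back to current */
}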

Folks involved in the scheduler/cpuset development provided a lot of feedback on the first series
of patches. I believe I managed to explain and clarify every aspect.
Paul Jackson initially suggested implementing #2 and #3 using the cpusets subsystem. Paul and I looked
at it more closely and determined that exporting cpu_isolated_map instead is a better option.
Details here
http://marc.info/?l=linux-kernel&m=120180692331461&w=2

Last patch to the stop machine is potentially unsafe and is marked as experimental. Unfortunately
it's currently the only option that allows dynamic module insertion/removal for above scenarios.
From the previous discussions it's the only controversial part.

Thanx
Max


2008-02-22 08:36:49

by Dmitry Adamushko

Subject: Re: [PATCH sched-devel 0/7] CPU isolation extensions

Hi Max,

> [ ... ]
> Last patch to the stop machine is potentially unsafe and is marked as experimental. Unfortunately
> it's currently the only option that allows dynamic module insertion/removal for above scenarios.

I'm puzzled by the following part (can be a misunderstanding from my side)

+config CPUISOL_STOPMACHINE
+ bool "Do not halt isolated CPUs with Stop Machine (EXPERIMENTAL)"
+ depends on CPUISOL && STOP_MACHINE && EXPERIMENTAL
+ help
+ If this option is enabled kernel will not halt isolated CPUs
+ when Stop Machine is triggered. Stop Machine is currently only
+ used by the module insertion and removal.

this "only" part. What about e.g. a 'cpu hotplug' case (_cpu_down())?
(or we should abstract it a bit to the point that e.g. a cpu can be
considered as 'a module'? :-)


--
Best regards,
Dmitry Adamushko

2008-02-22 11:44:27

by Peter Zijlstra

Subject: Re: [PATCH sched-devel 0/7] CPU isolation extensions


On Thu, 2008-02-21 at 18:38 -0800, Max Krasnyanskiy wrote:

> As you suggested I'm sending CPU isolation patches for review/inclusion into
> sched-devel tree. They are against 2.6.25-rc2.
> You can also pull them from my GIT tree at
> git://git.kernel.org/pub/scm/linux/kernel/git/maxk/cpuisol-2.6.git master

Post patches! I can't review a git tree..

> Diffstat:
> b/Documentation/ABI/testing/sysfs-devices-system-cpu | 41 ++++++
> b/Documentation/cpu-isolation.txt | 114 ++++++++++++++++++-
> b/arch/x86/Kconfig | 1
> b/arch/x86/kernel/genapic_flat_64.c | 5
> b/drivers/base/cpu.c | 48 ++++++++
> b/include/linux/cpumask.h | 3
> b/kernel/Kconfig.cpuisol | 15 ++
> b/kernel/Makefile | 4
> b/kernel/cpu.c | 49 ++++++++
> b/kernel/sched.c | 37 ------
> b/kernel/stop_machine.c | 9 +
> b/kernel/workqueue.c | 31 +++--
> kernel/Kconfig.cpuisol | 56 ++++++---
> kernel/cpu.c | 16 +-
> 14 files changed, 356 insertions(+), 73 deletions(-)
>
> List of commits
> cpuisol: Make cpu isolation configrable and export isolated map

cpu_isolated_map was a bad hack when it was introduced, I feel we should
deprecate it and fully integrate the functionality into cpusets. That would
give a much more flexible end-result.

CPU-sets can already isolate cpus by either creating a cpu outside of any set,
or a set with a single cpu not shared by any other sets.

This also allows for isolated groups, there are good reasons to isolate groups,
esp. now that we have a stronger RT balancer. SMP and hard RT are not
exclusive. A design that does not take that into account is too rigid.

> cpuisol: Do not route IRQs to the CPUs isolated at boot

From the diffstat you're not touching the genirq stuff, but instead hack a
single architecture to support this feature. Sounds like an ill designed hack.

A better approach would be to add a flag to the cpuset infrastructure that says
whether it's a system set or not. A system set would be one that services the
general purpose OS and would include things like the IRQ affinity and unbound
kernel threads (including unbound workqueues - or single workqueues). This flag
would default to on, and by switching it off for the root set, and a select
subset you would push the System away from those cpus, thereby isolating them.
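
Purely as a hypothetical illustration of that flag (none of these names exist in
the kernel, they only make the proposal concrete): a per-cpuset boolean plus a
helper that gathers the CPUs of every set still marked as servicing the general
purpose OS; IRQ affinity and unbound kernel threads would then be restricted to
that mask.

struct cpuset_sketch {
        cpumask_t cpus_allowed;
        int       is_system;    /* defaults to 1; clear it to isolate */
};

/* Union of the CPUs in all cpusets that still service the general OS. */
static cpumask_t system_cpus(const struct cpuset_sketch *sets, int nr)
{
        cpumask_t mask = CPU_MASK_NONE;
        int i;

        for (i = 0; i < nr; i++)
                if (sets[i].is_system)
                        cpus_or(mask, mask, sets[i].cpus_allowed);
        return mask;
}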

> cpuisol: Do not schedule workqueues on the isolated CPUs

(per-cpu workqueues, the single ones are treated in the previous section)

I still strongly disagree with this approach. Workqueues are passive, they
don't do anything unless work is provided to them. By blindly not starting them
you handicap the system and services that rely on them.

(you even acknowledged this problem, by saying it breaks oprofile for instance
- still trying to push a change that knowingly breaks a lot of stuff is bad
manners on lkml and not acceptable for mainline)

The way to do this is to avoid the generation of work, not the execution of it.

> cpuisol: Move on-stack array used for boot cmd parsing into __initdata
> cpuisol: Documentation updates
> cpuisol: Minor updates to the Kconfig options

No idea about these patches,...

> cpuisol: Do not halt isolated CPUs with Stop Machine

Very strong NACK on this one, it breaks a lot of functionality in non-obvious
ways, as has been pointed out to you numerous times. Such patches are just not
acceptable for mainline - full stop.


Please, the ideas are not bad and have been suggested in the context of -rt a
number of times, it's just the execution. Design features so that they are
flexible and can be used in ways you never imagined them. But more
importantly so that they don't knowingly break tons of stuff.

2008-02-22 13:41:22

by Mark Hounschell

Subject: Re: [PATCH sched-devel 0/7] CPU isolation extensions

Peter Zijlstra wrote:
> On Thu, 2008-02-21 at 18:38 -0800, Max Krasnyanskiy wrote:
>
>> As you suggested I'm sending CPU isolation patches for review/inclusion into
>> sched-devel tree. They are against 2.6.25-rc2.
>> You can also pull them from my GIT tree at
>> git://git.kernel.org/pub/scm/linux/kernel/git/maxk/cpuisol-2.6.git master
>
> Post patches! I can't review a git tree..
>

Max, could you also post them for 2.6.24.2 stable please. Thanks


>> Diffstat:
>> b/Documentation/ABI/testing/sysfs-devices-system-cpu | 41 ++++++
>> b/Documentation/cpu-isolation.txt | 114 ++++++++++++++++++-
>> b/arch/x86/Kconfig | 1
>> b/arch/x86/kernel/genapic_flat_64.c | 5
>> b/drivers/base/cpu.c | 48 ++++++++
>> b/include/linux/cpumask.h | 3
>> b/kernel/Kconfig.cpuisol | 15 ++
>> b/kernel/Makefile | 4
>> b/kernel/cpu.c | 49 ++++++++
>> b/kernel/sched.c | 37 ------
>> b/kernel/stop_machine.c | 9 +
>> b/kernel/workqueue.c | 31 +++--
>> kernel/Kconfig.cpuisol | 56 ++++++---
>> kernel/cpu.c | 16 +-
>> 14 files changed, 356 insertions(+), 73 deletions(-)
>>
>> List of commits
>> cpuisol: Make cpu isolation configrable and export isolated map
>
> cpu_isolated_map was a bad hack when it was introduced, I feel we should
> deprecate it and fully integrate the functionality into cpusets. That would
> give a much more flexible end-result.
>
> CPU-sets can already isolate cpus by either creating a cpu outside of any set,
> or a set with a single cpu not shared by any other sets.
>

Peter, what about when I am NOT using cpusets and they are disabled in my config, but
I still want to use this?

> This also allows for isolated groups, there are good reasons to isolate groups,
> esp. now that we have a stronger RT balancer. SMP and hard RT are not
> exclusive. A design that does not take that into account is too rigid.
>
>> cpuisol: Do not route IRQs to the CPUs isolated at boot
>
> From the diffstat you're not touching the genirq stuff, but instead hack a
> single architecture to support this feature. Sounds like an ill designed hack.
>
> A better approach would be to add a flag to the cpuset infrastructure that says
> whether its a system set or not. A system set would be one that services the
> general purpose OS and would include things like the IRQ affinity and unbound
> kernel threads (including unbound workqueues - or single workqueues). This flag
> would default to on, and by switching it off for the root set, and a select
> subset you would push the System away from those cpus, thereby isolating them.
>
>> cpuisol: Do not schedule workqueues on the isolated CPUs
>
> (per-cpu workqueues, the single ones are treated in the previous section)
>
> I still strongly disagree with this approach. Workqueues are passive, they
> don't do anything unless work is provided to them. By blindly not starting them
> you handicap the system and services that rely on them.
>

Have things changed since my first bad encounter with Workqueues?
I am referring to this thread.

http://kerneltrap.org/mailarchive/linux-kernel/2007/5/29/97039

> (you even acknowledged this problem, by saying it breaks oprofile for instance
> - still trying to push a change that knowingly breaks a lot of stuff is bad
> manners on lkml and not acceptable for mainline)
>
> The way to do this is to avoid the generation of work, not the execution of it.
>
>> cpuisol: Move on-stack array used for boot cmd parsing into __initdata
>> cpuisol: Documentation updates
>> cpuisol: Minor updates to the Kconfig options
>
> No idea about these patches,...
>
>> cpuisol: Do not halt isolated CPUs with Stop Machine
>
> Very strong NACK on this one, it breaks a lot of functionality in non-obvious
> ways, as has been pointed out to you numerous times. Such patches are just not
> acceptable for mainline - full stop.
>
>

Mark

2008-02-22 14:00:22

by Peter Zijlstra

Subject: Re: [PATCH sched-devel 0/7] CPU isolation extensions


On Fri, 2008-02-22 at 08:38 -0500, Mark Hounschell wrote:

> >> List of commits
> >> cpuisol: Make cpu isolation configrable and export isolated map
> >
> > cpu_isolated_map was a bad hack when it was introduced, I feel we should
> > deprecate it and fully integrate the functionality into cpusets. That would
> > give a much more flexible end-result.
> >
> > CPU-sets can already isolate cpus by either creating a cpu outside of any set,
> > or a set with a single cpu not shared by any other sets.
> >
>
> Peter, what about when I am NOT using cpusets and are disabled in my config but
> I still want to use this?

Then you enable it?

> >> cpuisol: Do not schedule workqueues on the isolated CPUs
> >
> > (per-cpu workqueues, the single ones are treated in the previous section)
> >
> > I still strongly disagree with this approach. Workqueues are passive, they
> > don't do anything unless work is provided to them. By blindly not starting them
> > you handicap the system and services that rely on them.
> >
>
> Have things changed since since my first bad encounter with Workqueues.
> I am referring to this thread.
>
> http://kerneltrap.org/mailarchive/linux-kernel/2007/5/29/97039

Just means you get to fix those problems. By blindly not starting them
you introduce others.

2008-02-22 21:04:34

by Max Krasnyansky

Subject: Re: [PATCH sched-devel 0/7] CPU isolation extensions

Dmitry Adamushko wrote:
> Hi Max,
>
>> [ ... ]
>> Last patch to the stop machine is potentially unsafe and is marked as experimental. Unfortunately
>> it's currently the only option that allows dynamic module insertion/removal for above scenarios.
>
> I'm puzzled by the following part (can be a misunderstanding from my side)
>
> +config CPUISOL_STOPMACHINE
> + bool "Do not halt isolated CPUs with Stop Machine (EXPERIMENTAL)"
> + depends on CPUISOL && STOP_MACHINE && EXPERIMENTAL
> + help
> + If this option is enabled kernel will not halt isolated CPUs
> + when Stop Machine is triggered. Stop Machine is currently only
> + used by the module insertion and removal.
>
> this "only" part. What about e.g. a 'cpu hotplug' case (_cpu_down())?
> (or we should abstract it a bit to the point that e.g. a cpu can be
> considered as 'a module'? :-)

My bad. As you and other folks pointed out, stopmachine is used in a few other places
besides module loading. We had a discussion about this a while ago and I just forgot
to update that text. Will do.

Max

2008-02-22 22:05:35

by Max Krasnyansky

Subject: Re: [PATCH sched-devel 0/7] CPU isolation extensions

Peter Zijlstra wrote:
> On Thu, 2008-02-21 at 18:38 -0800, Max Krasnyanskiy wrote:
>
>> As you suggested I'm sending CPU isolation patches for review/inclusion into
>> sched-devel tree. They are against 2.6.25-rc2.
>> You can also pull them from my GIT tree at
>> git://git.kernel.org/pub/scm/linux/kernel/git/maxk/cpuisol-2.6.git master
>
> Post patches! I can't review a git tree..
I did. But it looks like I screwed up the --cc list. I just resent them.

>> List of commits
>> cpuisol: Make cpu isolation configrable and export isolated map
>
> cpu_isolated_map was a bad hack when it was introduced, I feel we should
> deprecate it and fully integrate the functionality into cpusets. That would
> give a much more flexible end-result.
That's not currently possible and will introduce a lot of complexity.
I'm pretty sure you missed the discussion I had with Paul (you were cc'ed on that btw).
In fact I provided the link to that discussion in the original email.
Here it is again:
http://marc.info/?l=linux-kernel&m=120180692331461&w=2

Basically the problem is very simple. CPU isolation needs a simple/efficient way to check
if CPU N is isolated. The cpuset/cgroup APIs are not designed for that. In order to figure
out if CPU N is isolated one has to iterate through all cpusets and check their cpu maps.
That requires several levels of locking (cgroup and cpuset).
The other issue is that cpusets are a bit too dynamic (see the thread above for more details);
we'd need notification mechanisms to tell subsystems when a CPU becomes isolated. Again, more
complexity. Since I integrated cpu isolation with cpu hotplug it's already addressed in a nice
simple way.
Please take a look at that discussion. I do not think it's worth the effort to put this into
cpusets. cpu_isolated_map is a very clean and simple concept and integrates nicely with the
rest of the cpu maps, ie it's very much the same concept and API as cpu_online_map, etc.

If we want to integrate this stuff with cpusets I think the best approach would be to
have cpusets update the cpu_isolated_map just like they currently update scheduler domains.

> CPU-sets can already isolate cpus by either creating a cpu outside of any set,
> or a set with a single cpu not shared by any other sets.
This only works for user-space. As I mentioned above, for full CPU isolation various kernel
subsystems need to be aware that CPUs are isolated in order to avoid activities on them.

> This also allows for isolated groups, there are good reasons to isolate groups,
> esp. now that we have a stronger RT balancer. SMP and hard RT are not
> exclusive. A design that does not take that into account is too rigid.
You're thinking scheduling only. Paul had the same confusion ;-)
As I explained before I'm redefining (or proposing to redefine) CPU isolation to

1. Isolated CPU(s) must not be subject to scheduler load balancing
Users must explicitly bind threads in order to run on those CPU(s).

2. By default interrupts must not be routed to the isolated CPU(s)
User must route interrupts (if any) to those CPUs explicitly.

3. In general kernel subsystems must avoid activity on the isolated CPU(s) as much as possible
Includes workqueues, per CPU threads, etc.
This feature is configurable and is disabled by default.

Only #1 has to do with scheduling. The rest has _nothing_ to do with it.

>> cpuisol: Do not route IRQs to the CPUs isolated at boot
>
> From the diffstat you're not touching the genirq stuff, but instead hack a
> single architecture to support this feature. Sounds like an ill designed hack.
Ah, good point. These patches started before genirq was merged and I did not realize
that there is a way to set default irq affinity with genirq.
I'll definitely take a look at that.

> A better approach would be to add a flag to the cpuset infrastructure that says
> whether its a system set or not. A system set would be one that services the
> general purpose OS and would include things like the IRQ affinity and unbound
> kernel threads (including unbound workqueues - or single workqueues). This flag
> would default to on, and by switching it off for the root set, and a select
> subset you would push the System away from those cpus, thereby isolating them.
You're talking about a very high level API. I'm totally all for it. What the patches deal
with is the actual low level stuff that is needed to do the "push the system away from those cpus".
As I mentioned above, we can have cpusets update cpu_isolated_map, for example.

>> cpuisol: Do not schedule workqueues on the isolated CPUs
>
> (per-cpu workqueues, the single ones are treated in the previous section)
>
> I still strongly disagree with this approach. Workqueues are passive, they
> don't do anything unless work is provided to them. By blindly not starting them
> you handicap the system and services that rely on them.
Oh boy, back to square one. I covered this already.
I even started a thread on that and explained what this is and why it's needed.
http://marc.info/?l=linux-kernel&m=120217173001671&w=2
You guys never replied.
Anyway. Workqueues are indeed passive for the most part. But there are several conditions
when _every_ workqueue thread is expected to report back. Look for example at how workqueues
are flushed (see the thread above for the code example). In short, it schedules dummy work on each
online CPU and _waits_ for it to report back.
As I mentioned before, I also thought that they are passive. They are _not_. The way I noticed the
problems in the first place is when systems started hanging while unmounting NFS mounts because
workqueue threads could not run due to higher-priority user-space threads.
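
A simplified sketch of that flush behaviour (not the actual 2.6.25 code;
queue_barrier_on_cpu() is a made-up stand-in for the real barrier insertion):

static void flush_workqueue_sketch(struct workqueue_struct *wq)
{
        int cpu;

        for_each_online_cpu(cpu) {
                struct completion done;

                init_completion(&done);
                /* queue a dummy work item on this CPU's worker thread */
                queue_barrier_on_cpu(wq, cpu, &done);
                /* hangs forever if that CPU's worker is starved by an RT task */
                wait_for_completion(&done);
        }
}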

Your last statement above (about the handicap) does not make much sense with respect to CPU
isolation. I mean we cannot have it both ways: for isolated CPUs not to run kernel subsystems
and at the same time say that we must not impact the system. Of course it impacts the system,
in a predictable and controlled way. Non-isolated CPUs are not affected at all. And app threads
that need full system services should not be running on the isolated CPUs.

> (you even acknowledged this problem, by saying it breaks oprofile for instance
> - still trying to push a change that knowingly breaks a lot of stuff is bad
> manners on lkml and not acceptable for mainline)
Uh, hold on. How did "it breaks oprofile" translate into "breaks a lot of stuff"?
As far as I know it only affects oprofile. And I did test the heck out of this. As I mentioned,
we're running our full-fledged production systems with these patches. Surely I may have missed
some other feature that may break, but saying that it breaks a lot of things is a major
overstatement.
Most subsystems start workqueue threads on each CPU purely for load balancing. And like I said
above, we cannot have it both ways: either the CPU is isolated and does not participate in load
balancing or it's not.

>> cpuisol: Do not halt isolated CPUs with Stop Machine
>
> Very strong NACK on this one, it breaks a lot of functionality in non-obvious
> ways, as has been pointed out to you numerous times. Such patches are just not
> acceptable for mainline - full stop.
Ok Peter. You're not participating in the stop machine discussions yet you're ok with making statements
like that. I did mark this as experimental and explained that at this point it's the only way to
support module loading/unloading with the use case I described. We're thinking about ways to make
module load/unload not use stopmachine, in which case the patch will not be necessary. For now I'd
rather give users an option than no options.
Anyway, I'm ok if this particular patch does not go in at this point. I'll drop it from the next
submissions for now, until we figure out how to not use stopmachine or something like that.

> Please, the ideas are not bad and have been suggested in the context of -rt a
> number of times, its just the execution. Design features so that they are
> flexible and can be used in ways you never imagined them. But more
> importantly so that they don't knowingly break tons of stuff.
"breaks a lot of stuff" is now "breaks a ton of stuff" ;-).
I agree about the stop machine thing. Everything else though I believe is very clean. Hopefully
my explanation above clarified the design choices better this time.
I'm open to suggestions.

Max

2008-02-22 22:08:29

by Max Krasnyansky

Subject: Re: [PATCH sched-devel 0/7] CPU isolation extensions

Mark Hounschell wrote:
> Peter Zijlstra wrote:
>> On Thu, 2008-02-21 at 18:38 -0800, Max Krasnyanskiy wrote:
>>
>>> As you suggested I'm sending CPU isolation patches for review/inclusion into
>>> sched-devel tree. They are against 2.6.25-rc2.
>>> You can also pull them from my GIT tree at
>>> git://git.kernel.org/pub/scm/linux/kernel/git/maxk/cpuisol-2.6.git master
>>
>> Post patches! I can't review a git tree..
>>
> Max, could you also post them for 2.6.24.2 stable please. Thanks
Will do.

Max

2008-02-22 22:26:18

by Max Krasnyansky

Subject: Re: [PATCH sched-devel 0/7] CPU isolation extensions

Peter Zijlstra wrote:
> On Fri, 2008-02-22 at 08:38 -0500, Mark Hounschell wrote:
>
>>>> List of commits
>>>> cpuisol: Make cpu isolation configrable and export isolated map
>>>
>>> cpu_isolated_map was a bad hack when it was introduced, I feel we should
>>> deprecate it and fully integrate the functionality into cpusets. That would
>>> give a much more flexible end-result.
>>>
>>> CPU-sets can already isolate cpus by either creating a cpu outside of any set,
>>> or a set with a single cpu not shared by any other sets.
>>>
>> Peter, what about when I am NOT using cpusets and are disabled in my config but
>> I still want to use this?
>
> Then you enable it?
I'm with Mark on this one. For example if I have a two-core machine I do not need cpusets
to manage it.
Plus, like I explained in the previous email, cpusets are a higher level API. We can think of a way to
integrate them if needed.

>>>> cpuisol: Do not schedule workqueues on the isolated CPUs
>>>
>>> (per-cpu workqueues, the single ones are treated in the previous section)
>>>
>>> I still strongly disagree with this approach. Workqueues are passive, they
>>> don't do anything unless work is provided to them. By blindly not starting them
>>> you handicap the system and services that rely on them.
>>>
>> Have things changed since since my first bad encounter with Workqueues.
>> I am referring to this thread.
>>
>> http://kerneltrap.org/mailarchive/linux-kernel/2007/5/29/97039
>
> Just means you get to fix those problems. By blindly not starting them
> you introduce others.

Please give me an example of what you have in mind.
Also if you look at the patch (which I've now posted properly) it's not just not starting them.
I also redirect all future scheduled work to a non-isolated CPU, ie if work is scheduled on an
isolated CPU that work is treated as if the workqueue were single threaded. As I explained before,
most subsystems do not care which CPU actually gets to execute the work. Oprofile is the only
one I know of that breaks because it cannot collect the stats from the isolated CPUs. I'm thinking
of a different solution for oprofile, maybe collecting samples through IPIs or something.
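
Roughly like this (a sketch of the idea, not the patch itself; __queue_work_on()
and singlethread_cpu are stand-in names):

static void queue_work_redirected(struct workqueue_struct *wq,
                                  struct work_struct *work)
{
        int cpu = smp_processor_id();

        /* Work submitted from an isolated CPU goes to the CPU a single
         * threaded workqueue would use, instead of staying local. */
        if (cpu_isolated(cpu))
                cpu = singlethread_cpu;

        __queue_work_on(cpu, wq, work); /* stand-in for the low-level insert */
}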

Max

2008-02-23 13:05:59

by Peter Zijlstra

Subject: Re: [PATCH sched-devel 0/7] CPU isolation extensions


Please, wrap your emails at 78 - most mailers can do this.

On Fri, 2008-02-22 at 14:05 -0800, Max Krasnyanskiy wrote:
> Peter Zijlstra wrote:
> > On Thu, 2008-02-21 at 18:38 -0800, Max Krasnyanskiy wrote:
> >

> >> List of commits
> >> cpuisol: Make cpu isolation configrable and export isolated map
> >
> > cpu_isolated_map was a bad hack when it was introduced, I feel we should
> > deprecate it and fully integrate the functionality into cpusets. That would
> > give a much more flexible end-result.
> That's not not currently possible and will introduce a lot of complexity.
> I'm pretty sure you missied the discussion I had with Paul (you were cc'ed on that btw).
> In fact I provided the link to that discussion in the original email.
> Here it is again:
> http://marc.info/?l=linux-kernel&m=120180692331461&w=2

I read it, I just firmly disagree.

> Basically the problem is very simple. CPU isolation needs a simple/efficient way to check
> if CPU N is isolated.

I'm not seeing the need for that outside of setting up the various
states. That is, once all the affinities are set up, you'd hardly ever
(or should - imho) need to know if a particular CPU is isolated or
not.

> cpuset/cgroup APIs are not designed for that. In other to figure
> out if a CPU N is isolated one has to iterate through all cpusets and checks their cpu maps.
> That requires several levels of locking (cgroup and cpuset).

> The other issue is that cpusets are a bit too dynamic (see the thread above for more details)
> we'd need notified mechanisms to notify subsystems when a CPUs become isolated. Again more
> complexity. Since I integrated cpu isolation with cpu hotplug it's already addressed in a nice
> simple way.

I guess you have another definition of nice than I do.

> Please take a look at that discussion. I do not think it's worth the effort to put this into
> cpusets. cpu_isolated_map is very clean and simple concept and integrates nicely with the
> rest of the cpu maps. ie It's very much the same concept and API as cpu_online_map, etc.

I'm thinking cpu_isolated_map is a very dirty hack.

> If we want to integrate this stuff with cpusets I think the best approach would be is to
> have cpusets update the cpu_isolated_map just like it currently updates scheduler domains.
>
> > CPU-sets can already isolate cpus by either creating a cpu outside of any set,
> > or a set with a single cpu not shared by any other sets.
> This only works for user-space. As I mentioned about for full CPU isolation various kernel
> subsystems need to be aware of that CPUs are isolated in order to avoid activities on them.

Yes, hence the proposed system flag to handle the kernel bits like
unbounded kernel threads and IRQs.

> > This also allows for isolated groups, there are good reasons to isolate groups,
> > esp. now that we have a stronger RT balancer. SMP and hard RT are not
> > exclusive. A design that does not take that into account is too rigid.

> You're thinking scheduling only. Paul had the same confusion ;-)

I'm not, I'm thinking it ought to allow for it.

> As I explained before I'm redefining (or proposing to redefine) CPU isolation to
>
> 1. Isolated CPU(s) must not be subject to scheduler load balancing
> Users must explicitly bind threads in order to run on those CPU(s).

I'm thinking that should be optional, load-balancing might make sense
under some scenarios.

> 2. By default interrupts must not be routed to the isolated CPU(s)
> User must route interrupts (if any) to those CPUs explicitly.

That is what I allowed for by the system flag.

> 3. In general kernel subsystems must avoid activity on the isolated CPU(s) as much as possible
> Includes workqueues, per CPU threads, etc.
> This feature is configurable and is disabled by default.

How am I refuting any of these points? I'm very clear on what you want,
I'm just saying I want to get there differently.

> >> cpuisol: Do not route IRQs to the CPUs isolated at boot
> >
> > From the diffstat you're not touching the genirq stuff, but instead hack a
> > single architecture to support this feature. Sounds like an ill designed hack.
> Ah, good point. This patches started before genirq was merged and I did not realize
> that there is a way to set default irq affinity with genirq.
> I'll definitely take a look at that.

I said so before, but good to see you've picked this up. I didn't know
if there was a genirq way to set smp affinities - but if there wasn't
this was a good moment to introduce that.

> > A better approach would be to add a flag to the cpuset infrastructure that says
> > whether its a system set or not. A system set would be one that services the
> > general purpose OS and would include things like the IRQ affinity and unbound
> > kernel threads (including unbound workqueues - or single workqueues). This flag
> > would default to on, and by switching it off for the root set, and a select
> > subset you would push the System away from those cpus, thereby isolating them.
> You're talking about very high level API. I'm totally all for it. What the patches deal
> with is actual low level stuff that is needed to do the "push the system away from those cpus".
> As I mentioned above we can have cpuset update cpu_isolated_map for example.

I'm not wanting to extend cpu_isolated_map - I'm wanting to get rid of
it entirely.

> >> cpuisol: Do not schedule workqueues on the isolated CPUs
> >
> > (per-cpu workqueues, the single ones are treated in the previous section)
> >
> > I still strongly disagree with this approach. Workqueues are passive, they
> > don't do anything unless work is provided to them. By blindly not starting them
> > you handicap the system and services that rely on them.
> Oh boy, back to square one. I covered this already.
> I even started a thread on that and explained what this is and why its needed.
> http://marc.info/?l=linux-kernel&m=120217173001671&w=2
> You guys never replied.

Right, sorry, I did indeed miss that one (might still be in my inbox -
I'm having trouble getting to 0 unread these days).

I'm thinking moving away from flush would be a good idea. Often there is
no need for a full flush but just to wait for completion of all 'your'
worklets - as opposed to all worklets.

I've run into this in -rt in another context but managed to work around
it. It might be time to look at it again.

(fwiw, workqueues in -rt are priority aware)

> You last statement above (about handicap) does not make much sense with respect to the CPU
> isolation. I mean we cannot have it both ways, for isolated CPUs not to run kernel subsystems
> and at the same time say that we must not impact the system. Of course it impacts the system,
> in a predictable and controlled way. Non-isolated CPUs are not affected at all. And app threads
> that need full system services should not be running on the isolated CPUs.

I'm saying we should remove the generation of work, not the execution of
work when it's needed. By keeping the workqueues around you might even
allow routing a single non-custom IRQ to an isolated CPU. One that uses
softirqs for instance.

It's all about allowing other scenarios than the one you need, while
serving your needs equally well.

2008-02-26 02:10:36

by Max Krasnyansky

Subject: Re: [PATCH sched-devel 0/7] CPU isolation extensions

Hi Peter,

Sorry for the delay in replying.

> Please, wrap your emails at 78 - most mailers can do this.
Done.

> On Fri, 2008-02-22 at 14:05 -0800, Max Krasnyanskiy wrote:
>> Peter Zijlstra wrote:
>>> On Thu, 2008-02-21 at 18:38 -0800, Max Krasnyanskiy wrote:
>>>
>
>>>> List of commits
>>>> cpuisol: Make cpu isolation configrable and export isolated map
>>>
>>> cpu_isolated_map was a bad hack when it was introduced, I feel we should
>>> deprecate it and fully integrate the functionality into cpusets. That would
>>> give a much more flexible end-result.
>> That's not not currently possible and will introduce a lot of complexity.
>> I'm pretty sure you missied the discussion I had with Paul (you were cc'ed on that btw).
>> In fact I provided the link to that discussion in the original email.
>> Here it is again:
>> http://marc.info/?l=linux-kernel&m=120180692331461&w=2
>
> I read it, I just firmly disagree.
>
>> Basically the problem is very simple. CPU isolation needs a simple/efficient way to check
>> if CPU N is isolated.
>
> I'm not seeing the need for that outside of setting up the various
> states. That is, once all the affinities are set up, you'd hardly ever
> would (or should - imho) need to know if a particular CPU is isolated or not.
Unless I'm missing something, that's only possible for a very static system.
What I mean is that yes, you could go and set irq affinity, apps affinity,
workqueue thread affinity, etc not to run on the isolated cpus. It works
_until_ something changes, at which point the system needs to know that it's
not supposed to touch CPU N.
For example a new IRQ is registered, a new workqueue is created (fs mounted, new
network interface is created, etc), a new kthread is started, etc.
Sure, we can introduce default affinity masks for irqs, workqueues, etc. But
that's essentially just duplicating cpu_isolated_map.

>> cpuset/cgroup APIs are not designed for that. In other to figure
>> out if a CPU N is isolated one has to iterate through all cpusets and checks their cpu maps.
>> That requires several levels of locking (cgroup and cpuset).
>
>> The other issue is that cpusets are a bit too dynamic (see the thread above for more details)
>> we'd need notified mechanisms to notify subsystems when a CPUs become isolated. Again more
>> complexity. Since I integrated cpu isolation with cpu hotplug it's already addressed in a nice
>> simple way.
>
> I guess you have another definition of nice than I do.
No, not really.
Let's talk specifics. My goal was not to introduce a bunch of new functionality
and rewrite workqueues and stuff; instead I wanted to integrate with existing
mechanisms. CPU maps are used everywhere, and exporting cpu_isolated_map was a
natural way to make other parts of the kernel aware of the isolated CPUs.

>> Please take a look at that discussion. I do not think it's worth the effort to put this into
>> cpusets. cpu_isolated_map is very clean and simple concept and integrates nicely with the
>> rest of the cpu maps. ie It's very much the same concept and API as cpu_online_map, etc.
>
> I'm thinking cpu_isolated_map is a very dirty hack.
>
>> If we want to integrate this stuff with cpusets I think the best approach would be is to
>> have cpusets update the cpu_isolated_map just like it currently updates scheduler domains.
>>
>>> CPU-sets can already isolate cpus by either creating a cpu outside of any set,
>>> or a set with a single cpu not shared by any other sets.
>> This only works for user-space. As I mentioned about for full CPU isolation various kernel
>> subsystems need to be aware of that CPUs are isolated in order to avoid activities on them.
>
> Yes, hence the proposed system flag to handle the kernel bits like
> unbounded kernel threads and IRQs.
I do not see a specific proposal here. The funny part is that we're not even
disagreeing on the high level. Yes, it'd be nice to have such a flag ;-)
But how will the genirq subsystem, for example, be aware of that flag?
ie How would it know that by default it is not supposed to route irqs to the
CPUs in the cpusets with that flag?
As I explained above, setting affinity for existing irqs is not enough.
Same for workqueues or any other subsystem that wants to run per-cpu threads
and stuff.

>>> This also allows for isolated groups, there are good reasons to isolate groups,
>>> esp. now that we have a stronger RT balancer. SMP and hard RT are not
>>> exclusive. A design that does not take that into account is too rigid.
>
>> You're thinking scheduling only. Paul had the same confusion ;-)
>
> I'm not, I'm thinking it ought to allow for it.
One way I can think of to support groups and allow for the RT balancer is
this: make the scheduler ignore cpu_isolated_map and give cpusets full control of
the scheduler domains. Use cpu_isolated_map only for hw irqs and other
kernel sub-systems. That way cpusets could mark cpus in the group as isolated
to get rid of the kernel activity and build a sched domain such that tasks get
balanced in it.
The thing I do not like about it is that there is no way to boot the system
with CPU N isolated from the beginning. Also dynamic isolation currently
relies on the cpu hotplug to clear pending irqs, softirqs, kernel timers and
threads. So cpusets would have to simulate the cpu hotplug event I guess.

>> As I explained before I'm redefining (or proposing to redefine) CPU isolation to
>>
>> 1. Isolated CPU(s) must not be subject to scheduler load balancing
>> Users must explicitly bind threads in order to run on those CPU(s).
>
> I'm thinking that should be optional, load-balancing might make sense
> under some scenarios.
It's still optional. It's the sane default though. I mean users are welcome to
change the affinity mask of an active IRQ.
btw You definitely want to start with that as the default for complete isolation,
because some drivers, E1000 for example, schedule timers and stuff on the CPU
they handle the IRQ on. The timer gets restarted from then on (until the iface is
brought down). Which means that the timer sticks with the CPU even if the IRQ is
moved.
Please do not tell me that I need to fix E1000 instead ;-). It's just an
example. I'm sure there are other drivers and friends. For full isolation
you want to bring the cpu down to get everything off of it and then bring it
up such that things are not started on it by default. Only then do you get a
truly clean isolated state.
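
Roughly, the dynamic isolation path looks like this (a sketch with illustrative
names, not the exact code from the patches):

/* Isolating a CPU at run time rides on the hotplug path, so pending irqs,
 * softirqs, timers and kthreads are flushed off the CPU before it is marked
 * isolated and brought back up. */
static ssize_t store_isolated(struct sys_device *dev, const char *buf,
                              size_t count)
{
        int cpu = dev->id;
        int isolate = simple_strtoul(buf, NULL, 10);
        int ret;

        ret = cpu_down(cpu);    /* everything gets migrated/cleared here */
        if (ret)
                return ret;

        if (isolate)
                cpu_set(cpu, cpu_isolated_map);
        else
                cpu_clear(cpu, cpu_isolated_map);

        ret = cpu_up(cpu);      /* comes back up with nothing running on it */
        return ret ? ret : count;
}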

>> 2. By default interrupts must not be routed to the isolated CPU(s)
>> User must route interrupts (if any) to those CPUs explicitly.
>
> That is what I allowed for by the system flag.
>
>> 3. In general kernel subsystems must avoid activity on the isolated CPU(s) as much as possible
>> Includes workqueues, per CPU threads, etc.
>> This feature is configurable and is disabled by default.
>
> How am I refuting any of these points? I'm very clear on what you want,
> I'm just saying I want to get there differently.
I guess maybe you're not refuting them. But you're not giving me concrete
examples of how the low level stuff would work. Let's talk specifics. I'm not
stuck on the cpu_isolated_map; if there are better options I'm all for it.
I'm willing to prototype it and see if it works for my use case. But I need
more details than "system flag in cpuset".

>>>> cpuisol: Do not route IRQs to the CPUs isolated at boot
>>> From the diffstat you're not touching the genirq stuff, but instead hack a
>>> single architecture to support this feature. Sounds like an ill designed hack.
>> Ah, good point. This patches started before genirq was merged and I did not realize
>> that there is a way to set default irq affinity with genirq.
>> I'll definitely take a look at that.
>
> I said so before, but good to see you've picked this up. I didn't know
> if there was a genirq way to set smp affinities - but if there wasn't
> this was a good moment to introduce that.
"Default" affinity was the keyword here. Genirq definitely deals with
affinities in general. Anyway, as you saw I rewrote it on top of genirq.
Waiting for review feedback from Thomas.
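
In other words, something along these lines (a sketch only; whether the mask is
called irq_default_affinity or something else depends on what genirq ends up
exposing):

extern cpumask_t irq_default_affinity;  /* assumed name */

/* At boot, drop the isolated CPUs from the mask genirq uses as the default
 * affinity for newly registered interrupts. */
static int __init isolated_cpus_irq_init(void)
{
        cpus_andnot(irq_default_affinity, cpu_possible_map, cpu_isolated_map);
        return 0;
}
core_initcall(isolated_cpus_irq_init);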

>>> A better approach would be to add a flag to the cpuset infrastructure that says
>>> whether its a system set or not. A system set would be one that services the
>>> general purpose OS and would include things like the IRQ affinity and unbound
>>> kernel threads (including unbound workqueues - or single workqueues). This flag
>>> would default to on, and by switching it off for the root set, and a select
>>> subset you would push the System away from those cpus, thereby isolating them.
>> You're talking about very high level API. I'm totally all for it. What the patches deal
>> with is actual low level stuff that is needed to do the "push the system away from those cpus".
>> As I mentioned above we can have cpuset update cpu_isolated_map for example.
>
> I'm not wanting to extend cpu_isolated_map - I'm wanting to get rid of
> it entirely.
Please be specific. Let's say we get rid of the cpu_isolated_map. How would
that "system" flag be propagated to the lowest layers like genirq, workqueue,
etc?

>>>> cpuisol: Do not schedule workqueues on the isolated CPUs
>>>
>>> (per-cpu workqueues, the single ones are treated in the previous section)
>>>
>>> I still strongly disagree with this approach. Workqueues are passive, they
>>> don't do anything unless work is provided to them. By blindly not starting them
>>> you handicap the system and services that rely on them.
>> Oh boy, back to square one. I covered this already.
>> I even started a thread on that and explained what this is and why its needed.
>> http://marc.info/?l=linux-kernel&m=120217173001671&w=2
>> You guys never replied.
>
> Right, sorry, I did indeed miss that one (might still be in my inbox -
> I'm having trouble getting to 0 unread these days).
>
> I'm thinking moving away from flush would be a good idea. Often there is
> no need for a full flush but just wait for completion of all 'your'
> worklets - as opposed by all worklets.
>
> I've ran into this in -rt in another context but managed to work around
> it. It might be time to look at it again.
I think there were other cases besides flushing that wanted all threads to
report back. But I do not remember the exact scenario off the top of my head.

> (fwiw, workqueues in -rt are priority aware)
That does not help me because I need full isolation.

>> You last statement above (about handicap) does not make much sense with respect to the CPU
>> isolation. I mean we cannot have it both ways, for isolated CPUs not to run kernel subsystems
>> and at the same time say that we must not impact the system. Of course it impacts the system,
>> in a predictable and controlled way. Non-isolated CPUs are not affected at all. And app threads
>> that need full system services should not be running on the isolated CPUs.
>
> I'm saying we should remove the generation of work, not the execution of
> work when its needed. By keeping the workqueues around you might even
> allow routing a single non custom IRQ to an isolated CPU. One that uses
> softirqs for instance.
>
> Its all about allowing other scenarios than the one you need, while
> serving your needs equally well.
I agree in general. From my experience, though, it is kind of hard and the
system might be a bit fragile.
Also, I can give you another example. In some cases it's preferable if
workqueues and stuff triggered as the result of syscalls on the isolated
cpus ran on the non-isolated cpus. Same for the IRQ. Workqueues are generally
a background activity. It may make sense to handle the IRQ/softirq on the isolated
CPU and off-load the rest to the non-isolated one, in which case you'd want
exactly what my patch does, ie pro-active rerouting of work scheduled on the
isolated CPU.

To sum it up, let's talk details.
How would the "system" flag you suggested be propagated to genirq,
workqueues, etc? How do we avoid work being scheduled on those CPUs?

Max