2003-05-19 21:56:04

by Dave Hansen

Subject: userspace irq balancer

On Mon, 2003-05-19 at 13:14, Badari Pulavarty wrote:
> ---------- Forwarded Message ----------
>
> Subject: Re: 2.5 closeout
> Date: Mon, 19 May 2003 11:40:05 -0700
> From: Andrew Morton <[email protected]>
> To: Badari Pulavarty <[email protected]>
>
> The attribution wasn't very accurate. That's me, Arjan, perhaps
> Jeff Garzik.
>
> The problem is that kirqd does the wrong thing for some workloads (packet
> forwarding mainly - routing). We'll never get that right, so we'd like to
> deprecate the in-kernel IRQ balancer and merge Arjan's user-space balancer
> into the kernel tree instead.
>
> http://people.redhat.com/~arjanv/irqbalance/

The only thing I'm concerned about is how it's going to be packaged.
I'm envisioning explaining how to get the daemon out of its initrd
image, set it up and run it, especially before distros have it
integrated. The stuff that's in the kernel now isn't horribly broken;
it's just not optimal for some relatively unusual cases.

Can we leave the current code as-is, and make the added intelligence
from the userspace one an optional thing for those with unusual setups?
--
Dave Hansen
[email protected]


2003-05-19 21:58:16

by Arjan van de Ven

Subject: Re: userspace irq balancer

On Mon, May 19, 2003 at 03:07:36PM -0700, Dave Hansen wrote:

> The only thing I'm concerned about is how it's going to be packaged.
> I'm envisioning explaining how to get the daemon out of its initrd
> image, set it up and run it, especially before distros have it
> integrated. The stuff that's in the kernel now isn't horribly broken;
> it's just not optimal for some relatively unusual cases.

as for distros: RHL8 and later ship with it on the RH side
(default enabled as of RHL9).

As for where to start it: I really think an initscript is the logical
place; there has been some discussion about doing it
from the initramfs, but I don't see a real benefit from that; from starting
init to running the initscripts isn't exactly that interrupt/performance
heavy.

2003-05-19 22:10:50

by Dave Hansen

Subject: Re: userspace irq balancer

On Mon, 2003-05-19 at 15:11, Arjan van de Ven wrote:
> On Mon, May 19, 2003 at 03:07:36PM -0700, Dave Hansen wrote:
>
> > The only thing I'm concerned about is how it's going to be packaged.
> > I'm envisioning explaining how to get the daemon out of its initrd
> > image, set it up and run it, especially before distros have it
> > integrated. The stuff that's in the kernel now isn't horribly broken;
> > it's just not optimal for some relatively unusual cases.
>
> as for distros: RHL8 and later ship with it on the RH side
> (default enabled as of RHL9).

But, do you see the need for ripping out the current code? For those of
us that are still running a slightly more primitive distro, it would be
nice to have some pretty effective default behavior, like what is in the
kernel now.

> As for where to start it: I really think an initscript is the logical
> place; there has been some discussion about doing it
> from the initramfs, but I don't see a real benefit from that; from starting
> init to running the initscripts isn't exactly that interrupt/performance
> heavy.

Yeah, I don't think we need it the second the kernel boots :) Do you
really think this is a 2.6 showstopper? Since it will require distro
cooperation anyway, and those are many months from releasing a 2.6
distro, do we really need it in place for 2.6.0?

--
Dave Hansen
[email protected]

2003-05-20 03:12:44

by David Miller

Subject: Re: userspace irq balancer

On Mon, 2003-05-19 at 15:22, Dave Hansen wrote:
> > as for distros: RHL8 and later ship with it on the RH side
> > (default enabled as of RHL9).
>
> But, do you see the need for ripping out the current code? For those of
> us that are still running a slightly more primitive distro, it would be
> nice to have some pretty effective default behavior, like what is in the
> kernel now.

You have to install new modutils to even use modules with the 2.5.x
kernel; given that, why are we even talking about the "inconvenience"
of installing the usermode IRQ balancer as being a blocker for
ripping out the in-kernel stuff?

The in-kernel stuff MUST go. It went in because "some benchmark went
faster", but with no "why" describing why it might have improved
performance. We KNOW it absolutely sucks for routing and firewall
applications. The in-kernel bits were all a shaman's dance, with zero
technical "here is why this makes things go faster" description
attached. If I remember properly, the changelog message when the
in-kernel irq balancing went in was of the form "this makes some
specweb run go faster".

--
David S. Miller <[email protected]>

2003-05-20 03:33:34

by William Lee Irwin III

Subject: Re: userspace irq balancer

On Mon, May 19, 2003 at 08:25:31PM -0700, David S. Miller wrote:
> The in-kernel stuff MUST go. It went in because "some benchmark went
> faster", but with no "why" describing why it might have improved
> performance. We KNOW it absolutely sucks for routing and firewall
> applications. The in-kernel bits were all a shaman's dance, with zero
> technical "here is why this makes things go faster" description
> attached. If I remember properly, the changelog message when the
> in-kernel irq balancing went in was of the form "this makes some
> specweb run go faster".

Absolutely. Not to mention the code for the in-kernel algorithm has
historically broken i386 ports using certain modes of Intel's
interrupt controllers.

Far better would be to validate that the affinity specified is feasible
to program into the interrupt controller in a system call and leave the
algorithm to userspace.


-- wli

2003-05-20 04:52:49

by Dave Hansen

Subject: Re: userspace irq balancer

On Mon, 2003-05-19 at 20:46, William Lee Irwin III wrote:
> On Mon, May 19, 2003 at 08:25:31PM -0700, David S. Miller wrote:
> > The in-kernel stuff MUST go. It went in because "some benchmark went
> > faster", but with no "why" describing why it might have improved
> > performance. We KNOW it absolutely sucks for routing and firewall
> > applications. The in-kernel bits were all a shaman's dance, with zero
> > technical "here is why this makes things go faster" description
> > attached. If I remember properly, the changelog message when the
> > in-kernel irq balancing went in was of the form "this makes some
> > specweb run go faster".
>
> Absolutely. Not to mention the code for the in-kernel algorithm has
> historically broken i386 ports using certain modes of Intel's
> interrupt controllers.

OK, I just went and actually looked at the code again. After
suppressing my gag reflex, I started to remember all of the problems
we've had with it, including fixing it for Intel's own clustered APIC
mode.

Does anyone have a patch to tear it out already? Is the current proc
interface acceptable, or do we want a syscall interface like wli
suggests?

--
Dave Hansen
[email protected]

2003-05-20 05:40:27

by Martin J. Bligh

Subject: Re: userspace irq balancer

--Dave Hansen <[email protected]> wrote (on Monday, May 19, 2003 22:03:50 -0700):

> On Mon, 2003-05-19 at 20:46, William Lee Irwin III wrote:
>> On Mon, May 19, 2003 at 08:25:31PM -0700, David S. Miller wrote:
>> > The in-kernel stuff MUST go. It went in because "some benchmark went
>> > faster", but with no "why" describing why it might have improved
>> > performance. We KNOW it absolutely sucks for routing and firewall
>> > applications. The in-kernel bits were all a shaman's dance, with zero
>> > technical "here is why this makes things go faster" description
>> > attached. If I remember properly, the changelog message when the
>> > in-kernel irq balancing went in was of the form "this makes some
>> > specweb run go faster".
>>
>> Absolutely. Not to mention the code for the in-kernel algorithm has
>> historically broken i386 ports using certain modes of Intel's
>> interrupt controllers.
>
> OK, I just went and actually looked at the code again. After
> suppressing my gag reflex, I started to remember all of the problems
> we've had with it, including fixing it for Intel's own clustered APIC
> mode.
>
> Does anyone have a patch to tear it out already? Is the current proc
> interface acceptable, or do we want a syscall interface like wli
> suggests?

I have no frigging idea why you'd want to tear something out that works
well already, and has a shitload of work put into it.

Make it a config option if you don't like it, Keith has a patch to do
that already - it's trivial. That way everyone can have what they want.

M.

2003-05-20 06:01:41

by David Miller

Subject: Re: userspace irq balancer

From: "Martin J. Bligh" <[email protected]>
Date: Mon, 19 May 2003 22:53:11 -0700

I have no frigging idea why you'd want to tear something out that
works well already, and has a shitload of work put into it.

It's pretty fundamentally broken for having had so much work
put into it. Show me something other than "SpecWEB run for IBM
ran faster" as a reason for keeping this code in there. Can you
even do this?

It is crap, from the very beginning, would you like to know why?

How does the in-kernel IRQ load balancing measure "load" and
"busyness"? Herein lies the most absolutely fundamental problem with
this code, it fails to recognize that we end up with most of our
networking "load" from softint context.

We can process thousands of packets for one hardware interrupt. Are
you able to comprehend this?

Measuring hardware interrupts in some way as "load" is about
as far from the truth as you can get.

This is just the tip of the iceberg.

rm -rf in-kernel-irqbalance;

And hey, if _YOU_ want a broken system which uses this bogus algorithm,
YOU CAN DO THIS with the userland thing if you want.

2003-05-20 06:25:38

by Dave Hansen

Subject: Re: userspace irq balancer

On Mon, 2003-05-19 at 23:13, David S. Miller wrote:
> From: "Martin J. Bligh" <[email protected]>
> Date: Mon, 19 May 2003 22:53:11 -0700
>
> I have no frigging idea why you'd want to tear something out that
> works well already, and has a shitload of work put into it.
>
> It's pretty fundamentally broken for having had so much work
> put into it. Show me something other than "SpecWEB run for IBM
> ran faster" as a reason for keeping this code in there. Can you
> even do this?

I don't even think we can do that. That code was being integrated
around the same time that our Specweb setup decided to go south on us
and start physically frying itself. We never got a chance to run it.
BTW, I don't think there are any other kernel developers running Specweb
on 2.5 kernels. If there are, please speak up!

Andrew Theurer posted some positive results here, which were quite
marginal in the case with one NIC, and 4.7% with two.
http://marc.theaimsgroup.com/?l=linux-kernel&m=104212930819212&w=2


--
Dave Hansen
[email protected]

2003-05-20 06:29:18

by David Miller

Subject: Re: userspace irq balancer

From: Dave Hansen <[email protected]>
Date: 19 May 2003 23:36:23 -0700

I don't even think we can do that. That code was being integrated
around the same time that our Specweb setup decided to go south on us
and start physically frying itself.

This gets more amusing by the second. Let's kill this code
already. People who like the current algorithms can push
them into the userspace solution.

2003-05-20 08:47:35

by Arjan van de Ven

Subject: Re: userspace irq balancer

On Mon, May 19, 2003 at 10:03:50PM -0700, Dave Hansen wrote:
> Does anyone have a patch to tear it out already? Is the current proc
> interface acceptable, or do we want a syscall interface like wli
> suggests?

I have no problems with the proc interface; it's ascii, so it's reasonably
extensible in the future for, say, when 64 cpus on
32 bit linux get supported. It's also not THAT inefficient, since my code
only uses it when some binding changes, not all the time.
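The ascii interface in question is the hex CPU bitmask read and written through /proc/irq/N/smp_affinity; later kernels split masks wider than 32 bits into comma-separated 32-bit words. A rough sketch of that encoding, in Python purely for illustration:

```python
def format_mask(cpus, nr_cpus=64):
    """Build the hex bitmask string for a set of CPU numbers.
    32-bit words are comma-separated, most significant word first."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    words = [(mask >> i) & 0xFFFFFFFF for i in range(0, nr_cpus, 32)]
    return ",".join("%08x" % w for w in reversed(words))

def parse_mask(text):
    """Parse the comma-separated hex mask back into a set of CPU numbers."""
    mask = 0
    for word in text.strip().split(","):
        mask = (mask << 32) | int(word, 16)
    return {cpu for cpu in range(mask.bit_length()) if mask & (1 << cpu)}
```

Because the format is plain text, supporting more CPUs only means emitting more words, which is the extensibility argument being made here.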

2003-05-20 09:01:42

by William Lee Irwin III

Subject: Re: userspace irq balancer

On Mon, May 19, 2003 at 10:03:50PM -0700, Dave Hansen wrote:
>> Does anyone have a patch to tear it out already? Is the current proc
>> interface acceptable, or do we want a syscall interface like wli
>> suggests?

On Tue, May 20, 2003 at 09:00:18AM +0000, Arjan van de Ven wrote:
> I have no problems with the proc interface; it's ascii so reasonably
> extendible in the future for, say, when 64 cpus on
> 32 bit linux get supported. It's also not THAT inefficient since my code
> only uses it when some binding changes, not all the time.

Sorry about that; I forgot about the /proc/ part and thought the thing
was based on system calls as it stood. I wouldn't want a redundant
interface to be added.

My current cpumask_t patches handle extending the /proc/ interface to
handle an arbitrary-sized cpumask, so I should have realized this.


-- wli

2003-05-20 09:01:56

by Andrew Morton

Subject: Re: userspace irq balancer

Arjan van de Ven <[email protected]> wrote:
>
> On Mon, May 19, 2003 at 10:03:50PM -0700, Dave Hansen wrote:
> > Does anyone have a patch to tear it out already? Is the current proc
> > interface acceptable, or do we want a syscall interface like wli
> > suggests?
>
> I have no problems with the proc interface; it's ascii so reasonably
> extendible in the future for, say, when 64 cpus on
> 32 bit linux get supported. It's also not THAT inefficient since my code
> only uses it when some binding changes, not all the time.

Concerns have been expressed that the /proc interface may be a bit racy.
One thing we do need to do is to write a /proc stresstest tool which pokes
numbers into the /proc files at high rates, and run that under traffic for a
few hours.

There is no need to pull out the existing balancer until the userspace
solution is proven - it can be turned off with `noirqbalance' until that
work has been performed.

Nobody has tried improving the current balancer. From a quick look it
appears that it could work reasonably for the problematic packet-forwarding
workload if the when-to-start-balancing threshold is reduced from 1000/sec
to (say) 10/sec. Don't know - I've never seen a description of how the
algorithm should be improved.



2003-05-20 13:48:57

by Martin J. Bligh

Subject: Re: userspace irq balancer

> How does the in-kernel IRQ load balancing measure "load" and
> "busyness"? Herein lies the most absolutely fundamental problem with
> this code, it fails to recognize that we end up with most of our
> networking "load" from softint context.

OK, that's a great observation, and probably fixable. What were the
author's comments when you told him that?

> rm -rf in-kernel-irqbalance;

It's *very* late in the day to be ripping out such chunks of code.
1. Prove new code works better for you => make it a config option.
2. Prove new code works better for everyone => rip it out.

I think we're at 1, not 2.

Note that the userspace stuff doesn't even require that the kernel
stuff be disabled ... it should just override it (I can believe
there may be a bug that needs fixing, but it works by design).

M.

2003-05-20 13:50:12

by Andrew Theurer

Subject: Re: userspace irq balancer

On Tuesday 20 May 2003 01:40, David S. Miller wrote:
> From: Dave Hansen <[email protected]>
> Date: 19 May 2003 23:36:23 -0700
>
> I don't even think we can do that. That code was being integrated
> around the same time that our Specweb setup decided to go south on us
> and start physically frying itself.
>
> This gets more amusing by the second. Let's kill this code
> already. People who like the current algorithms can push
> them into the userspace solution.

Remember, this all started with some idea of "fairness" among cpus and had
very little to do with performance, particularly on P4 with HT, where the
first logical cpu got all the ints and tasks running on that cpu were slower
than on other cpus. This was in most cases the highest performing situation,
-but- it was unfair to the tasks running on cpu0. irq_balance fixed this with
a random target cpu that was in theory supposed to change rarely enough to
preserve cache warmth. In practice the target cpus changed too often, which
thrashed the cache, and the HW overhead of changing the destination that
often was way too high.

Although kirq was a step in the right direction (compared to irq_balance),
I'd rather see it in user space in the long term, too. That way we can make
policy changes much, much more easily. IMO, networking performance was always
better with all net card ints going to only one cpu, -until- that cpu was
saturated. This saturation point can come much sooner with HT, since the core
is shared, and as far as I know there is no way to bias the core toward the
one sibling handling ints when int load is high. The only thing better than
all ints to cpu0 is aligning irq and process affinity together, which is 99%
unrealistic for actual workloads.

Now, if someone can figure out how/when the first cpu is saturated, and
measure int load properly, maybe we can have a policy that keeps all ints on
cpu0, spills some ints to another cpu when the first cpu is saturated, -and-
modifies find_busiest_queue to compensate nr_running on cpus with high int
load, to make process balancing more fair.
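The keep-ints-on-cpu0-until-saturated policy could start out as simply as this hypothetical helper (the threshold and the per-CPU load accounting are assumptions for illustration, not anything in the kernel):

```python
def pick_target_cpu(int_load, threshold=0.8):
    """Given per-CPU interrupt load fractions (0.0-1.0), return the first
    CPU with headroom; stay on cpu0 until it saturates, then spill over."""
    for cpu, load in enumerate(int_load):
        if load < threshold:
            return cpu
    # every CPU saturated: fall back to the least loaded one
    return min(range(len(int_load)), key=lambda c: int_load[c])
```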

If kirq gets ripped out, at least have some default policy that is somewhat
harmless, like destination cpu = int_number % nr_cpus. I think SuSE 8 had
this, and it performed reasonably well.
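That static default is trivial to express; a sketch for illustration:

```python
def default_irq_cpu(irq, nr_cpus):
    """Harmless static default: IRQ number modulo CPU count."""
    return irq % nr_cpus

def spread(irqs, nr_cpus):
    """Show where a list of IRQ numbers lands under this policy."""
    placement = {}
    for irq in irqs:
        placement.setdefault(default_irq_cpu(irq, nr_cpus), []).append(irq)
    return placement
```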

-Andrew Theurer

2003-05-20 14:09:00

by Jeff Garzik

Subject: Re: userspace irq balancer

On Tue, May 20, 2003 at 09:07:41AM -0500, Andrew Theurer wrote:
> On Tuesday 20 May 2003 01:40, David S. Miller wrote:
> > From: Dave Hansen <[email protected]>
> > Date: 19 May 2003 23:36:23 -0700
> >
> > I don't even think we can do that. That code was being integrated
> > around the same time that our Specweb setup decided to go south on us
> > and start physically frying itself.
> >
> > This gets more amusing by the second. Let's kill this code
> > already. People who like the current algorithms can push
> > them into the userspace solution.
>
> Remember this all started with some idea of "fairness" among cpus and very
> little to do with performance. particularly on P4 with HT, where the first
> logical cpu got all the ints and tasks running on that cpu were slower than
> other cpus. This was in most cases the highest performing situation, -but-
> it was unfair to the tasks running on cpu0. irq_balance fixed this with a
> random target cpu that was in theory supposed to not change often enough to
> preserve cache warmth. In practice is the target cpus changed too often
> which thrashed cache and the HW overhead of changing the destination that
> often was way way to high.

You call that a fix? ;-) I call that working around a bug.

If tasks run slower on cpuX than cpuY because of a heavier int load,
that's the fault of the scheduler not the irqbalancer, be it in-kernel
or userspace. If there's a lesser-utilized cpu the task needs to be
migrated to that cpu from the irq-loaded one, when CPU accounting
notices the kernel interrupt handling having an impact.

Jeff


2003-05-20 14:24:07

by Andrew Theurer

Subject: Re: userspace irq balancer

On Tuesday 20 May 2003 09:21, Jeff Garzik wrote:
> On Tue, May 20, 2003 at 09:07:41AM -0500, Andrew Theurer wrote:
> > On Tuesday 20 May 2003 01:40, David S. Miller wrote:
> > > From: Dave Hansen <[email protected]>
> > > Date: 19 May 2003 23:36:23 -0700
> > >
> > > I don't even think we can do that. That code was being integrated
> > > around the same time that our Specweb setup decided to go south on
> > > us and start physically frying itself.
> > >
> > > This gets more amusing by the second. Let's kill this code
> > > already. People who like the current algorithms can push
> > > them into the userspace solution.
> >
> > Remember this all started with some idea of "fairness" among cpus and
> > very little to do with performance. particularly on P4 with HT, where
> > the first logical cpu got all the ints and tasks running on that cpu were
> > slower than other cpus. This was in most cases the highest performing
> > situation, -but- it was unfair to the tasks running on cpu0. irq_balance
> > fixed this with a random target cpu that was in theory supposed to not
> > change often enough to preserve cache warmth. In practice is the target
> > cpus changed too often which thrashed cache and the HW overhead of
> > changing the destination that often was way way to high.
>
> You call that a fix? ;-) I call that working around a bug.
>
> If tasks run slower on cpuX than cpuY because of a heavier int load,
> that's the fault of the scheduler not the irqbalancer, be it in-kernel
> or userspace. If there's a lesser-utilized cpu the task needs to be
> migrated to that cpu from the irq-loaded one, when CPU accounting
> notices the kernel interrupt handling having an impact.

On paper it sounds good but it may be more difficult (and not perfect) in
practice. For example, if you have one runnable task per cpu, moving it to
another cpu would not help. To preserve fairness, all you could do is swap
the tasks around, which would thrash the cache pretty well. Also, the int
load may be very dynamic, so you need to be really careful and make sure you
are measuring sustained int load.

However I do agree, we need something in the scheduler, in particular
something to modify max_load and load in find_busiest_queue, based on an int
load average for those cpus.

-Andrew

2003-05-20 16:14:57

by Nakajima, Jun

Subject: RE: userspace irq balancer

The in-kernel load balancer does not move IRQs that are bound to particular CPUs. If the user-level one can do it better, just set the affinity, and I believe that is the current implementation.

The in-kernel one is simply trying to emulate functionality in the chipset, thus it's not so intelligent, of course. The major reason we need to do it in software is that x86 Linux does not update TPR(s).

Thanks,
Jun

> -----Original Message-----
> From: Martin J. Bligh [mailto:[email protected]]
> Sent: Tuesday, May 20, 2003 7:01 AM
> To: David S. Miller
> Cc: [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]
> Subject: Re: userspace irq balancer
>
> > How does the in-kernel IRQ load balancing measure "load" and
> > "busyness"? Herein lies the most absolutely fundamental problem with
> > this code, it fails to recognize that we end up with most of our
> > networking "load" from softint context.
>
> OK, that's a great observation, and probably fixable. What were the
> author's comments when you told him that?
>
> > rm -rf in-kernel-irqbalance;
>
> It's *very* late in the day to be ripping out such chunks of code.
> 1. Prove new code works better for you => make it a config option.
> 2. Prove new code works better for everyone => rip it out.
>
> I think we're at 1, not 2.
>
> Note that the userspace stuff doesn't even require that the kernel
> stuff be disabled ... it should just override it (I can believe
> there maybe is a bug that needs fixing, but it works by design).
>
> M.
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2003-05-21 10:47:08

by Kai Bankett

Subject: Re: userspace irq balancer

David S. Miller wrote:

> From: Dave Hansen <[email protected]>
> Date: 19 May 2003 23:36:23 -0700
>
> I don't even think we can do that. That code was being integrated
> around the same time that our Specweb setup decided to go south on us
> and start physically frying itself.
>
>This gets more amusing by the second. Let's kill this code
>already. People who like the current algorithms can push
>them into the userspace solution.
>
>
>
That's also my feeling.
After more and more performance loss (at least for my
system/usage) and no really good solution in view to fix the problems at
this location in the kernel, I would prefer to get this thing ripped out
as soon as possible.
By now I do not see any gain from that piece of code (relative to
process/interrupt distribution). And please don't show me benchmarks - I
know what I feel.
Userland would be, first of all, the right place to do any further testing.
And please - don't tell me it's that complicated to remove this piece of
code from the source tree.

Kai.



2003-05-21 13:50:55

by James Cleverdon

Subject: Re: userspace irq balancer

On Tuesday 20 May 2003 08:41 am, Nakajima, Jun wrote:
> The in-kernel load balance does not move IRQs that are bound to particular
> CPUs. If the user-level can do it better, just set affinity, and I believe
> that is the current implementation.
>
> The in-kernel one is simply trying to emulate functionality in the chipset,
> thus it's not so intelligent, of course. The major reason we need to do it
> in software is that x86 Linux does not update TPR(s).
>
> Thanks,
> Jun

It may be time to think about using the TPRs again, and see if HW interrupt
routing helps Arjan's test case. Of course for any system using clustered
APIC mode, we will still need to decide which APIC cluster gets which IRQ....


> > -----Original Message-----
> > From: Martin J. Bligh [mailto:[email protected]]
> > Sent: Tuesday, May 20, 2003 7:01 AM
> > To: David S. Miller
> > Cc: [email protected]; [email protected]; [email protected];
> > [email protected]; [email protected]; [email protected];
> > [email protected]; [email protected]; [email protected];
> > [email protected]
> > Subject: Re: userspace irq balancer
> >
> > > How does the in-kernel IRQ load balancing measure "load" and
> > > "busyness"? Herein lies the most absolutely fundamental problem with
> > > this code, it fails to recognize that we end up with most of our
> > > networking "load" from softint context.
> >
> > OK, that's a great observation, and probably fixable. What were the
> > author's comments when you told him that?
> >
> > > rm -rf in-kernel-irqbalance;
> >
> > It's *very* late in the day to be ripping out such chunks of code.
> > 1. Prove new code works better for you => make it a config option.
> > 2. Prove new code works better for everyone => rip it out.
> >
> > I think we're at 1, not 2.
> >
> > Note that the userspace stuff doesn't even require that the kernel
> > stuff be disabled ... it should just override it (I can believe
> > there maybe is a bug that needs fixing, but it works by design).
> >
> > M.


--
James Cleverdon
IBM xSeries Linux Solutions
{jamesclv(Unix, preferred), cleverdj(Notes)} at us dot ibm dot com

2003-05-21 14:15:03

by James Cleverdon

Subject: Re: userspace irq balancer

On Tuesday 20 May 2003 05:22 pm, David S. Miller wrote:
> From: Andrew Morton <[email protected]>
> Date: Tue, 20 May 2003 02:17:12 -0700
>
> Concerns have been expressed that the /proc interface may be a bit racy.
> One thing we do need to do is to write a /proc stresstest tool which
> pokes numbers into the /proc files at high rates, run that under traffic
> for a few hours.
>
> This issue is 100% independent of whether policy belongs in the
> kernel or not. Also, the /proc race problem exists and should be
> fixed regardless.
>
> Nobody has tried improving the current balancer.
>
> Policy does not belong in the kernel. I don't care what algorithm
> people decide to use, but such decisions do NOT belong in the kernel.

You keep saying that, but suppose I want to try HW IRQ balancing using the TPR
registers. How could I do that from userspace? And if I could, wouldn't the
benefit of real time IRQ routing be lost?

It seems to me that only long term interrupt policy can be done from userland.
Anything that does fast responses to fluctuating load must be inside the
kernel.

At the moment we don't do any fast IRQ policy. Even the original irq_balance
only looked for idle CPUs after an interrupt was serviced. However, suppose
you had a P4 with hyperthreading turned on. If an IRQ is to be delivered to
the main thread but it is busy and its sibling is idle, why shouldn't we
deliver the interrupt to the idle sibling? They both share the same caches,
etc, so cache warmth isn't a problem.

--
James Cleverdon
IBM xSeries Linux Solutions
{jamesclv(Unix, preferred), cleverdj(Notes)} at us dot ibm dot com

2003-05-21 14:45:31

by Martin J. Bligh

Subject: Re: userspace irq balancer

> There is zero reason why IRQ balancing should be in any way
> different. It's POLICY, and POLICY == USERSPACE. It is the end
> of the argument.

Despite whatever political wrangling there is between userspace and
kernelspace implementations (and some very valid points about other
arches), there is still a dearth of testing, as far as I can see.

I can't see anything wrong with making it a config option for now,
and letting people choose what they want to do, until we have more
information as to which performs better under a variety of workloads.
That seems the most pragmatic way forward.

M.

2003-05-21 16:18:33

by James Bottomley

Subject: Re: userspace irq balancer

I'm interested in using this for voyager. However, I have a problem in
that voyager may have CPUs that can't accept interrupts (this is global
on voyager, but may be per-interrupt on NUMA-like systems). I think
before we move to a userspace solution, some thought about how to cope
with this is needed.

I have several suggestions:

1. Place the masks into /proc/irq/<n>/smp_affinity at start of day and
have the userspace irqbalancer take this as the maximal mask

2. Have a separate file /proc/irq/<n>/mask(?) to expose the mask always

3. Some other method...

Comments would be welcome
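Whichever file exposes it, the userspace balancer would clamp any affinity it computes against that maximal mask before writing it back; a minimal sketch (plain integers as bitmasks, names hypothetical):

```python
def effective_affinity(requested, hw_mask):
    """Clamp a requested CPU bitmask to the CPUs the hardware can actually
    target; fall back to the full hardware mask if nothing overlaps."""
    allowed = requested & hw_mask
    return allowed if allowed else hw_mask
```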

James


2003-05-21 20:03:04

by Arjan van de Ven

Subject: Re: userspace irq balancer

On Wed, 2003-05-21 at 18:31, James Bottomley wrote:
> I'm interested in using this for voyager. However, I have a problem in
> that voyager may have CPUs that can't accept interrupts (this is global
> on voyager, but may be per-interrupt on NUMA like systems). I think
> before we move to a userspace solution, some thought about how to cope
> with this is needed.
>
> I have several suggestions:
>
> 1. Place the masks into /proc/irq/<n>/smp_affinity at start of day and
> have the userspace irqbalancer take this as the maximal mask
>
> 2. Have a separate file /proc/irq/<n>/mask(?) to expose the mask always
>
> 3. Some other method...

I would prefer the second method.



2003-05-21 21:30:23

by Nakajima, Jun

Subject: RE: userspace irq balancer

Again, since the userland is using /proc/irq/N/smp_affinity, the in-kernel one won't touch whatever settings are made by the userland. So I don't think we have issues here - if the userland has more knowledge, then it simply uses binding. If not, use the generic but dumb one in the kernel. Same thing as scheduling. If the dumb one has a critical problem, we should fix it.

At the same time, I don't believe a single almighty userland policy exists, either. One might need to write or modify his program to do the best for the system anyway. Or a very simple script might just work fine.

Jun
> -----Original Message-----
> From: James Cleverdon [mailto:[email protected]]
> Sent: Wednesday, May 21, 2003 7:27 AM
> To: David S. Miller; [email protected]
> Cc: [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]
> Subject: Re: userspace irq balancer
>
> On Tuesday 20 May 2003 05:22 pm, David S. Miller wrote:
> > From: Andrew Morton <[email protected]>
> > Date: Tue, 20 May 2003 02:17:12 -0700
> >
> > Concerns have been expressed that the /proc interface may be a bit
> racy.
> > One thing we do need to do is to write a /proc stresstest tool which
> > pokes numbers into the /proc files at high rates, run that under traffic
> > for a few hours.
> >
> > This issue is 100% independent of whether policy belongs in the
> > kernel or not. Also, the /proc race problem exists and should be
> > fixed regardless.
> >
> > Nobody has tried improving the current balancer.
> >
> > Policy does not belong in the kernel. I don't care what algorithm
> > people decide to use, but such decisions do NOT belong in the kernel.
>
> You keep saying that, but suppose I want to try HW IRQ balancing using the
> TPR
> registers. How could I do that from userspace? And if I could, wouldn't
> the
> benefit of real time IRQ routing be lost?
>
> It seems to me that only long term interrupt policy can be done from
> userland.
> Anything that does fast responses to fluctuating load must be inside the
> kernel.
>
> At the moment we don't do any fast IRQ policy. Even the original
> irq_balance
> only looked for idle CPUs after an interrupt was serviced. However,
> suppose
> you had a P4 with hyperthreading turned on. If an IRQ is to be delivered
> to
> the main thread but it is busy and its sibling is idle, why shouldn't we
> deliver the interrupt to the idle sibling? They both share the same
> caches,
> etc, so cache warmth isn't a problem.
>
> --
> James Cleverdon
> IBM xSeries Linux Solutions
> {jamesclv(Unix, preferred), cleverdj(Notes)} at us dot ibm dot com
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2003-05-21 22:43:52

by David Miller

[permalink] [raw]
Subject: Re: userspace irq balancer

From: "Martin J. Bligh" <[email protected]>
Date: Wed, 21 May 2003 07:58:11 -0700

Despite whatever political wrangling there is between userspace and
kernelspace implementations (and some very valid points about other
arches), there is still a dearth of testing, as far as I can see.

I've never in my life heard the argument that we kept something
in the kernel that didn't belong there due to "userland testing".
That's a bogus argument.

When I ripped RARP out of the kernel, we didn't immediately have
a replacement, but one showed up shortly. So what?

And in this case we already have Arjan's stuff. So start testing
his code instead of whining about keeping the current stuff in
the tree.

2003-05-21 22:54:33

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: userspace irq balancer

On Wed, 21 May 2003, James Cleverdon wrote:

> It may be time to think about using the TPRs again, and see if HW interrupt
> routing helps Arjan's test case. Of course for any system using clustered
> APIC mode, we will still need to decide which APIC cluster gets which IRQ....

You can build cpu masks of capable clusters easily, even for NUMAQ

Zwane
--
function.linuxpower.ca

2003-05-22 00:17:33

by Gerrit Huizenga

[permalink] [raw]
Subject: Re: userspace irq balancer

Yeah, I suppose this userland policy change means we should pull
the scheduler policy decisions out of the kernel and write user level
HT, NUMA, SMP and UP schedulers. Also, the IO schedulers should
probably be pulled out - I'm sure AS and CFQ and linus_scheduler
could be user land policies, as well as the elevator. Memory
placement and swapping policies, too.

Oh, wait, some people actually do this - they call it, what,
Workload Management or some such thing. But I don't know any
style of workload management that leaves *no* default, semi-sane
policy in the kernel.

gerrit

On Wed, 21 May 2003 14:43:18 PDT, "Nakajima, Jun" wrote:
> Again, since the userland is using /proc/irq/N/smp_affinity, the
> in-kernel one won't touch whatever settings are done by userland. So
> I don't think we have issues here - if the userland has more knowledge,
> then it simply uses binding. If not, use the generic but dumb one in the
> kernel. Same thing as scheduling. If the dumb one has a critical problem,
> we should fix it.
>
> At the same time, I don't believe a single almighty userland policy
> exists, either. One might need to write or modify his program to do the
> best for the system anyway. Or a very simple script might just work fine.
>
> Jun
> > -----Original Message-----
> > From: James Cleverdon [mailto:[email protected]]
> > Sent: Wednesday, May 21, 2003 7:27 AM
> > To: David S. Miller; [email protected]
> > Cc: [email protected]; [email protected]; [email protected];
> > [email protected]; [email protected]; [email protected];
> > [email protected]; [email protected]
> > Subject: Re: userspace irq balancer
> >
> > On Tuesday 20 May 2003 05:22 pm, David S. Miller wrote:
> > > From: Andrew Morton <[email protected]>
> > > Date: Tue, 20 May 2003 02:17:12 -0700
> > >
> > > Concerns have been expressed that the /proc interface may be a bit
> > racy.
> > > One thing we do need to do is to write a /proc stresstest tool which
> > > pokes numbers into the /proc files at high rates, run that under traffic
> > > for a few hours.
> > >
> > > This issue is 100% independent of whether policy belongs in the
> > > kernel or not. Also, the /proc race problem exists and should be
> > > fixed regardless.
> > >
> > > Nobody has tried improving the current balancer.
> > >
> > > Policy does not belong in the kernel. I don't care what algorithm
> > > people decide to use, but such decisions do NOT belong in the kernel.
> >
> > You keep saying that, but suppose I want to try HW IRQ balancing using the
> > TPR
> > registers. How could I do that from userspace? And if I could, wouldn't
> > the
> > benefit of real time IRQ routing be lost?
> >
> > It seems to me that only long term interrupt policy can be done from
> > userland.
> > Anything that does fast responses to fluctuating load must be inside the
> > kernel.
> >
> > At the moment we don't do any fast IRQ policy. Even the original
> > irq_balance
> > only looked for idle CPUs after an interrupt was serviced. However,
> > suppose
> > you had a P4 with hyperthreading turned on. If an IRQ is to be delivered
> > to
> > the main thread but it is busy and its sibling is idle, why shouldn't we
> > deliver the interrupt to the idle sibling? They both share the same
> > caches,
> > etc, so cache warmth isn't a problem.
> >
> > --
> > James Cleverdon
> > IBM xSeries Linux Solutions
> > {jamesclv(Unix, preferred), cleverdj(Notes)} at us dot ibm dot com
> >
>
>

2003-05-22 01:16:07

by Martin J. Bligh

[permalink] [raw]
Subject: Re: userspace irq balancer

> Yeah, I suppose this userland policy change means we should pull
> the scheduler policy decisions out of the kernel and write user level
> HT, NUMA, SMP and UP schedulers. Also, the IO schedulers should
> probably be pulled out - I'm sure AS and CFQ and linus_scheduler
> could be user land policies, as well as the elevator. Memory
> placement and swapping policies, too.
>
> Oh, wait, some people actually do this - they call it, what,
> Workload Management or some such thing. But I don't know any
> style of workload management that leaves *no* default, semi-sane
> policy in the kernel.

I think the word you're groping for here is "microkernel".

M.

2003-05-22 01:31:51

by Gerrit Huizenga

[permalink] [raw]
Subject: Re: userspace irq balancer

On Wed, 21 May 2003 18:28:56 PDT, "Martin J. Bligh" wrote:
> > Yeah, I suppose this userland policy change means we should pull
> > the scheduler policy decisions out of the kernel and write user level
> > HT, NUMA, SMP and UP schedulers. Also, the IO schedulers should
> > probably be pulled out - I'm sure AS and CFQ and linus_scheduler
> > could be user land policies, as well as the elevator. Memory
> > placement and swapping policies, too.
> >
> > Oh, wait, some people actually do this - they call it, what,
> > Workload Management or some such thing. But I don't know any
> > style of workload management that leaves *no* default, semi-sane
> > policy in the kernel.
>
> I think the word you're groping for here is "microkernel".
>
> M.

Oh, yeah. Page replacement policy in user level. That one was
a real winner.

gerrit

2003-05-22 01:51:52

by William Lee Irwin III

[permalink] [raw]
Subject: Re: userspace irq balancer

On Wed, May 21, 2003 at 05:29:45PM -0700, Gerrit Huizenga wrote:
> Yeah, I suppose this userland policy change means we should pull
> the scheduler policy decisions out of the kernel and write user level
> HT, NUMA, SMP and UP schedulers. Also, the IO schedulers should
> probably be pulled out - I'm sure AS and CFQ and linus_scheduler
> could be user land policies, as well as the elevator. Memory
> placement and swapping policies, too.
> Oh, wait, some people actually do this - they call it, what,
> Workload Management or some such thing. But I don't know any
> style of workload management that leaves *no* default, semi-sane
> policy in the kernel.

This is not the case. Interrupt arbitration for sane things generally
balances interrupt load automatically in-hardware. AIUI the TPR was
intended to enable the hardware to do such a thing for xAPIC. Linux
doesn't use the TPR now, which results in decisions made by the
hardware on xAPIC-based SMP systems that are highly detrimental to
performance.

Allowing userspace to exploit more specific knowledge and perform
either static or userspace-controlled dynamic interrupt affinity
is not equivalent to having an insane default policy in-kernel.

The task scheduler, the io scheduler, and memory entitlement policies
are very different issues. They deal entirely with managing software
constructs and resource allocation. Memory placement policies sit at
least two or three levels above anything hardware memory management
can do and it is safe to say it's infeasible to implement NUMA memory
placement policies in hardware.

Interrupt load balancing is very much doable in hardware and prior to
xAPIC it was done so in all cases; for xAPIC the hardware mechanism
became strictly bound to the TPR and had less optimal tiebreak
resolution decisions (something on the order of defaulting to the
lowest APIC ID in the event of a tie, which always occurs if the TPR
isn't frobbed). This naturally creates a problem, which these userspace
and kernel mechanisms are meant to address.

The difficulty with the in-kernel policy is that its decisions are not
optimal for all cases, and it has implementation issues that prevent it
from being fully generally used, i.e. it does not handle the physical
DESTMOD case for pre-xAPIC systems with multiple APIC buses, which
amounts to a very simple incompleteness of what to all outward
appearances is an already large and feature-rich implementation; the
kernel code merely refrains from calling it in that case as a brute-
force workaround. Furthermore the complexity of the decisions to be
made is inappropriate for the kernel. It needs unusual (and slow)
manipulation of hardware to be done in code requiring fast response
times in various cases and that is called at an uncontrollable rate. It
has heuristics which may be inaccurate or wrong for various cases.

IMHO Linux on Pentium IV should use the TPR in conjunction with _very_
simplistic interrupt load accounting by default and all more
sophisticated logic should be punted straight to userspace as an
administrative API.

To quote chapter and verse:

IA-32 Intel Architecture Software Developer's Manual
Volume 3: System Programming Guide
Section 8.6.2.4 "Lowest Priority Delivery Mode"

"In operating systems that use the lowest-priority delivery mode but do
not update the TPR, the TPR information saved in the chipset will
potentially cause the interrupt to always be delivered to the same
processor from the logical set. This behavior is functionally backward
compatible with the P6 family processor but may result in unexpected
performance implications."

i.e. frob the fscking TPR as recommended by the APIC docs every once in
a while by default, punt anything (and everything) fancier up to
userspace, and get the code that doesn't even understand what the fsck
DESTMOD means the Hell out of the kernel and the Hell away from my
IO-APIC RTE's.
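
What "very simplistic interrupt load accounting" would look like is left open; one hedged reading is a periodic tick that bumps a CPU's TPR class when it has taken more than its share of interrupts, so lowest-priority delivery stops favouring it. The two classes and the threshold below are illustrative assumptions, not anyone's patch; only the top nibble matters since that is all the chipset compares:

```c
#include <assert.h>

#define TPR_QUIET 0x00	/* willing to take more interrupts       */
#define TPR_BUSY  0x10	/* took many recently; prefer other cpus */

/* Decide this CPU's next TPR class from a per-tick interrupt count. */
unsigned int pick_tpr(unsigned long irqs_this_tick, unsigned long threshold)
{
	return irqs_this_tick > threshold ? TPR_BUSY : TPR_QUIET;
}
```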


-- wli

2003-05-22 01:51:02

by William Lee Irwin III

[permalink] [raw]
Subject: Re: userspace irq balancer

On Wed, 21 May 2003 18:28:56 PDT, "Martin J. Bligh" wrote:
>> I think the word you're groping for here is "microkernel".

On Wed, May 21, 2003 at 06:44:46PM -0700, Gerrit Huizenga wrote:
> Oh, yeah. Page replacement policy in user level. That one was
> a real winner.

That's incorrect. Page replacement policy at the user-level was
proposed but AIUI not included in Mach. You're thinking of external
pagers, which are grossly inefficient but have only to do with the
I/O to fetch and store pages belonging to a memory object, not with
page replacement. Mach's page replacement policy was global SEGQ.

This is specifically discussed by Vahalia.


-- wli

2003-05-22 02:09:38

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: userspace irq balancer

On Wed, 21 May 2003, William Lee Irwin III wrote:

> This is not the case. Interrupt arbitration for sane things generally
> balances interrupt load automatically in-hardware. AIUI the TPR was
> intended to enable the hardware to do such a thing for xAPIC. Linux
> doesn't use the TPR now, which results in decisions made by the
> hardware on xAPIC-based SMP systems that are highly detrimental to
> performance.

Well using the APIC arbitration round robin thing isn't all that smart
either unless you use the TPR, so TPR would be a win everywhere.

> IMHO Linux on Pentium IV should use the TPR in conjunction with _very_
> simplistic interrupt load accounting by default and all more
> sophisticated logic should be punted straight to userspace as an
> administrative API.
>
> i.e. frob the fscking TPR as recommended by the APIC docs every once in
> a while by default, punt anything (and everything) fancier up to
> userspace, and get the code that doesn't even understand what the fsck
> DESTMOD means the Hell out of the kernel and the Hell away from my
> IO-APIC RTE's.

Word... This is all rather tired. If we have a working, user-accessible
irq affinity interface this can all go away, so how about we just work
towards that, and then remove kirqd when everyone is happy
(personally I like Arjan's/RH9 userland irqbalance).

Zwane
--
function.linuxpower.ca

2003-05-22 03:45:03

by Martin J. Bligh

[permalink] [raw]
Subject: Re: userspace irq balancer

> The task scheduler, the io scheduler, and memory entitlement policies
> are very different issues. They deal entirely with managing software
> constructs and resource allocation.

So we should expose low-level hardware stuff to userspace to manage,
but not higher level software constructs? I fail to see the abiding
logic there. If anything, the inverse ought to be true.

> IMHO Linux on Pentium IV should use the TPR in conjunction with _very_
> simplistic interrupt load accounting by default and all more
> sophisticated logic should be punted straight to userspace as an
> administrative API.

I'd be happy with that - sounds to me like you're arguing for the same
thing. Sane default in kernel, can override from userspace if you like.
However, I've yet to see an implementation of the TPR usage that got
good performance numbers ... I'd love to see that happen.

M.

2003-05-22 14:05:45

by James Cleverdon

[permalink] [raw]
Subject: Re: userspace irq balancer

On Wednesday 21 May 2003 07:04 pm, William Lee Irwin III wrote:
[ Snip! ]
> ...
> IMHO Linux on Pentium IV should use the TPR in conjunction with _very_
> simplistic interrupt load accounting by default and all more
> sophisticated logic should be punted straight to userspace as an
> administrative API.
>
[ Snip! ]
>
> i.e. frob the fscking TPR as recommended by the APIC docs every once in
> a while by default, punt anything (and everything) fancier up to
> userspace, and get the code that doesn't even understand what the fsck
> DESTMOD means the Hell out of the kernel and the Hell away from my
> IO-APIC RTE's.
>
>
> -- wli

Here's my old very stupid TPR patch. It lacks TPRing soft ints for kernel
preemption, etc. Because the xTPR logic only compares the top nibble of the
TPR and I don't want to mask out IRQs unnecessarily, it only tracks busy/idle
and IRQ/no-IRQ.

Simple enough for you, Bill? 8^)

--
James Cleverdon
IBM xSeries Linux Solutions
{jamesclv(Unix, preferred), cleverdj(Notes)} at us dot ibm dot com


Attachments:
tpr_dyn-2003-01-10_2.5.55 (3.40 kB)

2003-05-22 14:30:30

by William Lee Irwin III

[permalink] [raw]
Subject: Re: userspace irq balancer

On Thu, May 22, 2003 at 07:18:06AM -0700, James Cleverdon wrote:
> Here's my old very stupid TPR patch . It lacks TPRing soft ints for kernel
> preemption, etc. Because the xTPR logic only compares the top nibble of the
> TPR and I don't want to mask out IRQs unnecessarily, it only tracks busy/idle
> and IRQ/no-IRQ.
> Simple enough for you, Bill? 8^)

Simple enough, yes. But I hesitate to endorse it without making sure
it's not too simple.

It's much closer to the right direction, which is actually following
hardware docs and then punting the fancy (potentially more performant)
bits up into userspace. When properly tuned, it should actually have a
useful interaction with explicit irq balancing via retargeting IO-APIC
RTE destinations as interrupts targeted at a destination specifying
multiple cpus won't always target a single cpu when TPR's are adjusted.

The only real issue with the TPR is that it's an spl-like ranking of
interrupts, assuming a static prioritization based on vector number.
That doesn't really agree with the Linux model and is undesirable in
various scenarios; however, it's how the hardware works and so can't
be avoided (and the disastrous attempt to avoid it didn't DTRT anyway).


-- wli

2003-05-22 15:18:04

by James Cleverdon

[permalink] [raw]
Subject: Re: userspace irq balancer

On Thursday 22 May 2003 07:43 am, William Lee Irwin III wrote:
> On Thu, May 22, 2003 at 07:18:06AM -0700, James Cleverdon wrote:
> > Here's my old very stupid TPR patch . It lacks TPRing soft ints for
> > kernel preemption, etc. Because the xTPR logic only compares the top
> > nibble of the TPR and I don't want to mask out IRQs unnecessarily, it
> > only tracks busy/idle and IRQ/no-IRQ.
> > Simple enough for you, Bill? 8^)
>
> Simple enough, yes. But I hesitate to endorse it without making sure
> it's not too simple.
>
> It's much closer to the right direction, which is actually following
> hardware docs and then punting the fancy (potentially more performant)
> bits up into userspace. When properly tuned, it should actually have a
> useful interaction with explicit irq balancing via retargeting IO-APIC
> RTE destinations as interrupts targeted at a destination specifying
> multiple cpus won't always target a single cpu when TPR's are adjusted.
>
> The only real issue with the TPR is that it's an spl-like ranking of
> interrupts, assuming a static prioritization based on vector number.
> That doesn't really agree with the Linux model and is undesirable in
> various scenarios; however, it's how the hardware works and so can't
> be avoided (and the disastrous attempt to avoid it didn't DTRT anyway).
>
>
> -- wli

Serial APICs have always had a spl-like effect built into them. The effective
TPR value of a given local APIC is:
max(TPR, highest vector currently in progress) & 0xF0
Parallel APICs don't do that because they don't have serial priority
arbitration; instead they use the xTPRs in the bridge chips.
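
The effective-priority rule above can be written out directly; the low nibble is discarded because the arbitration compares only the top nibble:

```c
#include <assert.h>

/*
 * Effective task priority of a serial local APIC: the maximum of the
 * TPR and the highest vector currently in service, top nibble only.
 */
unsigned int effective_tpr(unsigned int tpr, unsigned int in_service_vec)
{
	unsigned int v = tpr > in_service_vec ? tpr : in_service_vec;

	return v & 0xF0;
}
```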

So, I suppose an argument could be made for setting the TPR to the vector
number on entry of do_IRQ. I don't think that would be a good idea. It
could interfere with IRQ nesting during a non-DMA IDE interrupt handler. And
of course, an IRQ's vector has little to do with the IRQ itself, thanks to
the vector hashing scheme used to avoid the (stupid) 2 latches per APIC level
HW limitation of most i586 and i686 CPUs.


--
James Cleverdon
IBM xSeries Linux Solutions
{jamesclv(Unix, preferred), cleverdj(Notes)} at us dot ibm dot com

2003-05-22 15:33:03

by William Lee Irwin III

[permalink] [raw]
Subject: Re: userspace irq balancer

On Thu, May 22, 2003 at 08:30:29AM -0700, James Cleverdon wrote:
> Serial APICs have always had a spl-like effect built into them.
> The effective TPR value of a given local APIC is:
> max(TPR, highest vector currently in progress) & 0xF0
> Parallel APICs don't do that because they don't have serial priority
> arbitration; instead they use the xTPRs in the bridge chips.
> So, I suppose an argument could be made for setting the TPR to the vector
> number on entry of do_IRQ. I don't think that would be a good idea. It
> could interfere with IRQ nesting during a non-DMA IDE interrupt handler. And
> of course, an IRQ's vector has little to do with the IRQ itself, thanks to
> the vector hashing scheme used to avoid the (stupid) 2 latches per APIC level
> HW limitation of most i586 and i686 CPUs.

The code to deal with the 2 latches per APIC level is already
problematic in other ways. I'm not sure how much we can be allowed to
mix the issues. But I wouldn't mind at least hearing about alternative
methods of dealing with that that interact better with the rest of the
APIC mechanics. My interest in particular is vector exhaustion, but as
you point out rearrangements of that can also serve to make the TPR
more meaningful than it is now, and perhaps reduce mutual interference
between devices generating many interrupts and those generating few.

-- wli

2003-05-22 17:17:22

by Bill Davidsen

[permalink] [raw]
Subject: Re: userspace irq balancer

On 19 May 2003, Dave Hansen wrote:

> On Mon, 2003-05-19 at 15:11, Arjan van de Ven wrote:
[...snip...]
> But, do you see the need for ripping out the current code? For those of
> us that are still running a slightly more primitive distro, it would be
> nice to have some pretty effective default behavior, like what is in the
> kernel now.

Ripping out the current code and having useful default behaviour are
hopefully not mutually exclusive.

On 19 May 2003, David S. Miller wrote:


> You have to install new modutils to even use modules with the 2.5.x
> kernel, given that why are we even talking about the "inconvenience"
> of installing the usermode IRQ balancer as being a blocker for
> ripping out the in-kernel stuff?

But you don't have to use modules at all, while running without
interrupts isn't an option.

> The in-kernel stuff MUST go. It went in because "some benchmark went
> faster", but with no "why" describing why it might have improved
> performance. We KNOW it absolutely sucks for routing and firewall
> applications. The in-kernel bits were all a shamans dance, with zero
> technical "here is why this makes things go faster" description
> attached. If I remember properly, the changelog message when the
> in-kernel irq balancing went in was of the form "this makes some
> specweb run go faster".

Perhaps I misread Linus' recent post about not breaking things to the
user, and I know he was talking about executables in particular, but if
this new code is so great, why can't it start with some default values
which will be no worse than what we have? Because there will be people
who haven't installed the latest userspace int-diddler.

Deliberately making the initial tuning useless to promote use of the
user space software seems counterproductive. There will be many users
who will find a way to make a tuned kernel worse than any default, so if
the default is usable it will avoid people shooting themselves in the
foot. Can you imagine someone leaving an interrupt unserviced at all? If
there's a way, someone will :-(

I'm not defending the existing code, just speaking for the idea of
leaving something useful in its place.


On Tue, 20 May 2003, Andrew Theurer wrote:

[...snip...]
> If kirq gets ripped out, at least have some default policy that is somewhat
> harmless, like destination cpu = int_number % nr_cpus. I think Suse8 had
> this, and it performed reasonably well.

As long as it's something sane, it doesn't have to be optimal. If SuSE
used it, it's likely to be good enough.
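
The quoted default is small enough to state exactly (a sketch of the stated rule, not the actual Suse8 code):

```c
#include <assert.h>

/* Static default placement: IRQ n goes to CPU n mod nr_cpus. */
unsigned int default_irq_cpu(unsigned int irq, unsigned int nr_cpus)
{
	return irq % nr_cpus;
}
```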


On Wed, 21 May 2003, Martin J. Bligh wrote:

[...snip...]
> I'd be happy with that - sounds to me like you're arguing for the same
> thing. Sane default in kernel, can override from userspace if you like.
> However, I've yet to see an implementation of the TPR usage that got
> good performance numbers ... I'd love to see that happen.

I may be misreading the intent in David's messages; as long as the
default is useful it certainly doesn't have to be what is in place now.
Of course it would be useful to actually see some numbers: just because
the existing code is somewhat ugly, confusing, and ill-justified and the
new is a pretty algorithm with great flexibility does not mean that the
actual benchmarks will reflect better performance.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2003-05-22 22:34:00

by David Miller

[permalink] [raw]
Subject: Re: userspace irq balancer

From: Bill Davidsen <[email protected]>
Date: Thu, 22 May 2003 13:24:26 -0400 (EDT)

> You have to install new modutils to even use modules with the 2.5.x
> kernel, given that why are we even talking about the "inconvenience"
> of installing the usermode IRQ balancer as being a blocker for
> ripping out the in-kernel stuff?

But you don't have to use modules at all, while running without
interrupts isn't an option.

Interrupts WORK without the IRQ balancing.

Come on people, get real already.

2003-05-24 00:57:54

by Nakajima, Jun

[permalink] [raw]
Subject: RE: userspace irq balancer

> So, I suppose an argument could be made for setting the TPR to the vector
> number on entry of do_IRQ. I don't think that would be a good idea.

I agree. If we start spl-like ranking of interrupts, we need to modify disable/enable_irq(), etc. as well, causing possible impacts on device drivers.

One thing that might be helpful here is to have four levels of priorities, for example:
Idle (0)
User (0x10)
Kernel (0x20)
Interrupt (0x30)

Jun
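
Read as a table, the four levels would map the CPU's current context to a TPR class along these lines; tying the listed values to these contexts is an illustrative reading of the proposal, not a patch:

```c
#include <assert.h>

enum ctx { CTX_IDLE, CTX_USER, CTX_KERNEL, CTX_INTERRUPT };

/* Map the CPU's current context to the proposed TPR class. */
unsigned int ctx_to_tpr(enum ctx c)
{
	static const unsigned int tpr[] = { 0x00, 0x10, 0x20, 0x30 };

	return tpr[c];
}
```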

> -----Original Message-----
> From: James Cleverdon [mailto:[email protected]]
> Sent: Thursday, May 22, 2003 8:30 AM
> To: William Lee Irwin III
> Cc: Gerrit Huizenga; Nakajima, Jun; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; Andrew Theurer
> Subject: Re: userspace irq balancer
>
> On Thursday 22 May 2003 07:43 am, William Lee Irwin III wrote:
> > On Thu, May 22, 2003 at 07:18:06AM -0700, James Cleverdon wrote:
> > > Here's my old very stupid TPR patch . It lacks TPRing soft ints for
> > > kernel preemption, etc. Because the xTPR logic only compares the top
> > > nibble of the TPR and I don't want to mask out IRQs unnecessarily, it
> > > only tracks busy/idle and IRQ/no-IRQ.
> > > Simple enough for you, Bill? 8^)
> >
> > Simple enough, yes. But I hesitate to endorse it without making sure
> > it's not too simple.
> >
> > It's much closer to the right direction, which is actually following
> > hardware docs and then punting the fancy (potentially more performant)
> > bits up into userspace. When properly tuned, it should actually have a
> > useful interaction with explicit irq balancing via retargeting IO-APIC
> > RTE destinations as interrupts targeted at a destination specifying
> > multiple cpus won't always target a single cpu when TPR's are adjusted.
> >
> > The only real issue with the TPR is that it's an spl-like ranking of
> > interrupts, assuming a static prioritization based on vector number.
> > That doesn't really agree with the Linux model and is undesirable in
> > various scenarios; however, it's how the hardware works and so can't
> > be avoided (and the disastrous attempt to avoid it didn't DTRT anyway).
> >
> >
> > -- wli
>
> Serial APICs have always had a spl-like effect built into them. The
> effective
> TPR value of a given local APIC is:
> max(TPR, highest vector currently in progress) & 0xF0
> Parallel APICs don't do that because they don't have serial priority
> arbitration; instead they use the xTPRs in the bridge chips.
>
> So, I suppose an argument could be made for setting the TPR to the vector
> number on entry of do_IRQ. I don't think that would be a good idea. It
> could interfere with IRQ nesting during a non-DMA IDE interrupt handler.
> And
> of course, an IRQ's vector has little to do with the IRQ itself, thanks to
> the vector hashing scheme used to avoid the (stupid) 2 latches per APIC
> level
> HW limitation of most i586 and i686 CPUs.
>
>
> --
> James Cleverdon
> IBM xSeries Linux Solutions
> {jamesclv(Unix, preferred), cleverdj(Notes)} at us dot ibm dot com

2003-05-26 22:38:58

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: userspace irq balancer

On Thu, May 22, 2003 at 03:44:10PM -0700, David S. Miller wrote:
> From: Bill Davidsen <[email protected]>
> Date: Thu, 22 May 2003 13:24:26 -0400 (EDT)
>
> > You have to install new modutils to even use modules with the 2.5.x
> > kernel, given that why are we even talking about the "inconvenience"
> > of installing the usermode IRQ balancer as being a blocker for
> > ripping out the in-kernel stuff?
>
> But you don't have to use modules at all, while running without
> interrupts isn't an option.
>
> Interrupts WORK without the IRQ balancing.
>
> Come on people, get real already.

This is not the point; the whole point of irq balancing is performance,
not functionality.

Tell me how to do what I'm doing in my 2.4 tree in userspace (this was my
partially rewritten implementation based on Ingo's original code that I found
very suboptimal while reading it):

#ifdef CONFIG_SMP

#define IRQ_BALANCE_INTERVAL	(HZ/50)

typedef struct {
	unsigned int cpu;
	unsigned long timestamp;
} ____cacheline_aligned irq_balance_t;

static irq_balance_t irq_balance[NR_IRQS] __cacheline_aligned;
extern unsigned long irq_affinity[NR_IRQS];

#define IRQ_ALLOWED(cpu,allowed_mask) \
	((1UL << cpu) & (allowed_mask))

static unsigned long move(unsigned int curr_cpu, unsigned long allowed_mask, unsigned long now, int direction)
{
	unsigned int cpu = curr_cpu;
	unsigned int phys_id;

	phys_id = cpu_logical_map(cpu);
	if (IRQ_ALLOWED(phys_id, allowed_mask) && idle_cpu(phys_id))
		return cpu;

	goto inside;

	do {
		if (unlikely(cpu == curr_cpu))
			return cpu;
	inside:
		if (direction == 1) {
			cpu++;
			if (cpu >= smp_num_cpus)
				cpu = 0;
		} else {
			cpu--;
			if (cpu == -1)
				cpu = smp_num_cpus-1;
		}

		phys_id = cpu_logical_map(cpu);
	} while (!IRQ_ALLOWED(phys_id, allowed_mask) || !idle_cpu(phys_id));

	return cpu;
}

#endif /* CONFIG_SMP */

static inline void balance_irq(int irq)
{
#if CONFIG_SMP
	irq_balance_t *entry = irq_balance + irq;
	unsigned long now = jiffies;

	if (unlikely(time_after(now, entry->timestamp + IRQ_BALANCE_INTERVAL))) {
		unsigned long allowed_mask;
		unsigned int new_cpu;
		int random_number;

		entry->timestamp = now;

		rdtscl(random_number);
		random_number &= 1;

		allowed_mask = cpu_online_map & irq_affinity[irq];
		new_cpu = move(entry->cpu, allowed_mask, now, random_number);
		if (entry->cpu != new_cpu) {
			entry->cpu = new_cpu;
			set_ioapic_affinity(irq, clustered_apic_mode == 0 ?
					    1UL << new_cpu : cpu_present_to_apicid(entry->cpu));
		}
	}
#endif
}

This is much better than whatever you can do in userspace. You'll never be able
to react quickly enough in userspace to avoid wasting idle timeslices of cpus.

Unfortunately I haven't seen number comparisons of this with the userspace
balancer, but in the numbers the above is an order of magnitude better
than all the other 2.4 patches I've seen floating around. The other patches
reprogram the IO-APIC even when no irq routing change is made, and they
reprogram at an overkill frequency, so it's no surprise the above is much
faster, and no surprise the userspace irq balancer would be much faster
than those patches as well.

It would be interesting to compare the above with the userspace balancer on
a system with some cpus sometimes idle, like some network-bound file
serving or similar.

To you it seems obvious that no kernel-only optimization like the above (not
remotely doable in userspace) can ever exist. That isn't true as far as I can
tell, and to me those heuristics make lots of sense, especially on a 64way smp
not under 100% cpu utilization.

And if you put the userspace balancer on top of the above, the above will
cost only a branch (not even an extern call), unlike the other
patches that were reprogramming the ioapic regardless.

And overall I believe it doesn't make much sense to leave irq balancing in
userspace; the advantages seem too few and the disadvantages way too many.

The portability argument is a plain symptom of the lack of an API for writing
the algorithm in a portable way: the proof is the /proc/irq interface, which
for instance is still duplicated across all ports. Grep for register_irq_proc;
you'll find it in sparc64, in i386, etc. Fix that and the portability argument
will go away. In fact I wish the sparc64 port would finally go in sync with
the rest of the kernel with regards to irqs. x86, alpha, x86-64 and ia64 are
all using the exact same logic for irq handling; sparc is the only
relevant arch that refuses to go in sync. If you miss functionality from
the current irq API in all major ports, just extend it and put sparc in
sync, so that irq.c can finally be moved to kernel/ and not live under arch/.
If sparc64 is so much smarter than the rest of the archs in handling
interrupts, then just make it the standard for all ports. Then there
will be no risk of portability issues while writing an irq
balancing algorithm in kernel. I would like to see that cleanup
happening; it wouldn't only help stuff like irq balancing, it
should also help writing portable realtime hooks that need to deal with
those bits in irq.c. Deferring the need for this further with the excuse
that irq balancing in userspace is better doesn't sound good to me. Sure,
I also don't like to see the irq balancing code cut and pasted a dozen
times in the kernel, but keeping it in userspace just hides the
problem: that the only common API is provided to userspace and not to
the kernel itself.

Maybe you leave sparc64 with its own implementation to avoid one
indirect call per irq? But do you really think those ugly if/else chains are
faster than an indirect call? And what happens when you have to support one
more chipset?

	if (tlb_type == cheetah || tlb_type == cheetah_plus) {
		/* We set it to our Safari AID. */
		__asm__ __volatile__("ldxa [%%g0] %1, %0"
				     : "=r" (tid)
				     : "i" (ASI_SAFARI_CONFIG));
		tid = ((tid & (0x3ffUL<<17)) << 9);
		tid &= IMAP_AID_SAFARI;
	} else if (this_is_starfire == 0) {
		/* We set it to our UPA MID. */
		__asm__ __volatile__("ldxa [%%g0] %1, %0"
				     : "=r" (tid)
				     : "i" (ASI_UPA_CONFIG));
		tid = ((tid & UPA_CONFIG_MID) << 9);
		tid &= IMAP_TID_UPA;
	} else {
		tid = (starfire_translate(imap, smp_processor_id()) << 26);
		tid &= IMAP_TID_UPA;
	}

Andrea

2003-05-26 23:13:03

by Andrew Morton

Subject: Re: userspace irq balancer

Andrea Arcangeli <[email protected]> wrote:
>
> if (IRQ_ALLOWED(phys_id, allowed_mask) && idle_cpu(phys_id))
> return cpu;

How hard would it be to make this HT-aware?

idle_cpu(phys_id) && idle_cpu_siblings(phys_id)

or whatever.

2003-05-26 23:21:38

by Andrea Arcangeli

Subject: Re: userspace irq balancer

On Mon, May 26, 2003 at 04:26:16PM -0700, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > if (IRQ_ALLOWED(phys_id, allowed_mask) && idle_cpu(phys_id))
> > return cpu;
>
> How hard would it be to make this HT-aware?
>
> idle_cpu(phys_id) && idle_cpu_siblings(phys_id)
>
> or whatever.

yeah! that was the obvious next step. As a fast path the additional && is
surely good. Maybe that's enough after all, and we might search only for
fully idle cpus; however I wouldn't mind searching for a fallback
(partially) idle logical cpu if no physical cpu is (fully) idle.

Andrea

2003-05-26 23:31:38

by David Miller

Subject: Re: userspace irq balancer

From: Andrea Arcangeli <[email protected]>
Date: Tue, 27 May 2003 01:34:46 +0200

On Mon, May 26, 2003 at 04:26:16PM -0700, Andrew Morton wrote:
> How hard would it be to make this HT-aware?

yeah! that was the obvious next step.

So what is idle defined as? How are you going to measure things like
softirq load? How much weight will softirq load get compared to
hardirq load? How will process load be factored into this and what
weight will this get?

All of these questions have no answer as far as the kernel is
concerned, because this is a policy decision and something the user
ought to be able to configure to suit his needs.

All you've said today is that IRQ balancing needs to be more like the
cpufreq drivers. The hardware programming and some of the delicate
time-sensitive details are done in the kernel, but deciding how and
when to do these things belongs in userspace.

I still contend that Arjan's usermode irq balancer solves one realm of
those problems. And there is nothing that prevents his work from
being extended to upload policies for the things you have brought up
today.

Finally, claiming this is a performance issue is moot. We've already
shown that if the current IRQ load balancer in 2.5.x improves
performance for any network-based workloads, there is no reasonable
reason WHY that is the case, because its behavior is anti-networking in
nature: it thinks hardware IRQ load equates to real load, which
it does not.

2003-05-27 00:04:01

by David Miller

Subject: Re: userspace irq balancer

From: Andrea Arcangeli <[email protected]>
Date: Tue, 27 May 2003 02:06:39 +0200

On Mon, May 26, 2003 at 04:43:00PM -0700, David S. Miller wrote:
> softirq load? How much weight will softirq load get compared to

normally the softirq runs after the hardirq (common case) and you want to
run those softirq computations on the idle cpu too (i.e. no userspace
running, so offload the hardirq and in turn the softirq to it). No
difference: a softirq is like a hardirq, just longer and with hardirqs
enabled. One more reason to offload it to an idle cpu.

One hardirq can equate to thousands of packets worth of softirq load,
especially with NAPI. And you cannot even know what this ratio is
(packets processed per hardware IRQ load). It can be anywhere from
1 to 1000. And you absolutely do not want to behave identically
for all such values.

How can you even claim to be taking this into account in a logical
manner if you cannot even tell me how you will determine how much
softirq load is created by a hardware irq?

2003-05-27 00:28:09

by Andrea Arcangeli

Subject: Re: userspace irq balancer

On Mon, May 26, 2003 at 05:15:27PM -0700, David S. Miller wrote:
> One hardirq can equate to thousands of packets worth of softirq load,
> especially with NAPI. And you cannot even know what this ratio is
> (packets processed per hardware IRQ load). It can be anywhere from
> 1 to 1000. And you absolutely do not want to behave identically
> for all such values.
>
> How can you even claim to be taking this into account in a logical
> manner if you cannot even tell me how you will determine how much
> softirq load is created by a hardware irq?

you brought up a very funny case ;), agreed! Certainly if it's ksoftirqd
with NAPI that is causing the irq to move to the next cpu, you don't want to
migrate it: ksoftirqd will keep following it, bouncing
back and forth across all idle cpus (assuming a firewall with many idle
cpus). So basically you're saying that we have to change it to idle() ||
== ksoftirqd (I'll fix it, thanks!). The additional check should avoid
the misbehaviour. In 2.4 the softirqs (of course w/o NAPI) are normally
served in irq context, so we didn't face this yet.

But it doesn't change my basic argument about this topic, that there's
no way in userspace to do anything remotely as accurate as that to boost
system performance to the maximum, especially on big systems. My current
algorithm was a minimal attempt to do something better than the static
(or almost static) bindings with all the info we have in kernel (that
userspace could never use efficiently on a timely basis).

Andrea

2003-05-27 00:37:15

by David Miller

Subject: Re: userspace irq balancer

From: Andrea Arcangeli <[email protected]>
Date: Tue, 27 May 2003 02:41:15 +0200

In 2.4 normally the softirq (of course w/o NAPI) are
served in irq context so we didn't face this yet.

Andrea, whether ksoftirqd processes the softirq work or not has
nothing to do with what I'm talking about.

It is all about what does a hardware IRQ mean in terms of work
processed. And it can mean anything from 1 to 1000 packets worth
of work.

Therefore, any usage of hardware IRQ activity to determine "load" in
any sense is totally inaccurate.

So I'm asking you, again, how are you going to measure softirq load in
making hardware IRQ load balancing decisions? Watching the scheduling
and running of ksoftirqd is not an answer. Networking hardware
interrupts, with a simplistic and mindless algorithm like the one we
have currently in the 2.5.x IRQ balancing code, appear to be
contributing very little to "load" and that is wrong.

But it doesn't change my basic argument about this topic, that there's
no way in userspace to do anything remotely as accurate as that to boost
system performance to the maximum, especially on big systems.

You show that the measurements and reactions belong there. This I
totally understand. This is how cpufreq is implemented in 2.5.x
currently. It is a very similar situation.

But deciding how to interpret these measurements and what to do in
response is a userlevel policy decision. This also coincides with
how cpufreq works.

2003-05-27 00:55:50

by Andrea Arcangeli

Subject: Re: userspace irq balancer

On Mon, May 26, 2003 at 05:48:41PM -0700, David S. Miller wrote:
> From: Andrea Arcangeli <[email protected]>
> Date: Tue, 27 May 2003 02:41:15 +0200
>
> In 2.4 normally the softirq (of course w/o NAPI) are
> served in irq context so we didn't face this yet.
>
> Andrea, whether ksoftirqd processes the softirq work or not has
> nothing to do with what I'm talking about.
>
> It is all about what does a hardware IRQ mean in terms of work
> processed. And it can mean anything from 1 to 1000 packets worth
> of work.
>
> Therefore, any usage of hardware IRQ activity to determine "load" in
> any sense is totally inaccurate.
>
> So I'm asking you, again, how are you going to measure softirq load in
> making hardware IRQ load balancing decisions? Watching the scheduling

rdtsc could do it very well; irqs and softirqs can't be rescheduled, so
you can measure with timestamps how long you take on each cpu, and the same
goes for each task before migrating to another cpu (I'm only assuming this
is SMP and not AMT; still, if the difference in cpu frequency among cpus
isn't huge it could still work with AMT, and a multiplier could be applied
with AMT). This "non idle" load could be accounted in a per-cpu array.

I'm not going to implement the above in 2.4, that sounds like a 2.5 thing,
but my point is that just ignoring ksoftirqd in the idle selection
should avoid the biggest of the NAPI issues. I'm approximating, i.e. a
better-than-nothing approach (either that or nothing). I never claimed
this to be a final golden algorithm, just obviously better than the
total-thrashing one, and even w/o the ksoftirqd and HT last bits, numbers
confirmed that.

And for 2.5 there are many doors open for further optimizations of
course.

> But deciding how to intepret these measurements and what to do in
> response is a userlevel policy decision. This also coincides with
> how cpufreq works.

you mean you can have slightly different modes selectable by sysctl,
right? Or do you really want to generate a reschedule per second, with
tlb flushes and a microkernel API between user and kernel, in turn a total
waste of resources, just to avoid admitting irq balancing belongs in the
kernel?

Andrea

2003-05-27 01:01:42

by David Miller

Subject: Re: userspace irq balancer

From: Andrea Arcangeli <[email protected]>
Date: Tue, 27 May 2003 03:09:03 +0200

I'm not going to implement the above in 2.4, that sounds a 2.5 thing,

Then your 2.4.x load balancing is buggy for networking.
You simply cannot ignore this issue and act as if it
does not exist and does not have huge consequence for IRQ
load balancing decisions.

but my point is that by just ignoring ksoftirqd in the idle selection
should avoid the biggest of the NAPI issues.

On a properly functioning system, ksoftirqd should not be running.

> But deciding how to intepret these measurements and what to do in
> response is a userlevel policy decision. This also coincides with
> how cpufreq works.

you mean you can have slightly different modes selectable by sysctl
right?

One possibility. Another is a descriptor describing things like
how much to weight hardware vs. software IRQ load, vs. process
load, etc.

or do you really want to generate a reschedule per second

No, nothing like this.

2003-05-27 01:02:41

by Dave Jones

Subject: Re: userspace irq balancer

On Tue, May 27, 2003 at 03:09:03AM +0200, Andrea Arcangeli wrote:

> > So I'm asking you, again, how are you going to measure softirq load in
> > making hardware IRQ load balancing decisions? Watching the scheduling
>
> rdtsc could do it very well, irqs and softirqs can't be rescheduled so
> you can tick measure how long you take in each cpu

On CPUs that vary frequency, this will break, unless TSC scales
with frequency. You cannot assume that this will be the case.

Dave

2003-05-27 01:05:46

by Andrea Arcangeli

Subject: Re: userspace irq balancer

On Tue, May 27, 2003 at 01:34:46AM +0200, Andrea Arcangeli wrote:
> On Mon, May 26, 2003 at 04:26:16PM -0700, Andrew Morton wrote:
> > Andrea Arcangeli <[email protected]> wrote:
> > >
> > > if (IRQ_ALLOWED(phys_id, allowed_mask) && idle_cpu(phys_id))
> > > return cpu;
> >
> > How hard would it be to make this HT-aware?
> >
> > idle_cpu(phys_id) && idle_cpu_siblings(phys_id)
> >
> > or whatever.
>
> yeah! that was the obvious next step. as fast path the additional && is
> sure good. Maybe that's enough after all, and we might search only for
> fully idle cpus, however I wouldn't dislike to search for a fallback
> (partially) logical idle cpu if none physical cpu is (fully) idle.

I'm going to try this (if it compiles ;). The ksoftirqd check is the one
for the NAPI workload brought to our attention by Dave. The idea is that
statistically the softirq load will follow the hardirq load. Both want
to go onto an idle cpu. But we don't want to mistake the softirq load
for unrelated cpu load, so we don't want to separate the ksoftirqd load
from the irq load or we could keep bouncing over and over again.

For HT I take the trivial approach you mentioned above, that is to switch
only if the physical cpu is completely idle.

--- ./arch/i386/kernel/io_apic.c.~1~ 2003-05-27 02:45:34.000000000 +0200
+++ ./arch/i386/kernel/io_apic.c 2003-05-27 03:00:32.000000000 +0200
@@ -217,13 +217,18 @@ extern unsigned long irq_affinity [NR_IR
#define IRQ_ALLOWED(cpu,allowed_mask) \
((1UL << cpu) & (allowed_mask))

+#define ksoftirqd_is_running(phys_id) (cpu_curr(phys_id) == ksoftirqd_task(phys_id))
+#define __irq_idle_cpu(phys_id) (idle_cpu(phys_id) || ksoftirqd_is_running(phys_id))
+#define irq_idle_cpu(phys_id) (__irq_idle_cpu(phys_id) && \
+			       (smp_num_siblings <= 1 || __irq_idle_cpu(cpu_sibling_map[phys_id])))
+
static unsigned long move(unsigned int curr_cpu, unsigned long allowed_mask, unsigned long now, int direction)
{
unsigned int cpu = curr_cpu;
unsigned int phys_id;

phys_id = cpu_logical_map(cpu);
- if (IRQ_ALLOWED(phys_id, allowed_mask) && idle_cpu(phys_id))
+ if (IRQ_ALLOWED(phys_id, allowed_mask) && irq_idle_cpu(phys_id))
return cpu;

goto inside;
@@ -243,7 +248,7 @@ inside:
}

phys_id = cpu_logical_map(cpu);
- } while (!IRQ_ALLOWED(phys_id, allowed_mask) || !idle_cpu(phys_id));
+ } while (!IRQ_ALLOWED(phys_id, allowed_mask) || !irq_idle_cpu(phys_id));

return cpu;
}

>
> Andrea


Andrea

2003-05-27 01:09:56

by David Miller

Subject: Re: userspace irq balancer

From: Andrea Arcangeli <[email protected]>
Date: Tue, 27 May 2003 03:17:50 +0200

I'm going to try this (if it compiles ;). the ksoftirqd check is
the one for the NAPI workload brought to attention by Dave.

Ksoftirqd should not be running on a properly functioning system.

In fact, I know lots of people who are simply making ksoftirqd
only run if we do the softirq loop N times where N is very large
in order to avoid the performance problems that result from ksoftirqd.

2003-05-27 01:09:02

by David Miller

Subject: Re: userspace irq balancer

From: Dave Jones <[email protected]>
Date: Tue, 27 May 2003 02:16:20 +0100

On Tue, May 27, 2003 at 03:09:03AM +0200, Andrea Arcangeli wrote:

> rdtsc could do it very well, irqs and softirqs can't be rescheduled so
> you can tick measure how long you take in each cpu

On CPUs that vary frequency, this will break, unless TSC scales
with frequency. You cannot assume that this will be the case.

This is an important issue, for another reason.

The networking packet scheduler layer wants an accurate (but
cheap) high frequency time source too.

I keep forgetting to go back and deal with fixing up all of
those hairy macros in pkt_sched.h, I've added this to my TODO
list.

2003-05-27 01:14:27

by Andrea Arcangeli

Subject: Re: userspace irq balancer

On Mon, May 26, 2003 at 06:13:09PM -0700, David S. Miller wrote:
> From: Andrea Arcangeli <[email protected]>
> Date: Tue, 27 May 2003 03:09:03 +0200
>
> I'm not going to implement the above in 2.4, that sounds a 2.5 thing,
>
> Then your 2.4.x load balancing is buggy for networking.

it's not buggy, it's less performant than what 2.5 could be. It's not a
matter of bugs, it's a matter of performance; this is a heuristic, it
can very well do the wrong thing sometimes.

What I care about is whether it is less performant than any other
2.4 and any current 2.5. That is non-obvious to me. The approximation
will never be as good as perfect accounting, but it's still better
than no approximation at all IMHO, and for sure I don't want to waste
totally idle cpus on a 32way either.

> You simply cannot ignore this issue and act as if it
> does not exist and does not have huge consequence for IRQ
> load balancing decisions.

The only thing the ksoftirqd check can do is to generate fewer
consequences now.

> but my point is that by just ignoring ksoftirqd in the idle selection
> should avoid the biggest of the NAPI issues.
>
> On a properly functioning system, ksoftirqd should not be running.

I'd argue with that: NAPI needs to poll somehow, so either you hook into
the kernel, slowing down every single schedule, or you need to offload
this work to a kernel thread.

The other cases of ksoftirqd are meant to avoid the 1msec latency should
the cpu go idle, or should the irqs arrive faster than the network stack
can process the data. They're all legitimate usages IMHO. And we should
be fine keeping irqs running together with the softirq; that's the point
of this new check.

> > But deciding how to intepret these measurements and what to do in
> > response is a userlevel policy decision. This also coincides with
> > how cpufreq works.
>
> you mean you can have slightly different modes selectable by sysctl
> right?
>
> One posibility. Another is a descriptor describing things like
> how much to weight hardware vs. software IRQ load, vs. process
> load etc.

this certainly sounds good to me.

>
> or do you really want to generate a reschedule per second
>
> No, nothing like this.

ok.

Andrea

2003-05-27 01:18:31

by Andrea Arcangeli

Subject: Re: userspace irq balancer

On Tue, May 27, 2003 at 02:16:20AM +0100, Dave Jones wrote:
> On Tue, May 27, 2003 at 03:09:03AM +0200, Andrea Arcangeli wrote:
>
> > > So I'm asking you, again, how are you going to measure softirq load in
> > > making hardware IRQ load balancing decisions? Watching the scheduling
> >
> > rdtsc could do it very well, irqs and softirqs can't be rescheduled so
> > you can tick measure how long you take in each cpu
>
> On CPUs that vary frequency, this will break, unless TSC scales
> with frequency. You cannot assume that this will be the case.

those stats would be per-second or similar anyway, so unless you change
frequency every second it won't matter. It's a heuristic. And
especially if you change frequency on all cpus at nearly the same time,
as I expect, it will matter even less since it would decrease the window
even more.

Andrea

2003-05-27 01:21:36

by Andrea Arcangeli

Subject: Re: userspace irq balancer

On Mon, May 26, 2003 at 06:20:26PM -0700, David S. Miller wrote:
> From: Andrea Arcangeli <[email protected]>
> Date: Tue, 27 May 2003 03:17:50 +0200
>
> I'm going to try this (if it compiles ;). the ksoftirqd check is
> the one for the NAPI workload brought to attention by Dave.
>
> Ksoftirqd should not be running on a properly functioning system.
>
> In fact, I know lots of people who are simply making ksoftirqd
> only run if we do the softirq loop N times where N is very large
> in order to avoid the performance problems that result from ksoftirqd.

note that what those people are doing is very legitimate. I'll
also change this in my current tree right now, so we'll see if they
will be fine with my next tree. It would be interesting to know
which N they are using though.

I very well recall that when we introduced ksoftirqd we waited for people
to ask if N == 1 was too low; Linus just said he would increase it
if it was generating spurious reschedules during load peaks.

Note however that the larger N the more it will tend to hang the box
(i.e. less fair) with NAPI. I certainly don't doubt network performance
can increase slightly.

Andrea

2003-05-27 01:43:29

by William Lee Irwin III

Subject: Re: userspace irq balancer

On Mon, May 26, 2003 at 05:48:41PM -0700, David S. Miller wrote:
> Andrea, whether ksoftirqd processes the softirq work or not has
> nothing to do with what I'm talking about.
> It is all about what does a hardware IRQ mean in terms of work
> processed. And it can mean anything from 1 to 1000 packets worth
> of work.
> Therefore, any usage of hardware IRQ activity to determine "load" in
> any sense is totally inaccurate.
> So I'm asking you, again, how are you going to measure softirq load in
> making hardware IRQ load balancing decisions? Watching the scheduling
> and running of ksoftirqd is not an answer. Networking hardware
> interrupts, with a simplistic and mindless algorithm like the one we
> have currently in the 2.5.x IRQ balancing code, appear to be
> contributing very little to "load" and that is wrong.

I should also point out that the cost of reprogramming the interrupt
controllers isn't taken into account by the kernel irq balancer. In
the userspace implementation the reprogramming is done infrequently
enough to make even significant cost negligible; in-kernel the cost
is entirely uncontrolled and the rate of reprogramming unlimited.

Also, Linux' i386 IO-APIC programming model is quite fragile and does
not properly distinguish between physical and logical destinations or
SAPIC vs. xAPIC (which differ in the physical destination format) to
keep it coherent with i386 IO-APIC's DESTMOD. I would very much like to
see that confusion corrected before any significant amount of online
i386 IO-APIC RTE reprogramming is considered "stable". For instance, I
know of one subarch that claims to use logical DESTMOD with clustered
hierarchical DFR, but is using what appears to be SAPIC physical
broadcast for the RTE's, and a couple of other confusions where the
types of APIC ID's are ambiguous depending on subarch and broken by
dynamic reprogramming. It furthermore assumes flat logical DFR by
virtue of attempting to form APIC destinations representing arbitrary
sets of cpus in addition to assuming at least logical with
cpumask_to_logical_apicid() and is one of the major reasons irqbalance
is either disabled or unusable in various subarches.

The story of APIC code tripping over itself is an even unfunnier comedy
of errors, as the lack of TPR adjustment means that within any APIC
destination at which IO-APIC RTE's are targeted on Pentium IV systems
there will always be just a single cpu at which all interrupts are
concentrated. In order to work around this, all of the buggy code
choking on the fact arbitrary sets of cpus aren't representable as APIC
destinations is actually unused except as a buggy translation layer
from cpu ID's to APIC destinations, and the irqbalancing code works
around this by forming singleton cpumasks, which have historically been
frequently confused with APIC destinations of all 4 different formats.

Basically, the kernel has yet to handle IO-APIC RTE programming
properly, and until there is a remote semblance of action moving it
toward the correct formation of IO-APIC RTE's, in-kernel irqbalancing
is a house of cards built on rapidly shifting sands. There is no point
in anything but a userspace driver where the complexity the kernel has
failed to handle thus far can be punted, or reliance on hardware
mechanisms like the TPR that insulate the kernel from its prior and
current embarrassments in handling this complexity, until something is
done to correct IO-APIC RTE formation.


-- wli

2003-05-27 01:46:05

by Andrew Morton

Subject: Re: userspace irq balancer

William Lee Irwin III <[email protected]> wrote:
>
> In
> the userspace implementation the reprogramming is done infrequently
> enough to make even significant cost negligible; in-kernel the cost
> is entirely uncontrolled and the rate of reprogramming unlimited.

eh?

#define MAX_BALANCED_IRQ_INTERVAL (5*HZ)
#define MIN_BALANCED_IRQ_INTERVAL (HZ/2)

2003-05-27 01:57:10

by William Lee Irwin III

Subject: Re: userspace irq balancer

William Lee Irwin III <[email protected]> wrote:
>> In
>> the userspace implementation the reprogramming is done infrequently
>> enough to make even significant cost negligible; in-kernel the cost
>> is entirely uncontrolled and the rate of reprogramming unlimited.

On Mon, May 26, 2003 at 06:59:20PM -0700, Andrew Morton wrote:
> eh?
> #define MAX_BALANCED_IRQ_INTERVAL (5*HZ)
> #define MIN_BALANCED_IRQ_INTERVAL (HZ/2)

The number of interrupt sources on a system ends up scaling this up to
numerous IO-APIC RTE reprograms and ioapic_lock acquisitions per second
(granted, with a 5s timeout between reprogramming storms), where it
competes against IO-APIC interrupt acknowledgements.

Making the lock per- IO-APIC would at least put a bound on the number
of competitors mutually interfering with each other, but a tighter
bound on the amount of work than NR_IRQS would be more useful than that.


-- wli

2003-05-27 02:00:58

by Andrea Arcangeli

Subject: Re: userspace irq balancer

On Mon, May 26, 2003 at 06:53:07PM -0700, William Lee Irwin III wrote:
> On Mon, May 26, 2003 at 05:48:41PM -0700, David S. Miller wrote:
> > Andrea, whether ksoftirqd processes the softirq work or not has
> > nothing to do with what I'm talking about.
> > It is all about what does a hardware IRQ mean in terms of work
> > processed. And it can mean anything from 1 to 1000 packets worth
> > of work.
> > Therefore, any usage of hardware IRQ activity to determine "load" in
> > any sense is totally inaccurate.
> > So I'm asking you, again, how are you going to measure softirq load in
> > making hardware IRQ load balancing decisions? Watching the scheduling
> > and running of ksoftirqd is not an answer. Networking hardware
> > interrupts, with a simplistic and mindless algorithm like the one we
> > have currently in the 2.5.x IRQ balancing code, appear to be
> > contributing very little to "load" and that is wrong.
>
> I should also point out that the cost of reprogramming the interrupt
> controllers isn't taken into account by the kernel irq balancer. In

do you want to take that into account in userspace? if there's a place to
take that into account that place is the kernel. You can even benchmark
it at boot.

> the userspace implementation the reprogramming is done infrequently
> enough to make even significant cost negligible; in-kernel the cost
> is entirely uncontrolled and the rate of reprogramming unlimited.

depends on the kernel algorithm.

I feel like the in kernel algorithm is considered to be the one floating
around that reprograms the apic even when it makes zero changes to the
routing, like if nothing else was possible to do in kernel.

start like this: put the userspace algorithm in kernel, then add a
few bytes of info to keep an average of the idle cpus every second; then,
after a cpu has been idle for 30 seconds, start to route the irqs to such
an idle cpu, slowly, and after 60 seconds more aggressively, etc. For such
an algorithm you couldn't care less about the reprogramming speed, just
like with the current "userspace" algorithm, but thanks to the kernel info
it will be able to make smarter decisions that would never be possible in
userspace (w/o the tlb flushing waste, and w/o the kernel->user
microkernel protocol implementation waste).

> Also, Linux' i386 IO-APIC programming model is quite fragile and does
> not properly distinguish between physical and logical destinations or
> SAPIC vs. xAPIC (which differ in the physical destination format) to
> keep it coherent with i386 IO-APIC's DESTMOD. I would very much like to
> see that confusion corrected before any significant amount of online
> i386 IO-APIC RTE reprogramming is considered "stable". For instance, I
> know of one subarch that claims to use logical DESTMOD with clustered
> hierarchical DFR, but is using what appears to be SAPIC physical
> broadcast for the RTE's, and a couple of other confusions where the
> types of APIC ID's are ambiguous depending on subarch and broken by
> dynamic reprogramming. It furthermore assumes flat logical DFR by
> virtue of attempting to form APIC destinations representing arbitrary
> sets of cpus in addition to assuming at least logical with
> cpumask_to_logical_apicid() and is one of the major reasons irqbalance
> is either disabled or unusable in various subarches.
>
> The story of APIC code tripping over itself is an even unfunnier comedy
> of errors, as the lack of TPR adjustment means that within any APIC
> destination at which IO-APIC RTE's are targeted on Pentium IV systems
> there will always be just a single cpu at which all interrupts are
> concentrated. In order to work around this, all of the buggy code
> choking on the fact arbitrary sets of cpus aren't representable as APIC
> destinations is actually unused except as a buggy translation layer
> from cpu ID's to APIC destinations, and the irqbalancing code works
> around this by forming singleton cpumasks, which have historically been
> frequently confused with APIC destinations of all 4 different formats.
>
> Basically, the kernel has yet to handle IO-APIC RTE programming
> properly, and until there is a remote semblance of action moving it
> toward the correct formation of IO-APIC RTE's, in-kernel irqbalancing
> is a house of cards built on rapidly shifting sands. There is no point

again, reading this I feel like there's the idea that the only possible
kernel algorithm is the one that bounces stuff and reprograms stuff as
quickly as it can like the hardware one did.

> in anything but a userspace driver where the complexity the kernel has
> failed to handle thus far can be punted, or reliance on hardware
> mechanisms like the TPR that insulate the kernel from its prior and
> current embarrassments in handling this complexity, until something is
> done to correct IO-APIC RTE formation.
>
>
> -- wli


Andrea

2003-05-27 02:02:07

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: userspace irq balancer

On Mon, May 26, 2003 at 06:59:20PM -0700, Andrew Morton wrote:
> William Lee Irwin III <[email protected]> wrote:
> >
> > In
> > the userspace implementation the reprogramming is done infrequently
> > enough to make even significant cost negligible; in-kernel the cost
> > is entirely uncontrolled and the rate of reprogramming unlimited.
>
> eh?
>
> #define MAX_BALANCED_IRQ_INTERVAL (5*HZ)
> #define MIN_BALANCED_IRQ_INTERVAL (HZ/2)

Yep.

Andrea

2003-05-27 02:16:32

by William Lee Irwin III

[permalink] [raw]
Subject: Re: userspace irq balancer

On Mon, May 26, 2003 at 06:53:07PM -0700, William Lee Irwin III wrote:
>> I should also point out that the cost of reprogramming the interrupt
>> controllers isn't taken into account by the kernel irq balancer. In

On Tue, May 27, 2003 at 04:14:07AM +0200, Andrea Arcangeli wrote:
> do you want to take that into account in userspace? if there's a place to
> take that into account that place is the kernel. You can even benchmark
> it at boot.

Userspace is preemptable and schedulable, so it's inherently rate
limited.


On Mon, May 26, 2003 at 06:53:07PM -0700, William Lee Irwin III wrote:
>> the userspace implementation the reprogramming is done infrequently
>> enough to make even significant cost negligible; in-kernel the cost
>> is entirely uncontrolled and the rate of reprogramming unlimited.

On Tue, May 27, 2003 at 04:14:07AM +0200, Andrea Arcangeli wrote:
> depends on the kernel algorithm.
> I feel like the in-kernel algorithm is assumed to be the one floating
> around that reprograms the apic even when it makes zero changes to the
> routing, as if nothing else were possible to do in kernel.
> start like this: put the userspace algorithm in kernel, then add a
> few bytes of info to keep an average of the idle cpus every second, then
> after 30 seconds a cpu is idle start to route the irqs to such idle cpu,
> slowly, after 60 seconds more aggressively. etc... For such an algorithm
> you don't care about the reprogramming speed, just like with the
> current "userspace" algorithm, but thanks to the in-kernel info it will be
> able to make smarter decisions that would never be possible in userspace
> (w/o tlb flushing waste, and w/o kernel->user microkernel protocol
> implementation waste).

No, I'm not assuming that level of naivete. My primary interest is that
the amount of work be properly rate limited, and running at fixed
intervals isn't quite enough; it needs to be bounded amounts of work at
fixed intervals. I failed to point this out, but something more
incremental than a NR_IRQS sweep across all IRQ's every 60s is needed
for proper rate limiting.
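The "bounded amounts of work at fixed intervals" point can be sketched roughly like this (every name and constant below is illustrative, invented for this example, and not taken from any actual balancer):

```c
#include <assert.h>

#define NR_IRQS    224   /* typical i386 value of the era */
#define IRQ_BATCH   16   /* illustrative per-pass work bound */

static int balance_cursor;   /* where the previous pass stopped */

/*
 * One rate-limited balancer pass: examine at most IRQ_BATCH interrupts,
 * resuming where the last pass left off, so the work done per timer
 * interval is bounded by IRQ_BATCH rather than by NR_IRQS.
 */
int balance_one_pass(void)
{
    int examined;

    for (examined = 0; examined < IRQ_BATCH; examined++) {
        int irq = balance_cursor;

        balance_cursor = (balance_cursor + 1) % NR_IRQS;
        (void)irq;   /* here: sample load, maybe reprogram the RTE */
    }
    return examined;
}
```

A full sweep then takes NR_IRQS / IRQ_BATCH passes, spreading the reprogramming cost over many intervals instead of concentrating it in one storm.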


On Mon, May 26, 2003 at 06:53:07PM -0700, William Lee Irwin III wrote:
>> The story of APIC code tripping over itself is an even unfunnier comedy
>> of errors, as the lack of TPR adjustment means that within any APIC
[...]

On Tue, May 27, 2003 at 04:14:07AM +0200, Andrea Arcangeli wrote:
> again, reading this I feel like there's the idea that the only possible
> kernel algorithm is the one that bounces stuff and reprograms stuff as
> quickly as it can like the hardware one did.

This is actually a more general concern about correctness. Any
in-kernel algorithm must rely on the in-kernel IO-APIC RTE formation
code, which is highly problematic at best, as partially described by
all of the confusions and incorrect declarations mentioned above. Even
the "Wal-Mart" SMP subarch, used for the most common of i386 machines,
incorrectly declares its physical broadcast destination to be non-xAPIC
physical broadcast despite being used for Pentium IV and prior cpus.


-- wli

2003-05-27 02:14:56

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: userspace irq balancer

On Mon, 26 May 2003, William Lee Irwin III wrote:

> The number of interrupt sources on a system ends up scaling this up to
> numerous IO-APIC RTE reprograms and ioapic_lock acquisitions per-second
> (granted, with a 5s timeout between reprogramming storms) where it
> competes against IO-APIC interrupt acknowledgements.
>
> Making the lock per- IO-APIC would at least put a bound on the number
> of competitors mutually interfering with each other, but a tighter
> bound on the amount of work than NR_IRQS would be more useful than that.

Ok there are 16 IOAPICs on an 8quad, but really if we start banging on
that lock someone is doing way too much hardware access...

Zwane
--
function.linuxpower.ca

2003-05-27 02:31:27

by William Lee Irwin III

[permalink] [raw]
Subject: Re: userspace irq balancer

On Mon, 26 May 2003, William Lee Irwin III wrote:
>> The number of interrupt sources on a system ends up scaling this up to
>> numerous IO-APIC RTE reprograms and ioapic_lock acquisitions per-second
>> (granted, with a 5s timeout between reprogramming storms) where it
>> competes against IO-APIC interrupt acknowledgements.
>> Making the lock per- IO-APIC would at least put a bound on the number
>> of competitors mutually interfering with each other, but a tighter
>> bound on the amount of work than NR_IRQS would be more useful than that.

On Mon, May 26, 2003 at 10:15:23PM -0400, Zwane Mwaikambo wrote:
> Ok there are 16 IOAPICs on an 8quad, but really if we start banging on
> that lock someone is doing way too much hardware access...

It's done to acknowledge every interrupt. Also, there is additional
cost associated with bouncing the lock's cacheline.


-- wli

2003-05-27 02:44:46

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: userspace irq balancer

On Mon, 26 May 2003, William Lee Irwin III wrote:

> On Mon, May 26, 2003 at 10:15:23PM -0400, Zwane Mwaikambo wrote:
> > Ok there are 16 IOAPICs on an 8quad, but really if we start banging on
> > that lock someone is doing way too much hardware access...
>
> It's done to acknowledge every interrupt. Also, there is additional
> cost associated with bouncing the lock's cacheline.

Bah, determining the owning ioapic of an irq would get too ugly; you can have
the same irq connected to multiple ioapics, so which one do you lock?

Zwane
--
function.linuxpower.ca

2003-05-27 04:12:31

by William Lee Irwin III

[permalink] [raw]
Subject: Re: userspace irq balancer

On Mon, 26 May 2003, William Lee Irwin III wrote:
>> It's done to acknowledge every interrupt. Also, there is additional
>> cost associated with bouncing the lock's cacheline.

On Mon, May 26, 2003 at 10:45:20PM -0400, Zwane Mwaikambo wrote:
> Bah, determining the owning ioapic of an irq would get too ugly; you can have
> the same irq connected to multiple ioapics, so which one do you lock?

It tends to be all or one, detectable with IO_APIC_IRQ().
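To illustrate the "all or one" point, a per-IO-APIC locking scheme might pick its locks along these lines (irq_owner[], the two helpers, and the constants are all invented for this sketch; the real kernel would derive ownership from the MP-table, e.g. via IO_APIC_IRQ()):

```c
#include <assert.h>

#define MAX_IO_APICS 8
#define NR_IRQS    224

/* -1 means "routed through all IO-APICs"; otherwise the owning index. */
static int irq_owner[NR_IRQS];

/* A single-owner irq needs only its own IO-APIC's lock. */
int nr_locks_for_irq(int irq)
{
    return irq_owner[irq] >= 0 ? 1 : MAX_IO_APICS;
}

/*
 * An all-IO-APICs irq must take every lock, always in ascending index
 * order, so two CPUs locking overlapping sets can never deadlock.
 */
int first_lock_for_irq(int irq)
{
    return irq_owner[irq] >= 0 ? irq_owner[irq] : 0;
}
```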


-- wli

2003-05-27 06:00:03

by David Miller

[permalink] [raw]
Subject: Re: userspace irq balancer

From: Andrea Arcangeli <[email protected]>
Date: Tue, 27 May 2003 03:26:17 +0200

I argue with that, NAPI needs to poll somehow, either you hook into the
kernel slowing down every single schedule, or you need to offload this
work to a kernel thread.

You've never shown what this "offloading work to a kernel thread"
actually accomplishes.

What I've seen it do is decrease the amount of total softirq work that
cpu can get done. And avoiding ksoftirqd actually running makes
performance get better.

2003-05-27 08:54:24

by Arjan van de Ven

[permalink] [raw]
Subject: Re: userspace irq balancer

On Tue, 2003-05-27 at 03:17, David S. Miller wrote:

> This is an important issue, for another reason.
>
> The networking packet scheduler layer wants an accurate (but
> cheap) high frequency time source too.
>
> I keep forgetting to go back and deal with fixing up all of
> those hairy macros in pkt_sched.h, I've added this to my TODO
> list.

I have a 2.4 patch to use the acpitimer for things like this. That (when
present) provides accurate high frequency time info.



2003-05-27 08:58:31

by David Miller

[permalink] [raw]
Subject: Re: userspace irq balancer

From: Arjan van de Ven <[email protected]>
Date: 27 May 2003 11:07:32 +0200

I have a 2.4 patch to use the acpitimer for things like this. That
(when present) provides accurate high frequency time info.

I don't expect many problems on the 2.5.x side (which is where I plan
on doing this) since x86 appears to have a driver-like architecture
to handle the various types of timers.

2003-05-27 11:40:02

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: userspace irq balancer

On Mon, May 26, 2003 at 11:11:20PM -0700, David S. Miller wrote:
> From: Andrea Arcangeli <[email protected]>
> Date: Tue, 27 May 2003 03:26:17 +0200
>
> I argue with that, NAPI needs to poll somehow, either you hook into the
> kernel slowing down every single schedule, or you need to offload this
> work to a kernel thread.
>
> You've never shown what this "offloading work to a kernel thread"
> actually accomplishes.

in case it wasn't obvious (that is the whole point of ksoftirqd), what
it accomplishes in short is "fairness" and "not starving userspace
during networking".

> What I've seen it do is decrease the amount of total softirq work that
> cpu can get done. And avoiding ksoftirqd actually running makes
> performance get better.

sure, as long as you don't care about anything but the network load. I
mean, if you couldn't care less about userspace progress and you don't want
the usual scheduler fairness guarantees, then you can hack the kernel
and replace ksoftirqd with an infinite loop, and networking will
certainly perform better since it will be able to stall all userspace
computations indefinitely in favour of pure irq-driven networking I/O
running entirely in irq context and never ending.

I really thought this was obvious to everybody; otherwise there would be
no point for ksoftirqd at all if you couldn't care less about hanging
userspace indefinitely.

Andrea

2003-05-27 21:53:36

by David Miller

[permalink] [raw]
Subject: Re: userspace irq balancer

From: Andrea Arcangeli <[email protected]>
Date: Tue, 27 May 2003 13:53:14 +0200

in case it wasn't obvious (that is the whole point of ksoftirqd), what
it accomplishes in short is "fairness" and "not starving userspace
during networking".

The problem is that it gives up and goes to ksoftirqd far too easily.

Also, if a softirq is triggered between when we wake up ksoftirqd and
ksoftirqd actually runs, we just run the loop again in do_softirq().

This situation is even more likely if we are being "softirq bombed".
In fact in such a situation it is almost a certainty that do_softirq()
will execute multiple times before we schedule to any task.

In fact, and here is the important part, we probably won't run very
much userspace at all if we are being "softirq bombed". Every trap,
softirq causing or not, is going to cause us to drop into do_softirq()
again and again and again.

Perhaps even, we will drain the pending softirqs before ksoftirqd even
gets to execute. In this case the ksoftirqd wakeup and context switch
is a total waste of cpu cycles.

You are trying to apply flow control in an odd way to softirqs.
But the problem with such schemes is that they absolutely do not
make the problem go away. You are merely moving the work from one
place to another, and in many cases adding more useless work. The one
thing you don't do when you are resource limited is take more of those
resources away, and that is exactly what the ksoftirqd scheme does.

2003-05-27 22:13:33

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: userspace irq balancer

On Tue, May 27, 2003 at 03:04:49PM -0700, David S. Miller wrote:
> From: Andrea Arcangeli <[email protected]>
> Date: Tue, 27 May 2003 13:53:14 +0200
>
> in case it wasn't obvious (that is the whole point of ksoftirqd) what
> accomplishes in a single word is "fairness" and "not starving userspace
> during networking".
>
> The problem is that it gives up and goes to ksoftirqd far too easily.

I see your point, please try with 2.4.21rc4aa1 or with this patch:

http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.21rc4aa1/00_ksoftirqd-max-loop-networking-1

you can put a printk in the ksoftirqd loop and tune the N until it
behaves as you want.
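The shape of the tunable is roughly this (a sketch of the idea only, not the actual patch; the counter and function names are invented for illustration):

```c
#include <assert.h>

#define MAX_SOFTIRQ_LOOPS 8   /* the tunable "N" */

/* Pending softirq work, modelled as a simple counter for this sketch. */
static int softirq_pending_count;

/*
 * ksoftirqd-style inner loop: drain pending work, but give the CPU back
 * to the scheduler after at most MAX_SOFTIRQ_LOOPS iterations, so a
 * softirq flood cannot monopolize the CPU.  Returns iterations done.
 */
int run_bounded(void)
{
    int loops = 0;

    while (softirq_pending_count > 0 && loops < MAX_SOFTIRQ_LOOPS) {
        softirq_pending_count--;   /* stand-in for one do_softirq() run */
        loops++;
    }
    /* work may remain here; the real loop would reschedule and retry */
    return loops;
}
```

Raising N lets a flood be drained in fewer trips through the scheduler; lowering it hands the CPU back to userspace sooner.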

> Also, if a softirq is triggered between when we wake up ksoftirqd and
> ksoftirqd actually runs, we just run the loop again in do_softirq().

but the loop will do nothing; I mean, ksoftirqd checks if some work
has to be done, and that's achieved in do_softirq, exactly the same check
each irq does before returning to userspace. No difference.

> This situation is even more likely if we are being "softirq bombed".
> In fact in such a situation it is almost a certainty that do_softirq()
> will execute multiple times before we schedule to any task.
>
> In fact, and here is the important part, we probably won't run very
> much userspace at all if we are being "softirq bombed". Every trap,
> softirq causing or not, is going to cause us to drop into do_softirq()
> again and again and again.
>
> Perhaps even, we will drain the pending softirqs before ksoftirqd even
> gets to execute. In this case the ksoftirqd wakeup and context switch
> is a total waste of cpu cycles.
>
> You are trying to apply flow control in an odd way to softirqs.
> But the problem with such schemes is that they absolutely do not
> make the problem go away. You are merely moving the work from one
> place to another, and in many cases adding more useless work. The one
> thing you don't do when you are resource limited is take more of those
> resources away, and that is exactly what the ksoftirqd scheme does.

That's a purely theoretical case that Ingo also made once, but you should
do all the theory and also compute the probability of the irqs arriving
at the exact timing that prevents userspace from making progress with the
ksoftirqd design for a significant amount of time. That probability is too
low to care about. Whereas when you're flooded at max speed ksoftirqd
generates a fair load, plus it allows NAPI to work at all, which is
needed on firewalls.

The only problem I can see is that the spikes of load may generate
spurious ksoftirqd reschedules, I acknowledge that, and for those we can
just change the #define in the above patch. I flooded my boxes through
100mbit and ksoftirqd apparently isn't running at all (not that I ever
noticed it running significantly though ;). I don't have gigabit to test,
that's why I ask.

Andrea

2003-05-27 23:43:47

by David Miller

[permalink] [raw]
Subject: Re: userspace irq balancer

From: Andrea Arcangeli <[email protected]>
Date: Wed, 28 May 2003 00:27:12 +0200

On Tue, May 27, 2003 at 03:04:49PM -0700, David S. Miller wrote:
> The problem is that it gives up and goes to ksoftirqd far too easily.

I see your point, please try with 2.4.21rc4aa1 or with this patch:

Thanks Andrea, I will go from theorist to real scientist over
the next few days and try to come up with some real experimental
evidence of how all of this behaves.

Thanks.

2003-06-13 06:13:50

by David Miller

[permalink] [raw]
Subject: Re: userspace irq balancer

From: Andrea Arcangeli <[email protected]>
Date: Wed, 28 May 2003 00:27:12 +0200

I see your point, please try with 2.4.21rc4aa1 or with this patch:

http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.21rc4aa1/00_ksoftirqd-max-loop-networking-1

you can put a printk in the ksoftirqd loop and tune the N until it
behaves as you want.

Ingo's specweb testing indicated that a value somewhere between 8 and
10 appears optimal.

I've pushed this change into Andrew's -mm 2.5.x patch set.

2003-06-13 18:08:52

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: userspace irq balancer

On Thu, Jun 12, 2003 at 11:22:49PM -0700, David S. Miller wrote:
> From: Andrea Arcangeli <[email protected]>
> Date: Wed, 28 May 2003 00:27:12 +0200
>
> I see your point, please try with 2.4.21rc4aa1 or with this patch:
>
> http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.21rc4aa1/00_ksoftirqd-max-loop-networking-1
>
> you can put a printk in the ksoftirqd loop and tune the N until it
> behaves as you want.
>
> Ingo's specweb testing indicated that a value somewhere between 8 and
> 10 appears optimal.

Sounds very good.

>
> I've pushed this change into Andrew's -mm 2.5.x patch set.

Thanks!

Andrea