2001-10-01 22:19:06

by Ingo Molnar

Subject: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5


to sum things up, we have three main problem areas that are connected to
hardirq and softirq processing:

- a little utility written by Simon Kirby proved that no matter how much
softirq throttling we do, it's easy to lock up a pretty powerful Linux
box via a high rate of network interrupts, even from relatively
low-powered clients. 2.4.6, 2.4.7 and 2.4.10 all lock up. Alexey has said
as well that it's still easy to lock up low-powered Linux routers via
more or less normal traffic.

- prior to 2.4.7 we used to 'leak' softirq handling => we ended up missing
softirqs in a number of circumstances. Stock 2.4.10 still has a number
of places that do this too.

- a number of people have reported gigabit performance problems (some
people reported a 10-20% drop in performance under load) since
ksoftirqd was introduced - which was added to fix some of the 2.4.6
softirq-handling latency problems.

we also have another problem that often pops up when the BIOS goes bad or
a device driver makes some mistake:

- Linux often 'locks up' if it gets into an 'interrupt storm' - when an
interrupt source sends a very high rate of interrupts. This can be
seen as boot-time hangs and module-insert-time hangs as well.

the attached patch, while a bit radical, is i believe a robust solution to
all four problems. It gives gigabit performance back, avoids the lockups
and attempts to reach as short a softirq-processing latency as possible.

the new mechanism:

- the irq handling code has been extended to support 'soft mitigation',
ie. to mitigate the rate of hardware interrupts, without support from
the actual hardware. There is a reasonable default, but the value can
also be decreased/increased on a per-irq basis via /proc/irq/NR/max_rate.

the method is the following. We count the number of interrupts serviced,
and if within a jiffy there are more than max_rate interrupts, the code
disables the IRQ source and marks it as IRQ_MITIGATED. On the next timer
interrupt the irq_rate_check() function is called, which makes sure that
'blocked' irqs are restarted & handled properly. The interrupt is disabled
in the interrupt controller, which has the nice side-effect of fixing and
blocking interrupt storms. (The support code for 'soft mitigation' is
designed to be very lightweight, it's a decrement and a test in the IRQ
handling hot path.)
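
to make this concrete, here is a rough sketch of the idea - not the actual
patch code; the array names and exact bookkeeping below are illustrative
only (IRQ_MITIGATED is the new status flag, irq_desc[]/NR_IRQS and the
->disable()/->enable() controller hooks are the stock 2.4 irq layer):

static int irqs_left[NR_IRQS];          /* budget left in the current jiffy */
static int irq_max_rate[NR_IRQS];       /* mirrors /proc/irq/NR/max_rate */

/* hot path, called from do_IRQ() for every interrupt; returns 1 if the
   interrupt should be mitigated instead of handled right now: */
static inline int irq_over_rate(unsigned int irq, irq_desc_t *desc)
{
        if (--irqs_left[irq] >= 0)
                return 0;
        desc->status |= IRQ_MITIGATED;
        desc->handler->disable(irq);    /* mask it in the PIC/APIC */
        return 1;
}

/* called once per jiffy from the timer interrupt: */
void irq_rate_check(void)
{
        unsigned int irq;

        for (irq = 0; irq < NR_IRQS; irq++) {
                irq_desc_t *desc = irq_desc + irq;

                irqs_left[irq] = irq_max_rate[irq];     /* refill the budget */
                if (desc->status & IRQ_MITIGATED) {
                        desc->status &= ~IRQ_MITIGATED;
                        desc->handler->enable(irq);     /* restart blocked irqs */
                }
        }
}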

(note that in case of shared interrupts, another 'innocent' device might
stay disabled for some short amount of time as well - but this is not an
issue because this mitigation does not make that device inoperable, it
just delays its interrupt by up to 10 msecs. Plus, modern systems have
properly distributed interrupts.)

- softirq code got simplified significantly. The concept is to 'handle all
pending softirqs' - just as the hardware IRQ code 'handles all hardware
interrupts that were passed to it'. Since most of the time there is a
direct relationship between softirq work and hardirq work, the
mitigation of hardirqs mitigates softirq load as well.

- ksoftirqd is gone, there is never any softirq pending while
softirq-unaware code is executing.

- the tasklet code needed some cleanup along the way, and it also won some
restart-on-enable and restart-on-unlock properties that it lacked
before. (but which are desired.)

due to these changes, the linecount in softirq.c got smaller by 25%.
[i dropped the unwakeup change - but that one could be useful in the VM,
to eg. unwakeup bdflush or kswapd.]

- drivers can optionally use the set_irq_rate(irq, new_rate) call to
change the current IRQ rate. Drivers are the ones who know best what
kind of loads to expect from the hardware, so they might want to
influence this value. Also, drivers that implement IRQ mitigation
themselves in hardware, can effectively disable the soft-mitigation code
by using a very high rate value.
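
for illustration, a driver that already mitigates interrupts in hardware
might do something like this in its open routine (hypothetical driver, the
rate values are made up):

static int mydev_open(struct net_device *dev)
{
        /* the card coalesces interrupts itself, so effectively turn the
           soft-mitigation off by allowing a very high rate: */
        set_irq_rate(dev->irq, 100000);

        /* a dedicated router box might instead clamp a busy NIC:
           set_irq_rate(dev->irq, 2000); */

        return 0;
}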

what is the concept behind all this? Simplicity, and a clean concept. We were
clearly heading in the wrong direction: putting more complexity into the
core softirq code to handle some really extreme and unusual cases. Also,
softirqs were slowly morphing into something process-ish - but in Linux we
already have a concept of processes, so we'd have two duelling concepts.
(We still have tasklets, which are not really processes - they are
single-threaded paths of execution.)

with this patch, softirqs can again be what they should be: lightweight
'interrupt code' that processes hard-IRQ events but still does this with
interrupts enabled, to allow for low hard-IRQ latencies. Anything that is
conceptually heavyweight IMO does not belong in softirqs, it should be
moved into process context. That will take care of CPU-time usage
accounting and CPU-time-limiting and priority issues as well.

(the patch also imports the latency and softirq-restart fixes from my
previous softirq patches.)

i've tested the patch on UP, SMP, XT-PIC and APIC systems; it
correctly limits network interrupt rates (and other device interrupt
rates) to the given limit. I've done stress-testing as well. The patch is
against 2.4.11-pre1, but it applies just fine to the -ac tree as well.

with a high irq-rate limit set, ping flooding has this effect on the
test-system:

[root@mars /root]# vmstat 1
procs memory swap io
r b w swpd free buff cache si so bi bo in
0 0 0 0 877024 1140 11364 0 0 12 0 30960
0 0 0 0 877024 1140 11364 0 0 0 0 30950
0 0 0 0 877024 1140 11364 0 0 0 0 30520

ie. 30k interrupts/sec. With the max_rate set to 1000 interrupts/sec:

[root@mars /root]# echo 1000 > /proc/irq/21/max_rate
[root@mars /root]# vmstat 1
procs memory swap io
r b w swpd free buff cache si so bi bo in
0 0 0 0 877004 1144 11372 0 0 0 0 1112
0 0 0 0 877004 1144 11372 0 0 0 0 1111
0 0 0 0 877004 1144 11372 0 0 0 0 1111

so it works just fine here. Interactive tasks are still snappy over the
same interface.

Comments, reports, suggestions and testing feedback are more than welcome,

Ingo


Attachments:
irq-rewrite-2.4.11-B5 (25.98 kB)

2001-10-01 22:38:29

by Andreas Dilger

Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

On Oct 02, 2001 00:16 +0200, Ingo Molnar wrote:
> - the irq handling code has been extended to support 'soft mitigation',
> ie. to mitigate the rate of hardware interrupts, without support from
> the actual hardware. There is a reasonable default, but the value can
> also be decreased/increased on a per-irq basis via /proc/irq/NR/max_rate.
>
> the method is the following. We count the number of interrupts serviced,
> and if within a jiffy there are more than max_rate interrupts, the code
> disables the IRQ source and marks it as IRQ_MITIGATED. On the next timer
> interrupt the irq_rate_check() function is called, which makes sure that
> 'blocked' irqs are restarted & handled properly.

How far is it to go from a mitigated IRQ (because of too high an interrupt
rate) to a polled interface (e.g. for network cards)? This was discussed
a number of times to improve overall performance on busy network systems.

Conceivably, a network card could tune max_rate to a value where it is
more efficient (CPU-wise) to poll the interface instead of using IRQs.
However, waiting for the next regular timer interrupt may be too long
(resulting in lost packets) as buffers overflow. Would it also be
possible for a driver to register a "maximum delay" between servicing
interrupts (within reason, on a non-RT system) so that it can say "I
have X kB of buffers, and the maximum line rate is Y kB/s, so I need
to be serviced within X/Y s when polling without losing data".

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2001-10-01 22:45:29

by Tim Hockin

Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

> - a little utility written by Simon Kirby proved that no matter how much
> softirq throttling, it's easy to lock up a pretty powerful Linux
> box via a high rate of network interrupts, from relatively low-powered
> clients as well. 2.4.6, 2.4.7, 2.4.10 all lock up. Alexey said it as
> well that it's still easy to lock up low-powered Linux routers via more
> or less normal traffic.

We proved this a year+ ago. We've got some code brewing to do fair sharing
of IRQs for heavy load situations. I don't have all the details, but
eventually...

> i've tested the patch on both UP, SMP, XT-PIC and APIC systems, it
> correctly limits network interrupt rates (and other device interrupt
> rates) to the given limit. I've done stress-testing as well. The patch is
> against 2.4.11-pre1, but it applies just fine to the -ac tree as well.

Our solution/needs are slightly different - we want to service as many
interrupts as possible and do as much network traffic as possible, and
interactive-tasks be damned.

2001-10-01 22:50:40

by Ingo Molnar

Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5


On Mon, 1 Oct 2001, Tim Hockin wrote:

> Our solution/needs are slightly different - we want to service as many
> interrupts as possible and do as much network traffic as possible, and
> interactive-tasks be damned.

the patch in fact enables this too: you can more aggressively get irqs
and softirqs executed by increasing max_rate just above the 'critical'
rate you can measure. (and the blocked-interrupts period of time will be
enough to let the softirq work finish.) So in fact you might even
end up having higher performance by blocking interrupts in a certain
portion of a timer tick - backlogged work will be processed. Via max_rate
you can partition the percentage of CPU time dedicated to softirq and
process work. (which in your case would be softirq-only work - which
should not be underestimated either.)

Ingo

2001-10-01 22:50:51

by Ben Greear

Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

Ingo Molnar wrote:

> (note that in case of shared interrupts, another 'innocent' device might
> stay disabled for some short amount of time as well - but this is not an
> issue because this mitigation does not make that device inoperable, it
> just delays its interrupt by up to 10 msecs. Plus, modern systems have
> properly distributed interrupts.)

I'm all for anything that speeds up (and makes more reliable) high network
speeds, but I often run with 8+ ethernet devices, so IRQs have to be shared,
and a 10ms lockdown on an interface could lose lots of packets. Although
it's not a perfect solution, maybe you could (in the kernel) multiply the
max by the number of things using that IRQ? For example, if you have four
ethernet drivers on one IRQ, then let that IRQ fire 4 times faster than
normal before putting it in lockdown...

Do you have any idea how many packets-per-second you can get out of a
system (obviously, your system of choice) using your updated code?

(I'm running about 7k packets-per-second tx, and 7k rx, on 3 EEPRO ports
simultaneously on a 1Ghz PIII and 2.4.9-pre10... This is from user-space,
so much of the CPU is spent hauling my packets to and from the device..)

Ben

--
Ben Greear <[email protected]> <[email protected]>
President of Candela Technologies Inc http://www.candelatech.com
ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear

2001-10-01 23:04:03

by Linus Torvalds

Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5


On Tue, 2 Oct 2001, Ingo Molnar wrote:
>
> - the irq handling code has been extended to support 'soft mitigation',
> ie. to mitigate the rate of hardware interrupts, without support from
> the actual hardware. There is a reasonable default, but the value can
> also be decreased/increased on a per-irq basis via /proc/irq/NR/max_rate.

And how do you select max_rate sanely? It depends on how heavy each
interrupt is, the speed of the CPU etc etc. A rate that works for a
network card with a certain packet size may be completely ineffective on
the same machine with the same network card but a different packet size.

When you select the wrong number, you slow the system down for no good
reason (too low a number) or your mitigation has zero effect because the
system can't do that many interrupts per tick anyway (too high a number).

Saying "hey, that's the users problem", is _not_ a solution. It needs to
have some automatic cut-off that finds the right sustainable rate
automatically, instead of hardcoding random default values and asking the
user to know the unknowable.

Automatically doing the right thing may be hard, but it should be
solvable. In particular, something like the following _may_ be a workable
approach, rather than having a hardcoded limit:

- have a notion of "made progress". Certain events count as progress, and
will reset the interrupt count.
Examples of "progress":
- idle task loop
- a context switch

- depend on the fact that on a PC, the timer interrupt has the highest
priority, and make the timer interrupt do something like

if (!made_progress) {
        disable_next_irq = 1;
} else
        made_progress = 0;

- have all other interrupts do something like

if (disable_next_irq)
        goto mitigate;

which just says that we mitigate an irq _only_ if we didn't make any
progress at all. Rather than mitigating on some random count that can
never be perfect.

(Tweak to suit your own definition of "made progress" - maybe you'd like
to require more than just a context switch).
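
Put together, the scheme could look roughly like this - a minimal sketch,
where the hook points and names are only illustrative, not a worked-out
patch:

static int made_progress;       /* set whenever "progress" happens */
static int disable_next_irq;

/* progress hooks - e.g. called from the idle loop and from schedule(): */
void note_progress(void)
{
        made_progress = 1;
}

/* from the timer interrupt, which has the highest priority on a PC: */
void timer_check_progress(void)
{
        if (!made_progress)
                disable_next_irq = 1;   /* nothing ran since the last tick */
        else
                made_progress = 0;      /* re-arm for the next tick */
}

/* early in every other interrupt: */
int should_mitigate(unsigned int irq)
{
        if (!disable_next_irq)
                return 0;
        disable_next_irq = 0;
        disable_irq_nosync(irq);        /* mitigate only when no progress was made */
        return 1;
}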

Linus

2001-10-02 00:44:10

by jamal

Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5


>The new mechanizm:
>
>- the irq handling code has been extended to support 'soft mitigation',
> ie. to mitigate the rate of hardware interrupts, without support from
> the actual hardware. There is a reasonable default, but the value can
> also be decreased/increased on a per-irq basis via
> /proc/irq/NR/max_rate.

I am sorry, but this is bogus. There is no _reasonable value_. A reasonable
value is dependent on system load, and has never been and never
will be measured by interrupt rates, even in non-work-conserving schemes.
There is already a feedback system built into 2.4 that
measures system load by the rate at which the system processes the backlog
queue. Look at the netif_rx return values. The only driver that utilizes
this currently is the tulip. Look at the tulip code.
This, in conjunction with h/ware flow control, should give you a sustainable
system.
[Granted that mitigation is a hardware-specific solution; the scheme we
presented at the kernel summit is the next level to this and will be
non-dependent on h/ware.]
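
For reference, that feedback is the return value of netif_rx(); a driver's
receive path can react to it roughly like this (hypothetical driver, the
mitigation helpers are made-up names):

/* hypothetical NIC RX path reacting to the netif_rx() congestion feedback */
static void mydev_rx(struct net_device *dev, struct sk_buff *skb)
{
        switch (netif_rx(skb)) {
        case NET_RX_SUCCESS:
        case NET_RX_CN_LOW:
                break;                          /* the stack is keeping up */
        case NET_RX_CN_MOD:
        case NET_RX_CN_HIGH:
                mydev_raise_rx_mitigation(dev); /* made-up helper: raise the card's
                                                   interrupt-coalescing delay */
                break;
        case NET_RX_DROP:
                mydev_disable_rx_irq(dev);      /* made-up helper: stop RX interrupts
                                                   until the backlog drains */
                break;
        }
}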

>(note that in case of shared interrupts, another 'innocent' device might
>stay disabled for some short amount of time as well - but this is not an
>issue because this mitigation does not make that device inoperable, it
>just delays its interrupt by up to 10 msecs. Plus, modern systems have
>properly distributed interrupts.)

This is a _really bad_ idea, not just because you are punishing other
devices.
Let's take network devices as an example: we don't want to disable
interrupts; we want to disable offending actions within the device. For
example, it is ok to disable/mitigate receive interrupts because they are
overloading the system, but not transmit completion, because that will add
to the overall latency.

cheers,
jamal


PS: we have been testing what was presented at the kernel summit for the
last few months with very promising results, both on live setups and on
experimental setups where data is generated at very high rates with
hardware traffic generators.

2001-10-02 01:04:56

by Benjamin LaHaise

Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

On Mon, Oct 01, 2001 at 08:41:20PM -0400, jamal wrote:
>
> >The new mechanizm:
> >
> >- the irq handling code has been extended to support 'soft mitigation',
> > ie. to mitigate the rate of hardware interrupts, without support from
> > the actual hardware. There is a reasonable default, but the value can
> > also be decreased/increased on a per-irq basis via
> > /proc/irq/NR/max_rate.
>
> I am sorry, but this is bogus. There is no _reasonable value_. Reasonable
> value is dependent on system load and has never been and never
> will be measured by interupt rates. Even in non-work conserving schemes

It is not dependent on system load, but rather on the performance of the
CPU and the number of interrupt sources in the system.

> There is already a feedback system that is built into 2.4 that
> measures system load by the rate at which the system processes the backlog
> queue. Look at netif_rx return values. The only driver that utilizes this
> is currently the tulip. Look at the tulip code.
> This in conjuction with h/ware flow control should give you sustainable
> system.

Not quite. You're still ignoring the effect of interrupts on the users'
ability to execute instructions during their timeslice.

> [Granted that mitigation is a hardware specific solution; the scheme we
> presented at the kernel summit is the next level to this and will be
> non-dependednt on h/ware.]
>
> >(note that in case of shared interrupts, another 'innocent' device might
> >stay disabled for some short amount of time as well - but this is not an
> >issue because this mitigation does not make that device inoperable, it
> >just delays its interrupt by up to 10 msecs. Plus, modern systems have
> >properly distributed interrupts.)
>
> This is a _really bad_ idea. not just because you are punishing other
> devices.

I'm afraid I have to disagree with you on this statement. What I will
agree with is that 10msec is too much.

> Lets take network devices as examples: we dont want to disable interupts;
> we want to disable offending actions within the device. For example, it is
> ok to disable/mitigate receive interupts because they are overloading the
> system but not transmit completion because that will add to the overall
> latency.

Wrong. Let me introduce you to my 486DX/33. It has PCI. I'm putting my
gige card into the poor beast. Transmitting full out, it can receive a
sufficiently high number of tx done interrupts that it has no CPU cycles left
to run, say, gated in userspace.

Falling back to polled operation is a well known technique in realtime and
reliable systems. By limiting the interrupt rate to a known safe limit,
the system will remain responsive to non-interrupt tasks even under heavy
interrupt loads. This is the point at which a thruput graph on a slow
machine shows a complete breakdown in performance, which is always possible
on a slow enough CPU with a high performance device that takes input from
a remotely controlled user. This is *required*, and is not optional, and
there is no way that a system can avoid it without making every interrupt
a task, but that's a mess nobody wants to see in Linux.

-ben

2001-10-02 01:57:28

by jamal

Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5



On Mon, 1 Oct 2001, Benjamin LaHaise wrote:

> On Mon, Oct 01, 2001 at 08:41:20PM -0400, jamal wrote:
> >
> > >The new mechanizm:
> > >
> > >- the irq handling code has been extended to support 'soft mitigation',
> > > ie. to mitigate the rate of hardware interrupts, without support from
> > > the actual hardware. There is a reasonable default, but the value can
> > > also be decreased/increased on a per-irq basis via
> > > /proc/irq/NR/max_rate.
> >
> > I am sorry, but this is bogus. There is no _reasonable value_. Reasonable
> > value is dependent on system load and has never been and never
> > will be measured by interupt rates. Even in non-work conserving schemes
>
> It is not dependant on system load, but rather on the performance of the
> CPU and the number of interrupt sources in the system.

i am not sure what you are getting at. CPU load is of course a function of
the CPU capacity. Assuming that interrupts are the only source of system
load is just bad engineering.

>
> > There is already a feedback system that is built into 2.4 that
> > measures system load by the rate at which the system processes the backlog
> > queue. Look at netif_rx return values. The only driver that utilizes this
> > is currently the tulip. Look at the tulip code.
> > This in conjuction with h/ware flow control should give you sustainable
> > system.
>
> Not quite. You're still ignoring the effect of interrupts on the users'
> ability to execute instructions during their timeslice.
>

And how does /proc/irq/NR/max_rate solve this?
I have a feeling you are trying to say that varying /proc/irq/NR/max_rate
gives user processes an opportunity to execute;
note that, although that is bad logic, you could also modify the high and
low watermarks for when we have congestion in the backlog queue
(this is already doable via /proc).

> > [Granted that mitigation is a hardware specific solution; the scheme we
> > presented at the kernel summit is the next level to this and will be
> > non-dependednt on h/ware.]
> >
> > >(note that in case of shared interrupts, another 'innocent' device might
> > >stay disabled for some short amount of time as well - but this is not an
> > >issue because this mitigation does not make that device inoperable, it
> > >just delays its interrupt by up to 10 msecs. Plus, modern systems have
> > >properly distributed interrupts.)
> >
> > This is a _really bad_ idea. not just because you are punishing other
> > devices.
>
> I'm afraid I have to disagree with you on this statement. What I will
> agree with is that 10msec is too much.
>

It is unfair to add any latency to a device that didn't cause or
contribute to the havoc.


> > Lets take network devices as examples: we dont want to disable interupts;
> > we want to disable offending actions within the device. For example, it is
> > ok to disable/mitigate receive interupts because they are overloading the
> > system but not transmit completion because that will add to the overall
> > latency.
>
> Wrong. Let me introduce you to my 486DX/33. It has PCI. I'm putting my
> gige card into the poor beast. transmitting full out, it can receive a
> sufficiently high number of tx done interrupts that it has no CPU cycles left
> to run, say, gated in userspace.
>

I think you missed my point. i am saying there is more than one source of
interrupt for that same IRQ number that you are indiscriminately shutting
down in a network device.
So, assuming that tx-complete interrupts do actually shut you down
(although i doubt that very much given the classical Donald Becker tx
descriptor pruning), pick another interrupt source; let's say MII link
status; why do you want to kill that when it is not causing any noise but
is a source of good asynchronous information (that could be used for
example in HA systems)?

> Falling back to polled operation is a well known technique in realtime and
> reliable systems. By limiting the interrupt rate to a known safe limit,
> the system will remain responsive to non-interrupt tasks even under heavy
> interrupt loads. This is the point at which a thruput graph on a slow
> machine shows a complete breakdown in performance, which is always possible
> on a slow enough CPU with a high performance device that takes input from
> a remotely controlled user. This is *required*, and is not optional, and
> there is no way that a system can avoid it without making every interrupt
> a task, but that's a mess nobody wants to see in Linux.
>

and what is this "known safe limit"? ;->
What we are providing is actually a scheme to exactly measure that "known
safe limit" you are referring to, without depending on someone having to
tell you "here's a good number for that 8-way Xeon".
If there is system capacity available, why the fsck is it not being used?

cheers,
jamal

2001-10-02 05:14:20

by Benjamin LaHaise

Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

On Mon, Oct 01, 2001 at 09:54:49PM -0400, jamal wrote:
> i am not sure what you are getting at. CPU load is of course a function of
> the CPU capacity. assuming that interupts are the only source of system
> load is just bad engineering.

Indeed. I didn't mean to exclude anything by omission.

> And how does /proc/irq/NR/max_rate solve this?
> I have a feeling you are trying to say that varying /proc/irq/NR/max_rate
> gives opportunity for user processes to execute;
> note, although that is bad logic, you could also modify the high and low
> watermarks for when we have congestion in the backlog queue
> (This is already doable via /proc)

The high and low watermarks are only sufficient if the task the machine is
performing is limited to bh mode operations. What I mean is that user space
can be starved by the cyclic nature of the network queues: they will
eventually be emptied, at which time more interrupts will be permitted.

> It is unfair to add any latency to a device that didnt cause or
> contributre to the havoc.

I disagree. When a machine is overloaded, everything gets slower. But a
side effect of delaying interrupts is that more work gets done for each
irq handler that is run and efficiency goes up. The hard part is balancing
the two in an attempt to achieve a steady rate of progress.

> I think you missed my point. i am saying there is more than one source of
> interupt for that same IRQ number that you are indiscrimately shutting
> down in a network device.

You're missing the effect that irq throttling has: it results in a system
that is effectively running in "polled" mode. Information does get
processed, and thruput remains high; it is just that some additional
latency is added to operations, which is acceptable by definition as
the system is under extreme load.

> So, assuming that tx complete interupts do actually shut you down
> (although i doubt that very much given the classical Donald Becker tx
> descriptor prunning) pick another interupt source; lets say MII link
> status; why do you want to kill that when it is not causing any noise but
> is a source of good asynchronous information (that could be used for
> example in HA systems)?

That information will eventually be picked up. I doubt the extra latency
will be of significant note. If it is, you've got realtime concerns,
which is not our goal to address at this time.


> and what is this "known safe limit"? ;->

It's system dependent. It's load dependent. For a short list of the
factors that you have to include to compute this:

- number of cycles userspace needs to run
- number of cache misses that userspace is forced to
incur due to irq handlers running
- amount of time to dedicate to the irq handler
- variance due to error path handling
- increased system cpu usage due to higher memory load
- front side bus speed of cpu
- speed of cpu
- length of cpu pipelines
- time spent waiting on io cycles
.....

It is non-trivial to determine a limit. And trying to tune a system
automatically is just as hard: which factor do you choose for the system
to attempt to tune itself with? How does that choice affect users who
want to tune for other loads? What if latency is more important than
dropping data?

There are a lot of choices as to how we handle these situations. They
all involve tradeoffs of one kind or another. Personally, I have a
preference towards irq rate limiting as I have measured the tradeoff
between latency and thruput, and by putting that control in the hands of
the admin, the choice that is best for the real load of the system need
not be made at compile time.

If you look at what other operating systems do to schedule interrupts
as tasks and then look at the actual cost, is it really something we
want to do? Linux has made a point of keeping things as simple as
possible, and it has brought us great wins because we do not have the
overhead that other, more complicated systems have chosen. It might
be a loss in a specific case to rate limit interrupts, but if that is
so, just change the rate. What can you say about the dynamic self
tuning techniques that didn't take into account that particular type
of load? Recompiling is not always an option.

> What we are providing is actually a scheme to exactly measure that "known
> safe limit" you are refering to without depending on someone having to
> tell you "here's a good number for that 8 way xeon"
> If there is system capacity available why the fsck is it not being used?

That's a choice for the admin to make. Sometimes having reserves that aren't
used is a safety net that people are willing to pay for. ext2 has by
default a reserve that isn't normally used. Do people complain? No. It
buys several useful features (resistance against fragmentation, space for
daemon temporary files on disk full, ...) that pay dividends worth the cost.

Is irq throttling the be all and end all? No. Can other techniques work
better? Yes. Always? No. And nothing prevents us from using this and
other techniques together. Please don't dismiss it solely because you
see cases that it doesn't handle.

-ben

2001-10-02 05:55:21

by Ben Greear

Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

Benjamin LaHaise wrote:

> You're missing the effect that irq throttling has: it results in a system
> that is effectively running in "polled" mode. Information does get
> processed, and thruput remains high, it is just that some additional
> latency is found in operations. Which is acceptable by definition as
> the system is under extreme load.

So, when you turn off the IRQs, are the drivers somehow made
aware of this so that they can go into polling mode? That might
fix the 10ms latency/starvation problem that bothers me...

Assuming it is fairly easy to put a driver into polling mode, if
you are explicitly told to do so, maybe this generic IRQ coalescing
could be the thing that generically pokes all drivers. Drivers
that are too primitive to understand or deal with polling can just
wait their 10ms, but smarter ones will happily poll away until told
not to by the IRQ load limiter...

> That information will eventually be picked up. I doubt the extra latency
> will be of significant note. If it is, you've got realtime concerns,
> which is not our goal to address at this time.

I'm more worried about dropped pkts. If you can receive 10k packets per second,
then you can receive (lose) 100 packets in 10ms....

>
> > and what is this "known safe limit"? ;->
>
> It's system dependant. It's load dependant. For a short list of the number
> of factors that you have to include to compute this:
>
> - number of cycles userspace needs to run
> - number of cache misses that userspace is forced to
> incur due to irq handlers running
> - amount of time to dedicate to the irq handler
> - variance due to error path handling
> - increased system cpu usage due to higher memory load
> - front side bus speed of cpu
> - speed of cpu
> - length of cpu pipelines
> - time spent waiting on io cycles
> .....

Hopefully, at the very worst, you can have configurables like:
- User-space responsiveness vs. kernel IRQ handling,
  range of 1 to 100, where 100 == userRules.
- Latency: 1: who cares, so long as work happens, to 100: fast and furious, or not at all.

In other words, for god's sake don't make me have to understand how my
cache and CPU pipeline works!! :)


- Another Ben

--
Ben Greear <[email protected]> <[email protected]>
President of Candela Technologies Inc http://www.candelatech.com
ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear

2001-10-02 06:50:19

by Marcus Sundberg

Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

[email protected] (Ingo Molnar) writes:

> (note that in case of shared interrupts, another 'innocent' device might
> stay disabled for some short amount of time as well - but this is not an
> issue because this mitigation does not make that device inoperable, it
> just delays its interrupt by up to 10 msecs. Plus, modern systems have
> properly distributed interrupts.)

Guess my P3-based laptop doesn't count as modern then:

0: 7602983 XT-PIC timer
1: 10575 XT-PIC keyboard
2: 0 XT-PIC cascade
8: 1 XT-PIC rtc
11: 1626004 XT-PIC Toshiba America Info Systems ToPIC95 PCI to Cardbus Bridge with ZV Support, Toshiba America Info Systems ToPIC95 PCI to Cardbus Bridge with ZV Support (#2), usb-uhci, eth0, BreezeCom Card, Intel 440MX, irda0
12: 1342 XT-PIC PS/2 Mouse
14: 23605 XT-PIC ide0

I can't even imagine why they did it like this...

//Marcus
--
---------------------------------+---------------------------------
Marcus Sundberg | Phone: +46 707 452062
Embedded Systems Consultant | Email: [email protected]
Cendio Systems AB | http://www.cendio.com

2001-10-02 12:12:47

by jamal

Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5



On Tue, 2 Oct 2001, Benjamin LaHaise wrote:

> On Mon, Oct 01, 2001 at 09:54:49PM -0400, jamal wrote:
>
> > And how does /proc/irq/NR/max_rate solve this?
> > I have a feeling you are trying to say that varying /proc/irq/NR/max_rate
> > gives opportunity for user processes to execute;
> > note, although that is bad logic, you could also modify the high and low
> > watermarks for when we have congestion in the backlog queue
> > (This is already doable via /proc)
>
> The high and low watermarks are only sufficient if the task the machine is
> performing is limited to bh mode operations. What I mean is that user space
> can be starved by the cyclic nature of the network queues: they will
> eventually be emptied, at which time more interrupts will be permitted.
>

Which is what hardware flow control has been doing since the 2.1 days.

> > It is unfair to add any latency to a device that didnt cause or
> > contributre to the havoc.
>
> I disagree. When a machine is overloaded, everything gets slower. But a
> side effect of delaying interrupts is that more work gets done for each
> irq handler that is run and efficiency goes up. The hard part is balancing
> the two in an attempt to achieve a steady rate of progress.
>

Let me see if i understand this:
- scheme 1: shut down only actions (within a device) that are contributing
to the overload, and only they get affected because they are misbehaving;
when things get better (and we know when they get better), we turn them on
again
- scheme 2: shut down the IRQ, which might in fact include other devices,
for a jiffy or two (which doesn't mean the condition got better)

Are you saying that you disagree that scheme 1 is better?

> > I think you missed my point. i am saying there is more than one source of
> > interupt for that same IRQ number that you are indiscrimately shutting
> > down in a network device.
>
> You're missing the effect that irq throttling has: it results in a system
> that is effectively running in "polled" mode. Information does get
> processed, and thruput remains high, it is just that some additional
> latency is found in operations. Which is acceptable by definition as
> the system is under extreme load.

sure. Just like the giant bottom half lock is acceptable when you can do
fine-grained locking ;->
Don't preach polling to me; i am already a convert and you attended the
presentation i gave. We've had patches for months which have been running
on live systems. We were just waiting for 2.5 ...

>
> > So, assuming that tx complete interupts do actually shut you down
> > (although i doubt that very much given the classical Donald Becker tx
> > descriptor prunning) pick another interupt source; lets say MII link
> > status; why do you want to kill that when it is not causing any noise but
> > is a source of good asynchronous information (that could be used for
> > example in HA systems)?
>
> That information will eventually be picked up. I doubt the extra latency
> will be of significant note. If it is, you've got realtime concerns,
> which is not our goal to address at this time.
>

You are still missing the point (by harping on the literal meaning of the
example i provided); the point is: fine-grained control vs shutting down the
whole IRQ.

>
> > and what is this "known safe limit"? ;->
>
> It's system dependant. It's load dependant. For a short list of the number
> of factors that you have to include to compute this:
>
> - number of cycles userspace needs to run
> - number of cache misses that userspace is forced to
> incur due to irq handlers running
> - amount of time to dedicate to the irq handler
> - variance due to error path handling
> - increased system cpu usage due to higher memory load
> - front side bus speed of cpu
> - speed of cpu
> - length of cpu pipelines
> - time spent waiting on io cycles
> .....
>
> It is non-trivial to determine a limit. And trying to tune a system
> automatically is just as hard: which factor do you choose for the system
> to attempt to tune itself with? How does that choice affect users who
> want to tune for other loads? What if latency is more important than
> dropping data?
>
> There are a lot of choices as to how we handle these situations. They
> all involve tradeoffs of one kind or another. Personally, I have a
> preference towards irq rate limiting as I have measured the tradeoff
> between latency and thruput, and by putting that control in the hands of
> the admin, the choice that is best for the real load of the system is
> not made at compile time.
>
> If you look at what other operating systems do to schedule interrupts
> as tasks and then looks at the actual cost, is it really something we
> want to do? Linux has made a point of keeping things as simple as
> possible, and it has brought us great wins because we do not have the
> overhead that other, more complicated systems have chosen. It might
> be a loss in a specific case to rate limit interrupts, but if that is
> so, just change the rate. What can you say about the dynamic self
> tuning techniques that didn't take into account that particular type
> of load? Recompiling is not always an option.
>

I am not sure where you are getting the opinion that there is recompiling
involved, or how what we have is complex (the patch is much smaller than
what Ingo posted).
And no, you don't need to maintain any state for all those things in your
list; in 2.4, which is a good start, you have the system load being probed
via a second-order effect, i.e. the growth rate of the backlog queue is a
good indicator that the system is not pulling packets off fast enough.
This is a very good measure of _all_ those items on your list. I am not
saying it is God's answer, merely pointing out that it is a good indicator
which doesn't need to maintain state for 1000 items or cause additional
computations on the datapath.
We get a fairly early warning that we are about to be overloaded. We can then
shut off the offending device's _receive_ interrupt source when it doesn't
heed the congestion notification advice we've been giving it. It heeds the
advice by mitigating.
In the 2.5 patch (i should say it is a clean patch to 2.4 actually and is
backward compatible) we worked around the fact that the 2.4 solution
requires a specific NIC feature (mitigation), among a lot of other things.
In fact we have already proven that mitigation is only good when you have
one or two NICs on the system.

> > What we are providing is actually a scheme to exactly measure that "known
> > safe limit" you are refering to without depending on someone having to
> > tell you "here's a good number for that 8 way xeon"
> > If there is system capacity available why the fsck is it not being used?
>
> That's a choice for the admin to make. Sometimes having reserves that aren't
> used is a safety net that people are willing to pay for. ext2 has by
> default a reserve that isn't normally used. Do people complain? No. It
> buys several useful features (resistance against fragmentation, space for
> daemon temporary files on disk full, ...) that pay dividends of the cost.
>

I am not sure whether you are trolling or not. We are talking about a
work-conserving principle and you compare it to a reservation
system.

> Is irq throttling the be all and end all? No. Can other techniques work
> better? Yes. Always? No. And nothing prevents us from using this and
> other techniques together. Please don't dismiss it solely because you
> see cases that it doesn't handle.
>

I am not dismissing the whole patch. I most definitely dismiss those
two ideas i pointed out.

cheers,
jamal

2001-10-02 14:25:55

by Alan

Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

> I'm all for anything that speeds up (and makes more reliable) high network
> speeds, but I often run with 8+ ethernet devices, so IRQs have to be shared,
> and a 10ms lockdown on an interface could lose lots of packets. Although
> it's not a perfect solution, maybe you could (in the kernel) multiple the
> max by the number of things using that IRQ? For example, if you have four
> ethernet drivers on one IRQ, then let that IRQ fire 4 times faster than
> normal before putting it in lockdown...

What you really care about is limiting the total amount of CPU time used for
interrupt processing so that usermode progress is made. The network layer
shows this up particularly badly because (and it's kind of hard to avoid this)
it frees resources on the hardware before userspace has processed them.

Silencing a specific target cannot be done by IRQ masking; you have to
ask the controller to shut up. It may be that the default "shut up" handler
is disable_irq, but that is non-optimal.

Having driver callbacks as part of the irq handler also massively improves
the effect of the event, because faced with an IRQ storm a card can

- decide if it is the guilty party

If so

- consider switching to polled mode
- change its ring buffer size to reduce IRQ load and up latency
as a tradeoff
- anything else magical the hardware has (like retuning irq
mitigation registers)
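
Nothing like this exists in the stock kernel; purely as a sketch of the
shape such a callback could take (all names hypothetical):

/* hypothetical per-device "shut up" hook, invoked by the irq layer when it
   decides an interrupt line is storming: */
struct irq_storm_ops {
        int  (*is_guilty)(void *dev_id);        /* is this card causing the storm? */
        void (*quiesce)(void *dev_id);          /* driver's own response: switch to
                                                   polling, resize the RX ring,
                                                   retune hw mitigation registers */
};

/* fallback when a driver registers nothing - the blunt, non-optimal
   default mentioned above: */
static void default_quiesce(unsigned int irq)
{
        disable_irq_nosync(irq);
}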

Alan


2001-10-02 17:00:52

by Robert Olsson

Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5



Hello!

Jamal mentioned some of the polling efforts for Linux. I can give some
experimental data here with GigE. Motivation, implementation etc. are in a
paper to be presented at USENIX Oakland.

Below is an IP forwarding test. Injected 10 million 64-byte packets into eth0
at a speed of 890,000 p/s, received, routed and TX:ed on eth1.

PIII @ 933 MHz. Kernel UP 2.4.10 with the polling patch; the NICs are e1000,
eth0 (irq=24) and eth1 (irq=26).


Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flags
eth0 1500 0 4031309 7803725 7803725 5968699 22 0 0 0 BRU
eth1 1500 0 18 0 0 0 4031305 0 0 0 BRU


The RX-ERR, RX-DRP are bugs from the e1000 driver. Anyway we are getting 40%
of the packet storm routed, with an estimated throughput of about 350,000 p/s.

irq CPU0
24: 80652 IO-APIC-level e1000
26: 41 IO-APIC-level e1000

For RX (polling) we use only irq 24's interrupts. TX is mitigated (not polled)
in this run. We see a lot more interrupts for the same amount of packets. I think
we can actually tune this a bit... And I should also say that RxIntDelay=0
(e1000 driver), so there is no latency before the driver registers for polling
with the kernel.

USER PID %CPU %MEM SIZE RSS TTY STAT START TIME COMMAND
root 3 3.0 0.0 0 0 ? SWN 12:51 0:11 (ksoftirqd_CPU0)

The polling (softirq) is now handled by ksoftirqd, but I have seen a patch
from Ingo that schedules it without the need for ksoftirqd.

Also note that during poll we disable only RX interrupts so all other device
interrupts/functions are handled properly.

And tulip variants of this are in production use and seem very solid. The
kernel code part carries the ANK trademark. :-)

Cheers.

--ro

2001-10-02 17:39:56

by jamal

Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5



Some data on lesser-worthy cards (i.e. 10/100) on 2.4.7
can be found at:
http://www.cyberus.ca/~hadi/247-res/

cheers,
jamal

2001-10-02 19:48:33

by Andreas Dilger

Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

On Oct 02, 2001 19:03 +0200, Robert Olsson wrote:
> Jamal mentioned some about the polling efforts for Linux. I can give some
> experimental data here with GIGE. Motivation, implantation etc is in paper
> to presented at USENIX Oakland.

How do you determine the polling rate? I take it that this is a different
patch than Ingo's?

> Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flags
> eth0 1500 0 4031309 7803725 7803725 5968699 22 0 0 0 BRU
> eth1 1500 0 18 0 0 0 4031305 0 0 0 BRU
>
> The RX-ERR, RX-DRP are bugs from the e1000 driver. Anyway we getting 40% of
> packet storm routed. With a estimated throughput is about 350.000 p/s

Are you sure they are "bugs" and not dropped packets? It seems to me that
RX-ERR == RX-DRP, which would suggest that the receive buffers are full
on the card and are not being emptied quickly enough (or maybe that is
indicated by RX-OVR...). I don't know whether it is _possible_ to empty
the buffers quickly enough; I suppose CPU usage info would also shed some
light on that.

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2001-10-02 20:37:31

by Ingo Molnar

Subject: [patch] auto-limiting IRQ load, IRQ-polling, irq-rewrite-2.4.11-D9


On Mon, 1 Oct 2001, Linus Torvalds wrote:

> And how do you select max_rate sanely? [...]

> Saying "hey, that's the users problem", is _not_ a solution. It needs
> to have some automatic cut-off that finds the right sustainable rate
> automatically, instead of hardcoding random default values and asking
> the user to know the unknowable.

good point. I did not ignore this problem, i was just unable to find any
solution that felt robust, so i convinced myself that max_rate is the best
idea :-)

but a fresh start today, and a good idea from Arjan resulted in a pretty
good 'irq load' estimator that implements the above cut-off dynamically:

the method is to detect in do_IRQ() whether we have interrupted an
interrupt context of any sort or not. The number of 'total' and 'irq'
interruptions are counted, and the 'irq load' is "irq / total". The irq
code goes into per-irq and per-cpu 'overload mode' if the 'irq load' is
higher than ~97%.
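
roughly, the bookkeeping looks like this (a simplified, per-cpu-only sketch;
the real patch also keeps per-irq counters, and the names below are
illustrative):

/* per-cpu sample counters - layout is illustrative only */
static struct {
        unsigned int total;     /* all interruptions seen */
        unsigned int in_irq;    /* interruptions of irq/softirq context */
} irq_load[NR_CPUS];

/* called at the top of do_IRQ(), before irq_enter(): */
void irq_load_sample(int cpu)
{
        irq_load[cpu].total++;
        if (in_interrupt())             /* we interrupted interrupt context */
                irq_load[cpu].in_irq++;
}

/* evaluated periodically, eg. from the timer tick: */
int irq_overloaded(int cpu)
{
        /* overloaded if ~97% of all interruptions landed on interrupt
           context, ie. non-irq code is making next to no progress: */
        return irq_load[cpu].in_irq * 100 > irq_load[cpu].total * 97;
}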

There is one case of 'underestimation': hardirqs that do not enable
interrupts (SA_INTERRUPT handlers) will not be interrupted. Fortunately,
99.9% of the network drivers (and other, high-rate irq drivers) enable
interrupts in their hardirq handlers. The only significant SA_INTERRUPT
user is the SCSI layer, which is not an 'external' device.

but in the loads we care about, the irqs are irq-enabled and trigger
softirqs - both are measured precisely via the in_interrupt() method.

this load calculation method has a few 'false triggers': eg.
syscall-level code that uses local_bh_disable() will be counted as irq
context - but such code is not very common, and overestimating the irq
load slightly is always better than underestimating it.

Other estimators, like context switches and rdtsc have more serious
problems i believe. Context switches are imo not a 'perfect' indicator of
'progress', in a number of important situations. Eg. when a userspace
process is not scheduling at all. There is no other indicator of progress
in this case but the fact that it's userspace that we interrupt.

RDTSC, while a 'perfect' indicator of actual system, irq and user load, is
not generic and adds quite some overhead to the lowlevel code. The only
'generic' and accurate time measurement method, do_gettimeofday(), has way
too much overhead to be used in lowlevel irq code.

another advantage of the in_interrupt() method is fine-grained metrics:
the counters are per-irq and per-cpu, so low-frequency interrupts like the
keyboard interrupt or mouse interrupt are much less likely to be mitigated
needlessly. Separate mitigation does not mean the global effects will not
be measured correctly: if eg. 16 devices all produce a 10k irqs/sec load
(which, in isolation, is not enough to trigger an overload), they together
will starve non-irq contexts and will cause an overload in all 16 active
irq handlers.


unfortunately there is also another, new problem, which got reported and
which i can reproduce as well: due to the milliseconds-long disabling of
ethernet interrupts, the receiver can overflow easily, and produces
overruns and lost packets. The result of this phenomenon is an effectively
frozen network: no incoming or outgoing TCP connection ever makes any
reasonable progress. So by auto-mitigation alone we only exchanged a 'box
lockup' for a 'network connection' lockup - a different kind of DoS
but still a DoS.

the new patch (attached) provides a solution for this problem too, by
introducing a hardirq-polling kernel thread: 'kpolld'. (kpolld is
significantly different from ksoftirqd: it gets only triggered in truly
hopeless situations, and it handles hardirq load in such cases. I've never
seen it run under any 'good' loads i care about. Plus, polling can have
significant performance advantages in dedicated networking environments.)

while this inevitably caused the introduction of a device-polling
framework, it's hard to separate the two things - auto-mitigation alone is
not useful without going into poll mode, unfortunately. Only the
networking code uses the polling framework currently.

Another option would be to use the interrupt handlers themselves to do the
polling - but this puts certain assumptions into existing IRQ handlers,
which we cannot do for 2.4 i believe. Plus, the ->poll_controller() driver
extension is also used by the netconsole, so we could get free testing for
it :-) Another reason is that i think subsystems should have close control
over the way they do polling. There are also a few examples of
'synchronous polling' points i added to the networking code: the device
will only be polled once, and only if we are in IRQ overload mode.
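
the driver side really is small; a hypothetical driver could implement the
hook by simply reusing its normal interrupt handler, along these lines (the
exact hook signature and registration are an assumption here):

static void mynic_interrupt(int irq, void *dev_id, struct pt_regs *regs);

/* advance RX/TX state when kpolld (or a synchronous polling point) calls us: */
static void mynic_poll_controller(struct net_device *dev)
{
        disable_irq(dev->irq);                  /* keep the real handler out */
        mynic_interrupt(dev->irq, dev, NULL);   /* the usual rx/tx processing */
        enable_irq(dev->irq);
}

static int mynic_open(struct net_device *dev)
{
        dev->poll_controller = mynic_poll_controller;   /* assumed hook field */
        /* ... normal open code ... */
        return 0;
}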

about performance: while it certainly can be tuned further, the estimator
works pretty well under the loads i tested. 97% proved to be a reasonable
limit, which i was unable to reach via 'normal' loads - it took dedicated
tools like udpspam to trigger the overload. In overload mode, performance
is still pretty good, and TCP connections over the network are snappy. But
it would be very nice if those who reported packet drops and bad network
performance when using yesterday's patch could re-test the same load
situation with this patch applied to 2.4.11-pre2.

note: i still kept max_rate, but it's now scaled with cpu_khz - a 300 MHz
box will have a default value of 30k irqs/sec, a 1 GHz box will have a
100k irqs/sec limit. This limit still has the advantage of potentially
catching runaway devices that cause irq storms. Another reason to keep
max_rate was to enable router & appliance vendors to set it to a low value
to force the system into 'polling mode'. For dedicated boxes this makes
perfect sense.

note2: the patch includes the eepro100 driver patches from the -ac tree as
well (Arjan's receiver(?)-hangup fix) - without those fixes i could easily
hang my eepro100 cards after a few minutes of extreme load.

the patch can also be downloaded from:

http://redhat.com/~mingo/irq-rewrite/

i've stress-tested the patch on 2.4.11-pre1, on UP-PIC, UP-APIC, and
SMP-APIC systems.

Comments, testing feedback welcome,

Ingo


Attachments:
irq-rewrite-2.4.11-D9.bz2 (12.90 kB)

2001-10-02 20:53:53

by Ingo Molnar

Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5


On Tue, 2 Oct 2001, Alan Cox wrote:

> What you really care about is limiting the total amount of CPU time
> used for interrupt processing so that usermode progress is made.
> [...]

exactly. The estimator in -D9 tries to achieve precisely this, both
hardirqs and softirqs are measured.

> Silencing a specific target cannot be done by IRQ masking, you have to
> ask the controller to shut up. It may be the default "shut up" handler
> is disable_irq but that is non optimal.

this could be done later on, but i think this is out of the question for 2.4,
as it needs extensive changes in the irq handler and network driver APIs.

Ingo

2001-10-02 22:02:38

by jamal

Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5


Ingo Molnar <[email protected]> wrote:

>> Silencing a specific target cannot be done by IRQ masking, you have to
>> ask the controller to shut up. It may be the default "shut up" handler
>> is disable_irq but that is non optimal.

>this could be done later on, but i think this is out of question for 2.4,
>as it needs extensive changes in irq handler and network driver API.

This is already done in the current NAPI patch, which you should have seen
by now. NAPI is backward compatible: it would work just fine with 2.4 and
drivers can be upgraded slowly.
If there's anything that should make it into 2.4 then it should be NAPI
(with some components from your code that still need to be proven under
different workloads).

>> And how do you select max_rate sanely? [...]

>> Saying "hey, that's the users problem", is _not_ a solution. It needs
>> to have some automatic cut-off that finds the right sustainable rate
>> automatically, instead of hardcoding random default values and asking
>> the user to know the unknowable.

>good point. I did not ignore this problem, i was just unable to find any
>solution that felt robust, so i convinced myself that max_rate is the
>best idea :-)

if you haven't taken a look at NAPI, please do so instead of creating these
nightly brainstorm patches. With all due respect, if you insist on doing
that, please have the courtesy of at least posting results/numbers of how
this improved things, and under what workloads and conditions.
I do believe that some of the pieces of what you have would help -- in
conjunction with NAPI.
A scenario where we have ksoftirqd appearing, then disappearing, and then
a new kpolld showing up just indicates very bad engineering/juju which
seems to be based on pulling tricks out of a hat.

Let's work together instead of creating chaos.

cheers,
jamal

2001-10-03 08:36:44

by Ingo Molnar

Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5


On Tue, 2 Oct 2001, jamal wrote:

> [...] please have the courtesy of at least posting results/numbers of
> how this improved things and under what workloads and conditions.
> [...]

500 MHz PIII UP server, 433 MHz client over a single 100 mbit ethernet
using Simon Kirby's udpspam tool to overload the server. Result: 2.4.10
locks up before the patch. 2.4.10 with the first generation irqrate patch
applied protects against the lockup (if max_rate is correct), but results
in dropped packets. The auto-tuning+polling patch results in a working
system and working network, no lockup and no dropped packets. Why this
happened and how it happened has been discussed extensively.

(the effect of polling-driven networking is just an extra and unintended
bonus side-effect.)

Ingo

2001-10-03 08:40:44

by Ingo Molnar

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5


On Tue, 2 Oct 2001, jamal wrote:

> You are still missing the point (by harping on the literal meaning of
> the example i provide), the point is: fine grained vs shutting down
> the whole IRQ.

i'm convinced that this is a minor detail.

there are *tons* of disadvantages if IRQs are shared. In any
high-performance environment, not having enough interrupt sources is a
sizing or hw design mistake. You can have up to 200 interrupts even on a
PC, using multiple IO-APICs. Any decent server board distributes interrupt
sources properly. Shared interrupts are a legacy of the PC design, and we
are moving away from it slowly but surely. Especially under gigabit loads
there are several PCI busses anyway, so getting non-shared interrupts is
not only easy but a necessity as well. There is no law in physics that
somehow mandates or prefers the sharing of interrupt vectors: devices are
distinct, they use up distinct slots in the board. The PCI bus can get
multiple IRQ sources out of a single card, so even multi-controller cards
are covered.

i fully agree that both the irq code and drivers themselves have to handle
shared interrupts correctly, and we should not penalize shared interrupts
unnecessarily, but do they have to influence our design decisions too
much? Nope.

Ingo

2001-10-03 08:49:54

by Ingo Molnar

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5


On 2 Oct 2001, Marcus Sundberg wrote:

> Guess my P3-based laptop doesn't count as modern then:
>
> 0: 7602983 XT-PIC timer
> 1: 10575 XT-PIC keyboard
> 2: 0 XT-PIC cascade
> 8: 1 XT-PIC rtc
> 11: 1626004 XT-PIC Toshiba America Info Systems ToPIC95 PCI to
> Cardbus Bridge with ZV Support, Toshiba America Info Systems ToPIC95
> PCI to Cardbus Bridge with ZV Support (#2), usb-uhci, eth0, BreezeCom
> Card, Intel 440MX, irda0

ugh!

> I can't even imagine why they did it like this...

well, you aren't going to be using it as a webserver i guess? :) But the
costs on desktops are minimal. It's the high-irq-rate server environments
that want separate irq sources.

Ingo

2001-10-03 09:24:39

by Ingo Molnar

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5


On Mon, 1 Oct 2001, Ben Greear wrote:

> So, when you turn off the IRQs, are the drivers somehow made aware of
> this so that they can go into polling mode? That might fix the 10ms
> latency/starvation problem that bothers me...

the latest -D9 patch does this. If drivers provide a (backwards
compatible) ->poll_controller() call then they can be polled by kpolld.
There are also a few points within the networking code that trigger a poll
pass, to make sure events are processed even if networking-intensive
applications take away all CPU time from kpolld. The device is only polled
if the IRQ is detected to be in overload mode. IRQ-overload protection
does not depend on the availability of the
->poll_controller(). The poll_controller() call is very simple for most
drivers. (It has to be per-driver, because not all drivers advance their
state purely via their device interrupts.)
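
(for illustration only - a minimal sketch of what such a hook could look
like for a hypothetical driver; the mydrv_* names are invented and this is
not code from the -D9 patch:)

    /* run the driver's normal interrupt path with the irq line masked,
       so the kernel can advance the driver state without a hardware
       interrupt ever being raised */
    static void mydrv_poll_controller(struct net_device *dev)
    {
            disable_irq(dev->irq);
            mydrv_interrupt(dev->irq, dev, NULL);   /* the usual handler */
            enable_irq(dev->irq);
    }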

but kpolld itself and auto-mitigation are not limited to networking - any
other driver framework that has high-irq-load problems can use them.

> I'm more worried about dropped pkts. If you can receive 10k packets
> per second, then you can receive (lose) 100 packets in 10ms....

yep - this does not happen anymore, at least under the loads i tested
which otherwise choke a purely irq-driven machine. (It will happen in a
gradual way if load is increased further, but that is natural.)

Ingo

2001-10-03 09:29:49

by Helge Hafting

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

Ingo Molnar wrote:

> 500 MHz PIII UP server, 433 MHz client over a single 100 mbit ethernet
> using Simon Kirby's udpspam tool to overload the server. Result: 2.4.10
> locks up before the patch. 2.4.10 with the first generation irqrate patch
> applied protects against the lockup (if max_rate is correct), but results
> in dropped packets. The auto-tuning+polling patch results in a working
> system and working network, no lockup and no dropped packets. Why this
> happened and how it happened has been discussed extensively.

I hope we get some variant of this in 2.4. A device callback
stopping rx interrupts only is of course even better, but
won't that be 2.5 stuff?

Helge Hafting

2001-10-03 09:41:01

by Ingo Molnar

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5


On Tue, 2 Oct 2001, jamal wrote:

> This already is done in the current NAPI patch which you should have
> seen by now. [...]

(i searched the web and mailing list archives and haven't found it (in fact
this is the first mention i saw) - could you give me a link so i can take
a look at it? I just found your slides but no link to actual code.
Thanks!)

but the objectives, judging from the description you gave, are i think
largely orthogonal, with some overlapping in the polling part. The polling
part of my patch is just a few quick lines here and there and it's not
intrusive at all. I needed it to make sure all problems are solved and
that the system & network is actually usable in overload situations.

you i think are concentrating on router performance (i'd add dedicated
networking appliances to the list), using cooperative drivers. i'm trying to
solve a DoS attack against 2.4 boxes, and i'm trying to guarantee the
uninterrupted (pun unintended) functioning of the system from the point of
view of the IRQ handler code.

Ingo

2001-10-03 12:52:37

by jamal

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5




On Wed, 3 Oct 2001, Ingo Molnar wrote:

>
> On Tue, 2 Oct 2001, jamal wrote:
>
> > [...] please have the courtesy of at least posting results/numbers of
> > how this improved things and under what workloads and conditions.
> > [...]
>
> 500 MHz PIII UP server, 433 MHz client over a single 100 mbit ethernet
> using Simon Kirby's udpspam tool to overload the server. Result: 2.4.10
> locks up before the patch. 2.4.10 with the first generation irqrate patch
> applied protects against the lockup (if max_rate is correct), but results
> in dropped packets. The auto-tuning+polling patch results in a working
> system and working network, no lockup and no dropped packets. Why this
> happened and how it happened has been discussed extensively.
>
> (the effect of polling-driven networking is just an extra and unintended
> bonus side-effect.)
>

This is insufficient and, no pun intended, you must be joking if you
intend to put this patch into the kernel based on these observations.

For sample data look at: http://www.cyberus.ca/~hadi/247-res/
We've been collecting data for about a year and fixing the patches and we
still don't think we cover the full range (hopefully other people will help
with that when we merge).

You don't need the patch for 2.4 to work against any lockups. And
in fact i am surprised that you observe _any_ lockups on a PIII which are
not observed on my PII. Linux as is, without any tune-ups, can handle
up to about 40000 packets/sec input before you start observing user space
starvation. This is about 30Mbps at 64-byte packets; it's about 60Mbps at
128-byte packets and comfortable at 100Mbps with 256-byte packets. We
really don't have a problem at 100Mbps.

There are several solutions in 2.4 and i suggest you try those first:

1) Hardware flow control, which has been around since 2.1.
First you need to register callbacks to throttle your device on/off.
Typically the xoff() callback will involve the driver turning off the
receive and receive_nobuf interrupt sources and the xon() callback will
undo this. The network subsystem observes congestion levels by the size of
the backlog queue: it shuts off devices when it is overloaded and
unthrottles them when conditions get better.

2) An upgrade to the above, introduced in 2.4:
Instead of waiting until you get shut off because of an overloaded
system, you could do something about it... use the return values from
netif_rx to make decisions. The return value indicates whether the system
is getting congested or not. The value is computed based on a moving
window average of the backlog queue and so is a pretty good reflection
of congestion levels. Typical uses of the return value are to tune the
mitigation registers. If the congestion thresholds are approaching a high
watermark, you back off, and if they indicate things are getting
better, you increase your packet rate to the stack (see the sketch below).
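
(a rough sketch of 2) in a hypothetical rx path - the mydrv_* helpers are
invented and stand in for whatever mitigation knob the hardware has; only
the netif_rx() return values are from the kernel:)

    switch (netif_rx(skb)) {
    case NET_RX_SUCCESS:
    case NET_RX_CN_LOW:
            mydrv_relax_mitigation(dev);     /* headroom left, favour latency */
            break;
    case NET_RX_CN_MOD:
    case NET_RX_CN_HIGH:
            mydrv_increase_mitigation(dev);  /* batch more packets per interrupt */
            break;
    case NET_RX_DROP:
            mydrv_throttle_rx(dev);          /* the stack dropped the packet, back off hard */
            break;
    }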

since you seem to be unaware of the above, i would suggest you try them
out first.

NAPI builds upon the above and introduces a more generic solution.

cheers,
jamal


2001-10-03 13:05:57

by jamal

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5



On Wed, 3 Oct 2001, Ingo Molnar wrote:

>
> On Tue, 2 Oct 2001, jamal wrote:
>
> > This already is done in the current NAPI patch which you should have
> > seen by now. [...]
>

The paper is at: http://www.cyberus.ca/~hadi/usenix-paper.tgz
Robert can point you to the latest patches.

>
> but the objectives, judging from the description you gave, are i think
> largely orthogonal, with some overlapping in the polling part.

yes. We've done a lot of thoroughly thought-out work in that area and i think
it would be a sin to throw it out.

> The polling
> part of my patch is just a few quick lines here and there and it's not
> intrusive at all.

NAPI is not intrusive either, it is backward compatible.

> I needed it to make sure all problems are solved and
> that the system & network is actually usable in overload situations.
>

And you can; look at my previous email. I would rather patch 2.4 to use
NAPI than see your patch in there.

> you i think are concentrating on router performance (i'd add dedicated
> networking appliances to the list), using cooperative drivers. I trying to
> solve a DoS attack against 2.4 boxes, and i'm trying to guarantee the
> uninterrupted (pun unintended) functioning of the system from the point of
> the IRQ handler code.

No. NAPI is for any type of network activity, not just for routers or
sniffers. It works just fine with servers. What do you see in there that
will make it not work with servers?

cheers,
jamal

2001-10-03 13:28:08

by jamal

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5



On Wed, 3 Oct 2001, jamal wrote:

>
>
> On Wed, 3 Oct 2001, Ingo Molnar wrote:
>
> >
> > but the objectives, judging from the description you gave, are i think
> > largely orthogonal, with some overlapping in the polling part.
>
> yes. Weve done a lot of thoroughly thought work in that area and i think
> it will be a sin to throw it out.
>

I hit the send button too fast..
The dynamic irq limiting (it must not be set by a system admin, to preserve
the work-conserving principle) could be used as a last resort. The point is, if
you are not generating a lot of interrupts to begin with (as is the case
with NAPI), i don't see the irq rate limiting kicking in at all. Maybe for
broken drivers and perhaps for devices other than those within the
network subsystem (i think we've pretty much taken care of the network
subsystem). But you must fix the irq sharing issue first and be
able to precisely isolate and punish the rude devices.

cheers,
jamal

2001-10-03 13:36:08

by Robert Olsson

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5



jamal writes:

> The paper is at: http://www.cyberus.ca/~hadi/usenix-paper.tgz
> Robert can point you to the latest patches.


Current code... there are still some parts we would like to do better.

Available via ftp from robur.slu.se:/pub/Linux/net-development/NAPI/
2.4.10-poll.pat

The original code:

ANK-NAPI-tulip-only.pat
ANK-NAPI-kernel-only.pat

And for GigE there is an e1000 driver in test.

Cheers.

--ro



2001-10-03 14:07:43

by David Brownell

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

USB 2.0 host controllers (EHCI) support a kind of hardware-level
interrupt mitigation, whereby a register controls interrupt
latency. The controller can delay interrupts by 1-64 microframes,
where a microframe = 125 usec, and the current driver defaults that
latency to 1 microframe (best overall performance) but lets it be set
via a module parameter.
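
(as a sketch of the knob only - the parameter name below is invented, not
necessarily what the driver actually calls it:)

    /* 1 microframe = 125 usec, so the 1..64 range spans 125 usec to 8 msec
       of added worst-case interrupt latency */
    static int irq_latency_uframes = 1;
    MODULE_PARM(irq_latency_uframes, "i");
    MODULE_PARM_DESC(irq_latency_uframes,
                     "interrupt threshold, in 125 usec microframes (1-64)");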

I've only read the discussion via archive, so I might have missed
something, but I didn't see what I had hoped to see: a feedback
mechanism so drivers (PCI in the case of EHCI) can learn that
decreasing the IRQ rate would be good, or later that it's OK to
increase it again. (Seems like Alan Cox suggested as much too ...)

I saw several suggestions specific to the networking layer,
but I'd sure hope to see mechanisms in place that work for
non-network drivers. Someday; right now high-speed USB
devices (480 MBit/sec) aren't common yet, mostly disks, and
motherboard chipsets don't yet support it.

- Dave



2001-10-03 14:16:13

by Manfred Spraul

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

> On Wed, 3 Oct 2001, jamal wrote:
> > On Wed, 3 Oct 2001, Ingo Molnar wrote:
> > >
> > > but the objectives, judging from the description you gave, are i
> > > think largely orthogonal, with some overlapping in the polling
> > > part.
> >
> > yes. Weve done a lot of thoroughly thought work in that area and i
> > think it will be a sin to throw it out.
> >
>
> I hit the send button to fast..
> The dynamic irq limiting (it must not be set by a system admin to
> conserve the principle of work) could be used as a last resort.
> The point is, if you are not generating a lot of interupts to begin
> with (as is the case with NAPI), i dont see the irq rate limiting
> kicking in at all.

A few notes from the perspective of low-end NICs:

Forcing an irq limit without asking the driver is bad - it must be the
other way around.
e.g. the winbond nic contains a bug that forces it to one interrupt per
transmitted packet, but I can switch to rx polling/mitigation.
I'm sure the ne2k-pci users would also complain if a fixed irq limit is
added - I bet the majority of the drivers perform worse with a fixed
limit, only some perform better, and most perform best if they are given
a notice that they should reduce their irq rate. (e.g. disable
rx_packet and tx_packet, leave the error interrupts on, and do the
rx_packet/tx_packet work in the poll handler.)
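
(as a sketch of that kind of reaction - the register and bit names below
are invented, not a real chip:)

    /* on a "reduce your irq rate" hint: mask the per-packet interrupt
       sources, keep error interrupts enabled, and let the poll handler
       drain the rx/tx rings instead */
    static void mydrv_set_mitigation(struct mydrv_priv *np, int on)
    {
            u32 mask = MYDRV_INTR_ERRORS;

            if (!on)
                    mask |= MYDRV_INTR_RX_DONE | MYDRV_INTR_TX_DONE;
            writel(mask, np->base + MYDRV_INTR_MASK);
    }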

But a hint for the driver ("now switch mitigation on/off") seems to be a
good idea. And that hint should not be the return value of netif_rx -
what if the driver is only sending packets?
What if it's not even a network driver?

NAPI seems to be very promising to fix the total system overload case
(so many packets arrive that despite irq mitigation the system is still
overloaded).

But the implementation of irq mitigation is driver specific, and a 10
millisecond stop is far too long.

--
Manfred



2001-10-03 14:53:22

by Ingo Molnar

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5


On Wed, 3 Oct 2001, jamal wrote:

> You dont need the patch for 2.4 to work against any lockups. And
> infact i am suprised that you observe _any_ lockups on a PIII which
> are not observed on my PII. [...]

as mentioned before, it's dead easy to lock up current kernels via high
enough networking irq/softirq load:

box:~> wc -l udpspam.c
131 udpspam.c

box:~> ./udpspam 10.0.3.4

10.0.3.4 is running vanilla 2.4.11-pre2 UP, a 466 MHz PII box with enough
RAM, using eepro100. The system effectively locks up - even in the full
knowledge of what is happening, i can hardly switch consoles, let alone do
anything like ifconfig eth0 down to fix the lockup. While this kind of
load is present the only option is to power-cycle the box. SysRq does not
work.

(ask Simon for the code.)

and frankly, this has been well-known for a long time - it's just since
Simon sent me this testcode that i realized how trivial it is. Alexey told
me about Linux routers effectively locking up if put under 100 mbit IRQ
load more than a year ago, when i first tried to fix softirq latencies. I
think if you are doing networking patches then you should be aware of it
as well.

your refusal to accept this problem as an existing and real problem is
really puzzling to me.

Ingo

2001-10-03 14:54:04

by Ingo Molnar

[permalink] [raw]
Subject: [patch] auto-limiting IRQ load take #2, irq-rewrite-2.4.11-F4


the attached patch contains a cleaned-up version of IRQ auto-mitigation.

- i've removed the max_rate limit and have streamlined the impact of the
load-estimator on do_IRQ() to this piece of code:

desc->total_contexts++;
if (unlikely(in_interrupt()))
goto mitigate_irqload;

i don't think we can get much cheaper than this. (We could perhaps avoid
the total_contexts counter by saving a 'snapshot' of the existing
kstat.irqs array of counters on every timer tick and comparing the
snapshot to the current kstat.irqs values. That looked pretty unclean
though.) A rough sketch of the full path follows below.

- the per-cpu irq counting in -D9 was incorrect as it collapsed all irq
handlers into a single counter.

- i've removed the net-polling hacks - they are unrelated to this problem.
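
(to make the shape of the mechanism concrete, a rough sketch of the full
path - only the lines quoted above are taken from the patch; the
mitigate_irqload side is a guess at the idea, not the actual code:)

    /* in do_IRQ(): every interrupt bumps the per-irq context count; if a
       new hardirq arrives while the CPU is still in hard/softirq context,
       user space is making no progress, so treat the source as overloaded */
    desc->total_contexts++;
    if (unlikely(in_interrupt()))
            goto mitigate_irqload;
    ...
    mitigate_irqload:
            /* mask the source in the interrupt controller and remember the
               pending interrupt; a later timer tick unmasks it again, so the
               device is delayed on the order of a tick (~10 msec at HZ=100),
               not disabled for good */
            desc->status |= IRQ_MITIGATED | IRQ_PENDING;
            __disable_irq(desc, irq);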

the patch is against 2.4.11-pre2. (the eepro100.c fixes from the -ac tree
are already included in -pre2, i only included them in this patch to make
patching & testing against 2.4.10 easier.)

(i'd like to stress the point again that the goal of this approach is
*not* to be nice. This is an airbag mechanism, it can and will hurt
performance. But my box does not lock up anymore.)

Ingo


Attachments:
irq-rewrite-2.4.11-F4.bz2 (10.15 kB)

2001-10-03 15:12:13

by jamal

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5



On Wed, 3 Oct 2001, Manfred Spraul wrote:

> > On Wed, 3 Oct 2001, jamal wrote:
> > > On Wed, 3 Oct 2001, Ingo Molnar wrote:
> > > >
> > > > but the objectives, judging from the description you gave, are i
> > > > think largely orthogonal, with some overlapping in the polling
> > > > part.
> > >
> > > yes. Weve done a lot of thoroughly thought work in that area and i
> > > think it will be a sin to throw it out.
> > >
> >
> > I hit the send button to fast..
> > The dynamic irq limiting (it must not be set by a system admin to
> > conserve the principle of work) could be used as a last resort.
> > The point is, if you are not generating a lot of interupts to begin
> > with (as is the case with NAPI), i dont see the irq rate limiting
> > kicking in at all.
>
> A few notes as seen for low-end nics:
>
> Forcing an irq limit without asking the driver is bad - it must be the
> opposite way around.
> e.g. the winbond nic contains a bug that forces it to 1 interrupt/packet
> tx, but I can switch to rx polling/mitigation.

Indeed this is a weird case that we have not encountered but it does make
the point that the driver knows best what to do.

> I'm sure the ne2k-pci users would also complain if a fixed irq limit is
> added - I bet the majority of the drivers perform worse with a fixed
> limit, only some perform better, and most perform best if they are given
> a notice that they should reduce their irq rate. (e.g. disable
> rx_packet, tx_packet. Leave the error interrupts on, and do the
> rx_packet, tx_packet work in the poll handler)
>

agreed. The reaction should be left to the driver's policy.

> But a hint for the driver ("now switch mitigation on/off") seems to be a
> good idea. And that hint should not be the return value of netif_rx -
> what if the driver is only sending packets?
> What if it's not even a network driver?

For 2.4, unfortunately there was no other way to pass that feedback
without the driver sending a packet up the stack. Our system feedback
probe is based on sampling the backlog queue.

> NAPI seems to be very promising to fix the total system overload case
> (so many packets arrive that despite irq mitigation the system is still
> overloaded).
>
> But the implementation of irq mitigation is driver specific, and a 10
> millisecond stop is far too long.
>

violent agreement.

cheers,
jamal


2001-10-03 15:17:23

by jamal

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5



On Wed, 3 Oct 2001, Ingo Molnar wrote:

Robert has a driver extension (part of Alexey's iputils) that can
generate on the order of 140Kpps (for 100Mbps) and about 900Kpps for the e1000, but
i'll take a look at Simon's stuff if it is available. Marc Boucher has
something that is an in-kernel client/server as well.

> 10.0.3.4 is running vanilla 2.4.11-pre2 UP, a 466 MHz PII box with enough
> RAM, using eepro100. The system effectively locks up - even in the full
> knowledge of what is happening, i can hardly switch consoles, let alone do
> anything like ifconfig eth0 down to fix the lockup. Until this kind of
> load is present the only option is to power-cycle the box. SysRq does not
> work.

use the netif_rx() return code and hardware flow control to fix it.

> and frankly, this has been well-known for a long time - it's just since
> Simon sent me this testcode that i realized how trivial it is. Alexey told
> me about Linux routers effectively locking up if put under 100 mbit IRQ
> load more than a year ago, when i first tried to fix softirq latencies. I
> think if you are doing networking patches then you should be aware of it
> as well.
>

I am fully aware of it. We have progressed extensively since then. Look at
NAPI.

> your refusal to accept this problem as an existing and real problem is
> really puzzling me.
>

I must have miscommunicated. I am not saying there is no problem, otherwise
i wouldn't be working on this to begin with. I am just against your shotgun
approach.

cheers,
jamal

2001-10-03 15:19:13

by jamal

[permalink] [raw]
Subject: Re: [patch] auto-limiting IRQ load take #2, irq-rewrite-2.4.11-F4



Your approach is still wrong. Please do not accept this patch.

cheers,
jamal

On Wed, 3 Oct 2001, Ingo Molnar wrote:

>
> the attached patch contains a cleaned up version of IRQ auto-mitigation.
>
> - i've removed the max_rate limit and have streamlined the impact of the
> load-estimator on do_IRQ() to this piece of code:
>
> desc->total_contexts++;
> if (unlikely(in_interrupt()))
> goto mitigate_irqload;
>
> i dont think we can get much cheaper than this. (We could perhaps avoid
> the total_contexts counter by saving a 'snapshot' of the existing
> kstat.irqs array of counters in every timer tick and comparing the
> snapshot to the current kstat.irqs values. That looked pretty unclean
> though.)
>
> - the per-cpu irq counting in -D9 was incorrect as it collapsed all irq
> handlers into a single counter.
>
> - i've removed the net-polling hacks - they are unrelated to this problem.
>
> the patch is against 2.4.11-pre2. (the eepro100.c fixes from the -ac tree
> are already included in -pre2, i only included them in this patch to make
> patching & testing against 2.4.10 easier.).
>
> (i'd like to stress the point again that the goal of this approach is
> *not* to be nice. This is an airbag mechanizm, it can and will hurt
> performance. But my box does not lock up anymore.)
>
> Ingo
>

2001-10-03 15:30:23

by Ingo Molnar

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5


On Wed, 3 Oct 2001, jamal wrote:

> No. NAPI is for any type of network activities not just for routers or
> sniffers. It works just fine with servers. What do you see in there
> that will make it not work with servers?

eg. such solutions in tulip-NAPI-010910:

/* For now we do this to avoid getting into
IRQ mode too quickly */

if( jiffies - dev->last_rx == 0 ) goto not_done;
[...]
not_done:
[...]
return 1;

combined with this code in the net_rx_action softirq handler:

+ while (!list_empty(&queue->poll_list)) {
+ struct net_device *dev;
[...]
+ if (dev->quota <= 0 || dev->poll(dev, &budget)) {
+ local_irq_disable();
+ list_del(&dev->poll_list);
+ list_add_tail(&dev->poll_list, &queue->poll_list);

while the stated goal of NAPI is to do 'intelligent, feedback-based
polling', apparently the code does not trust its own metrics, and is
forcing the interface into polling mode if we are still within the same 10
msec period of time, or if we have looped 300 times (the default
netdev_max_backlog value). Not very intelligent IMO.

In a generic computing environment i want to spend cycles doing useful
work, not polling. Even the quick kpolld hack [which i dropped, so please
don't regard it as a 'competitor' patch] i consider superior to this, as i
can renice kpolld to reduce polling. (plus kpolld sucks up available idle
cycles as well.) Unless i royally misunderstand it, i cannot stop the
above code from wasting my cycles, and if that is true i do not want to
see it in the kernel proper in this form.

if the only thing done by a system is processing network packets, then
polling is a very nice solution for high loads. So do not take my comments
as an attack against polling.

*if* you can make polling a success ~90% of the time we enter
tulip_poll() under non-specific server load (ie. not routing), then i
think you have really good metrics.

Ingo

2001-10-03 15:42:43

by Ben Greear

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

jamal wrote:

> No. NAPI is for any type of network activities not just for routers or
> sniffers. It works just fine with servers. What do you see in there that
> will make it not work with servers?

Will the NAPI patch, as it sits today, fix all IRQ lockup problems for
all drivers (as Ingo's patch claims to do), or will it just fix
drivers (eepro, tulip) that have been integrated with it?

--
Ben Greear <[email protected]> <[email protected]>
President of Candela Technologies Inc http://www.candelatech.com
ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear

2001-10-03 16:01:39

by jamal

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5



On Wed, 3 Oct 2001, Ben Greear wrote:

> jamal wrote:
>
> > No. NAPI is for any type of network activities not just for routers or
> > sniffers. It works just fine with servers. What do you see in there that
> > will make it not work with servers?
>
> Will NAPI patch, as it sits today, fix all IRQ lockup problems for
> all drivers (as Ingo's patch claims to do), or will it just fix
> drivers (eepro, tulip) that have been integrated with it?

Unfortunately, amongst the three of us tulip seemed to be the most common.
Robert has a GigE Intel (e1000). So patches appear only for those two drivers. I
could write up a document on how to change drivers.

cheers,
jamal

2001-10-03 15:59:19

by jamal

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5



On Wed, 3 Oct 2001, Ingo Molnar wrote:

>
> On Wed, 3 Oct 2001, jamal wrote:
>
> > No. NAPI is for any type of network activities not just for routers or
> > sniffers. It works just fine with servers. What do you see in there
> > that will make it not work with servers?
>
> eg. such solutions in tulip-NAPI-010910:
>
> /* For now we do this to avoid getting into
> IRQ mode too quickly */
>
> if( jiffies - dev->last_rx == 0 ) goto not_done;
> [...]
> not_done:
> [...]
> return 1;

this code was added by Robert to check something; i can't remember the
details on that specific date. The goal is to test various workloads and
conditions before reaching conclusions, so it might have been valid on
that day only.
Take it out and things should work just fine.

> combined with this code in the net_rx_action softirq handler:
>
> + while (!list_empty(&queue->poll_list)) {
> + struct net_device *dev;
> [...]
> + if (dev->quota <= 0 || dev->poll(dev, &budget)) {
> + local_irq_disable();
> + list_del(&dev->poll_list);
> + list_add_tail(&dev->poll_list, &queue->poll_list);
>
> while the stated goal of NAPI is to do 'intelligent, feedback based
> polling', apparently the code is not trusting its own metrics, and is
> forcing the interface into polling mode if we are still within the same 10
> msec period of time, or if we have looped 300 times (default
> netdev_max_backlog value). Not very intelligent IMO.
>

You misunderstood. This is to enforce fairness. Read the paper. When
you have one device sending 100Kpps and another sending 1pps to the stack,
you want to make sure that the 1pps device doesn't get starved -- that's the
purpose of the above code (hence the round-robin scheduling and the
quota per device).

> In a generic computing environment i want to spend cycles doing useful
> work, not polling. Even the quick kpolld hack [which i dropped, so please
> dont regard it as a 'competitor' patch] i consider superior to this, as i
> can renice kpolld to reduce polling. (plus kpolld sucks up available idle
> cycles as well.) Unless i royally misunderstand it, i cannot stop the
> above code from wasting my cycles, and if that is true i do not want to
> see it in the kernel proper in this form.
>

Again, you misunderstood. Please spend a few more minutes reading the code,
and i must insist you read the paper ;->
The interrupt just flags "i, netdev, have work to do"; the poll thread
grabs packets off it when the softirq gets scheduled. So we don't do
unnecessary polling; we only poll when there is work to be done.
In the low-load case this solution reduces to the same as an interrupt-driven
system and scales to the system/CPU capacity.

> if the only thing done by a system is processing network packets, then
> polling is a very nice solution for high loads. So do not take my comments
> as an attack against polling.
>

The poll thread is run as a softirq, just as the other half of networking is
today. And so it should be, because networking is extremely important as a
subsystem.

> *if* you can make polling a success in ~90% of the time we enter
> tulip_poll() under non-specific server load (ie. not routing), then i
> think you have really good metrics.

we can make it 100% successful; i mentioned that we only do work if there
is work to be done.

cheers,
jamal

2001-10-03 16:09:31

by Ben Greear

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

jamal wrote:
>
> On Wed, 3 Oct 2001, Ben Greear wrote:
>
> > jamal wrote:
> >
> > > No. NAPI is for any type of network activities not just for routers or
> > > sniffers. It works just fine with servers. What do you see in there that
> > > will make it not work with servers?
> >
> > Will NAPI patch, as it sits today, fix all IRQ lockup problems for
> > all drivers (as Ingo's patch claims to do), or will it just fix
> > drivers (eepro, tulip) that have been integrated with it?
>
> Unfortunately amongst the three of us tulip seemed to be the most common.
> Robert has a gige intel. So patches appear only for those two drivers. I
> could write up a document on how to change drivers.
>

So, couldn't your NAPI patch be used by drivers that are updated, and
let Ingo's patch be a catch-all for un-fixed drivers? As we move forward,
more and more drivers support your version, and Ingo's patch becomes less
utilized. So long as the patches are tuned such that yours keeps Ingo's
from being triggered on devices you support, there should be no real
conflict, eh?

Ben

> cheers,
> jamal

--
Ben Greear <[email protected]> <[email protected]>
President of Candela Technologies Inc http://www.candelatech.com
ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear

2001-10-03 16:16:51

by Ingo Molnar

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5


On Wed, 3 Oct 2001, Ben Greear wrote:

> So, couldn't your NAPI patch be used by drivers that are updated, and
> let Ingo's patch be a catch-all for un-fixed drivers? As we move
> foward, more and more drivers support your version, and Ingo's patch
> becomes less utilized. So long as the patches are tuned such that
> yours keeps Ingo's from being triggered on devices you support, there
> should be no real conflict, eh?

exactly. auto-mitigation will not hurt NAPI-enabled devices in the least.
Also, auto-mitigation is device-independent.

perhaps Jamal misunderstood the nature of my patch, so i'd like to state
it again: auto-mitigation is a feature that is not triggered normally. I
did a quick hack yesterday to include kpolld - that was a mistake, i was
wrong, and i've removed it. kpolld was mostly an experiment to prove that
TCP network connections can be fully functional during extreme overload
situations as well. Also, auto-mitigation will be a nice mechanism to make
people more aware of the NAPI patch: if they ever notice 'Possible IRQ
overload:' messages then they can be told to try the NAPI patches.

Ingo

2001-10-03 16:20:51

by Jeff Garzik

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

On Wed, 3 Oct 2001, Ben Greear wrote:
> jamal wrote:
> > On Wed, 3 Oct 2001, Ben Greear wrote:
> > > jamal wrote:
> > > > No. NAPI is for any type of network activities not just for routers or
> > > > sniffers. It works just fine with servers. What do you see in there that
> > > > will make it not work with servers?
> > >
> > > Will NAPI patch, as it sits today, fix all IRQ lockup problems for
> > > all drivers (as Ingo's patch claims to do), or will it just fix
> > > drivers (eepro, tulip) that have been integrated with it?
> >
> > Unfortunately amongst the three of us tulip seemed to be the most common.
> > Robert has a gige intel. So patches appear only for those two drivers. I
> > could write up a document on how to change drivers.
>
> So, couldn't your NAPI patch be used by drivers that are updated, and
> let Ingo's patch be a catch-all for un-fixed drivers? As we move foward,
> more and more drivers support your version, and Ingo's patch becomes less
> utilized. So long as the patches are tuned such that yours keeps Ingo's
> from being triggered on devices you support, there should be no real
> conflict, eh?

The main thing for me is that jamal/robert/ANK's work has been
undergoing research and refinement for a while now, with very promising
results combined with minimal impact on network drivers.

Any of Ingo's solutions needs to be tested in a variety of situations
before we can jump on it with any confidence.

For example, although Ingo dismisses shared-irq situations as
an uninteresting case, we need to take that case into account as well,
because starvation can definitely occur.

I'm all for trying out ideas and test patches, but something as core as
hard IRQ handling needs a lot of testing and research in many different
real world situations before we use it.

So far I do not agree that there is a magic bullet...

Jeff



2001-10-03 16:23:21

by Manfred Spraul

[permalink] [raw]
Subject: Re: [patch] auto-limiting IRQ load take #2, irq-rewrite-2.4.11-F4

On Wed, 3 Oct 2001, Ingo Molnar wrote:
>
> the attached patch contains a cleaned up version of IRQ
> auto-mitigation.
>

What's the purpose of the patch?
Should it enable itself under load, or is it an emergency switch if a
broken driver (or broken hardware) causes an IRQ storm that makes the
computer unusable?

As an emergency switch it's a good idea.
But it should never enable itself unless the box is nearly dead, and it
can't replace NAPI and interrupt mitigation.

> (i'd like to stress the point again that the goal of this approach
> is *not* to be nice. This is an airbag mechanizm, it can and
> will hurt performance. But my box does not lock up
> anymore.)
>
Ok, then I like the patch.

--
Manfred



2001-10-03 16:34:52

by Linus Torvalds

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5


On Wed, 3 Oct 2001, Ben Greear wrote:
>
> Will NAPI patch, as it sits today, fix all IRQ lockup problems for
> all drivers (as Ingo's patch claims to do), or will it just fix
> drivers (eepro, tulip) that have been integrated with it?

Note that the big question here is WHO CARES?

There are two issues, and they are independent:
(a) handling of network packet flooding nicely
(b) handling screaming devices nicely.

First off, some comments:
(a) is not a major security issue. If you allow untrusted users full
100/1000Mbps access to your internal network, you have _other_
security issues, like packet sniffing etc that are much much MUCH
worse. So the packet flooding thing is very much a corner case, and
claiming that we have a big problem is silly.

HOWEVER, (a) _can_ be a performance issue under benchmark load.
Benchmarks (unlike real life) are almost always set up to have full
network bandwidth access, and can show this issue.

(b) is to a large degree due to a stupid driver interface. I've wanted to
change the IRQ handler functions to return a flag mask for about
three years, but with hundreds of drivers it's always been a bit too
painful.

Why do we want to return a flag mask? Because we want the _driver_ to
be able to say "shut me up" (if the driver cannot shut itself up and
wants to throttle), and we want the _driver_ to be able to say "Hmm,
that interrupt was not for me", so that the higher levels can quickly
figure out if we have the case of us having two drivers but three
devices, and the third device screaming its head off.
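
(a sketch of the idea - nothing like this exists in the 2.4 interface and
all of the names below are invented, it is only meant to illustrate the
flag-mask return:)

    #define IRQ_RET_HANDLED   0x01   /* "that interrupt was for me" */
    #define IRQ_RET_THROTTLE  0x02   /* "shut me up, i cannot throttle myself" */

    /* handlers would return a mask instead of void */
    static int mydrv_interrupt(int irq, void *dev_id, struct pt_regs *regs)
    {
            struct net_device *dev = dev_id;
            u32 status = mydrv_read_and_ack(dev);      /* invented helper */

            if (!status)
                    return 0;   /* not ours: lets the core spot a screaming third device */

            mydrv_handle_events(dev, status);          /* invented helper */

            if (mydrv_irq_rate_too_high(dev))          /* invented helper */
                    return IRQ_RET_HANDLED | IRQ_RET_THROTTLE;
            return IRQ_RET_HANDLED;
    }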

Ingo tries to fix both of these with a sledgehammer. I'd rather use a bit
more finesse, and as I do not actually agree with the people who seem to
think that this is a major problem TODAY, I'll be more than happy to have
people think about it. The NAPI people have thought about it - but it has
obviously not been discussed _nearly_ widely enough.

I personally am very nervous about Ingo's approach. I do not believe that
it will work well over a wide range of machines, and I suspect that the
"tunables" have been tuned for one load and one machine. I would not be
surprised if Ingo finds that trying to put the machine under heavy disk
load with multiple disk controllers might also cause interrupt mitigation,
which would be unacceptably BAD.

Linus

2001-10-03 16:51:46

by Rik van Riel

[permalink] [raw]
Subject: Re: [patch] auto-limiting IRQ load take #2, irq-rewrite-2.4.11-F4

On Wed, 3 Oct 2001, jamal wrote:

> Your approach is still wrong. Please do not accept this patch.

I rather like the fact that Ingo's approach will keep the
system alive regardless of what driver is used.

Rik
--
DMCA, SSSCA, W3C? Who cares? http://thefreeworld.net/ (volunteers needed)

http://www.surriel.com/ http://distro.conectiva.com/

2001-10-03 16:54:06

by Alexey Kuznetsov

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

Hello!

> In a generic computing environment i want to spend cycles doing useful
> work, not polling.

Ingo, "polling" is the wrong name. It does not poll. :-)
Actually, this misnomer is the thing which I worried about most.

Citing my old explanation:

>"Polling" is not a real polling in fact, it just accepts irqs as
>events waking rx softirq with blocking subsequent irqs.
>Actual receive happens at softirq.
>
>Seems, this approach solves the worst half of livelock problem completely:
>irqs are throttled and tuned to load automatically.
>Well, and drivers become cleaner.

Alexey

2001-10-03 16:54:16

by Ingo Molnar

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5


On Wed, 3 Oct 2001, jamal wrote:

> this code was added by Robert to check something; cant remember the
> details on that specific date. [...]

ok.

> > + while (!list_empty(&queue->poll_list)) {
> > + struct net_device *dev;
> > [...]
> > + if (dev->quota <= 0 || dev->poll(dev, &budget)) {
> > + local_irq_disable();
> > + list_del(&dev->poll_list);
> > + list_add_tail(&dev->poll_list, &queue->poll_list);

> You misunderstood. This is to enforce fairness. [...]

(i did not criticize the list_add/list_del in any way, it's obviously
correct to cycle the polled devices. I highlighted that code only to show
that the current patch as-is polls too aggressively for generic server
load.)

> Read the paper.

(i prefer source code. Can i assume the 'authoritative' patch is the one
with the "goto not_done;" line removed?)

> > In a generic computing environment i want to spend cycles doing useful
> > work, not polling. Even the quick kpolld hack [which i dropped, so please
> > dont regard it as a 'competitor' patch] i consider superior to this, as i
> > can renice kpolld to reduce polling. (plus kpolld sucks up available idle
> > cycles as well.) Unless i royally misunderstand it, i cannot stop the
> > above code from wasting my cycles, and if that is true i do not want to
> > see it in the kernel proper in this form.

> The interupt just flags "i, netdev, have work to do"; [...]

(and the only thing i pointed out was that the patch as-is did not limit
the amount of polling done.)

> > *if* you can make polling a success in ~90% of the time we enter
> > tulip_poll() under non-specific server load (ie. not routing), then i
> > think you have really good metrics.
>
> we can make it 100% successful; i mentioned that we only do work, if
> there is work to be done.

can you really make it 100% successful for rx? Ie. do you only ever call
the ->poll() function if there is a new packet waiting? How do you know
with a 100% probability that someone on the network just sent a new packet
waiting? (without receiving an interrupt to begin with that is.)

Ingo

2001-10-03 17:08:46

by Ingo Molnar

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5


On Wed, 3 Oct 2001 [email protected] wrote:

> Ingo, "polling" is wrong name. It does not poll. :-)

ok. i was also misled by a quick hack in the source code :)

> Actually, this misnomer is the worst thing whic I worried about.

i think something like: 'offloading hardirq work into softirqs' covers the
concept better, right?

> Citing my old explanation:
>
> > "Polling" is not a real polling in fact, it just accepts irqs as
> > events waking rx softirq with blocking subsequent irqs.
> > Actual receive happens at softirq.
> >
> > Seems, this approach solves the worst half of livelock problem
> > completely: irqs are throttled and tuned to load automatically.
> > Well, and drivers become cleaner.

i like this approach very much, and indeed this is not polling in any way.

i'm worried by the dev->quota variable a bit. As visible now in the
2.4.10-poll.pat and tulip-NAPI-010910.tar.gz code, it keeps calling the
->poll() function until dev->quota is gone. I think it should only keep
calling the function until the rx ring is fully processed - and it should
re-enable the receiver afterwards, when exiting net_rx_action.
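
(in terms of the ->poll() interface from the patch, roughly this - the
mydrv_* names are invented, this is only a sketch of the suggestion:)

    static int mydrv_poll(struct net_device *dev, int *budget)
    {
            /* drain the rx ring, decrementing *budget as packets are
               handed to the stack */
            int ring_empty = mydrv_process_rx_ring(dev, budget);

            if (ring_empty) {
                    /* nothing left: re-enable the receiver / rx interrupts
                       here instead of burning the rest of dev->quota */
                    mydrv_enable_rx_irq(dev);
                    return 0;       /* done, drop off the poll list */
            }
            return 1;               /* more work, stay on the poll list */
    }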

Ingo

2001-10-03 17:27:42

by Ingo Molnar

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5


On Wed, 3 Oct 2001, Linus Torvalds wrote:

> [...] I would not be surprised if Ingo finds that trying to put the
> machine under heavy disk load with multiple disk controllers might
> also cause interrupt mitigation, which would be unacceptably BAD.

well, just tested my RAID test system as well. I have not tested heavy
IO-related IRQ load with the patch before (so it was not tuned for that
test in any way), but did so now: an IO test running on 12 disks (5 IO
interfaces: 3 SCSI cards and 2 IDE interfaces) producing 150 MB/sec block
IO load and a fair number of SCSI and IDE interrupts, did not trigger the
overload code. I started the network overload utility during this test,
and the code detected overload on the network interrupt (and only on the
network interrupt). IO load is still high (down to 130 MB/sec), while a
fair amount of networking load is handled as well. (While there certainly
are higher IO loads on some Linux boxes, mine should be above the average
IO traffic.)

Ingo

2001-10-03 17:30:22

by Ingo Molnar

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5


On Wed, 3 Oct 2001, jamal wrote:

> use the netif_rx() return code and hardware flowcontrol to fix it.

i'm using hardware flow control in the patch, but at a different, higher
level. This part of the do_IRQ() code disables the offending IRQ source:

[...]
desc->status |= IRQ_MITIGATED|IRQ_PENDING;
__disable_irq(desc, irq);

which in turn stops that device sooner or later as well. Optionally, in
the future, this can be made more fine-grained for chipsets that support
device-independent IRQ mitigation features, like the USB 2.0 EHCI feature
mentioned by David Brownell.

i'd prefer it if all subsystems and drivers in the kernel behaved properly
and limited their IRQ load - but this does not always happen and users are
hit by irq overload situations.

Your NAPI patch, or any driver/subsystem that does flow control accurately,
should never be affected by it in any way. No overhead, no performance
hit.

Ingo

2001-10-03 18:12:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

In article <[email protected]>,
Ingo Molnar <[email protected]> wrote:
>
>well, just tested my RAID testsystem as well. I have not tested heavy
>IO-related IRQ load with the patch before (so it was not tuned for that
>test in any way), but did so now: an IO test running on 12 disks, (5 IO
>interfaces: 3 SCSI cards and 2 IDE interfaces) producing 150 MB/sec block
>IO load and a fair number of SCSI and IDE interrupts, did not trigger the
>overload code.

Now test it again with the disk interrupt being shared with the network
card.

Doesn't happen? It sure does. It happens more often especially on
slightly lower-end machines (on laptops it's downright disgusting how
often _every_ single PCI device ends up sharing the same interrupt).

And as the lower-end machines are the ones that probably can be forced
to trigger the whole thing more often, this is a real issue.

And on my "high-end" machine, I actually have USB and ethernet on the
same interrupt. It would be kind of nasty if heavy network traffic
makes my camera stop working...

The fact is, there is never any good reason for limiting "trusted"
interrupts, ie anything that is internal to the box. Things like disks,
graphics controllers etc.

Which is why I like the NAPI approach. If somebody overloads my network
card, my USB camera doesn't stop working.

I don't disagree with your patch as a last resort when all else fails,
but I _do_ disagree with it as a network load limiter.

Linus

2001-10-03 18:25:48

by Ingo Molnar

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5


On Wed, 3 Oct 2001, Linus Torvalds wrote:

> Now test it again with the disk interrupt being shared with the
> network card.
>
> Doesn't happen? It sure does. [...]

yes, disk IRQs might be delayed in that case. Without this mechanism there
is a lockup.

> Which is why I like the NAPI approach. If somebody overloads my
> network card, my USB camera doesn't stop working.

i agree that NAPI is a better approach. And IRQ overload does not happen
on cards that have hardware-based irq mitigation support already. (and i
should note that those cards will likely perform even faster with NAPI.)

> I don't disagree with your patch as a last resort when all else fails,
> but I _do_ disagree with it as a network load limiter.

okay - i removed those parts already (kpolld) in today's patch. (It
initially was an experiment to prove that this is the only problem we are
facing under such loads.)

Ingo

2001-10-03 18:33:12

by Davide Libenzi

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

On Wed, 3 Oct 2001, jamal wrote:
> > NAPI seems to be very promising to fix the total system overload case
> > (so many packets arrive that despite irq mitigation the system is still
> > overloaded).
> >
> > But the implementation of irq mitigation is driver specific, and a 10
> > millisecond stop is far too long.
> >
>
> violent agreement.

Ingo's solution moves the mitigation control into the kernel with the
immediate advantage that it'll work right now with existing drivers.
I think that the idea of kirqpoll is good, but the long-term solution
should be to move the mitigation knowledge into the drivers, which
would register their own kirqpoll callbacks when they're going to mask
irqs.
In this case the "intelligence" about irq rates is left in the place where
there's the most knowledge about the nature of the I/O traffic.



- Davide


2001-10-03 19:04:09

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

On Wed, Oct 03, 2001 at 08:53:58PM +0400, [email protected] wrote:
> Citing my old explanation:
>
> >"Polling" is not a real polling in fact, it just accepts irqs as
> >events waking rx softirq with blocking subsequent irqs.
> >Actual receive happens at softirq.
> >
> >Seems, this approach solves the worst half of livelock problem completely:
> >irqs are throttled and tuned to load automatically.
> >Well, and drivers become cleaner.

Well, this sounds like a 2.5 patch. When do we get to merge it?

-ben

2001-10-03 20:02:22

by Simon Kirby

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

On Wed, Oct 03, 2001 at 09:33:12AM -0700, Linus Torvalds wrote:

> Note that the big question here is WHO CARES?
>
> There are two issues, and they are independent:
> (a) handling of network packet flooding nicely
> (b) handling screaming devices nicely.
>
> First off, some comments:
> (a) is not a major security issue. If you allow untrusted users full
> 100/1000Mbps access to your internal network, you have _other_
> security issues, like packet sniffing etc that are much much MUCH
> worse. So the packet flooding thing is very much a corner case, and
> claiming that we have a big problem is silly.
>
> HOWEVER, (a) _can_ be a performance issue under benchmark load.
> Benchmarks (unlike real life) are almost always set up to have full
> network bandwidth access, and can show this issue.

Actually, the way I first started looking at this problem is the result
of a few attacks that have happened on our network. It's not just a
while(1) sendto(); UDP spamming program that triggers it -- TCP SYN
floods show the problem as well, and _there is no way_ to protect against
this without using syncookies or some similar method that can be
done only on the receiving TCP stack.

At one point, one of our webservers received 30-40Mbit/sec of SYN packets
sustained for almost 24 hours. Needless to say, the machine was not
happy.

Simon-

[ Stormix Technologies Inc. ][ NetNation Communications Inc. ]
[ [email protected] ][ [email protected] ]
[ Opinions expressed are not necessarily those of my employers. ]

2001-10-03 20:41:47

by Jeremy Hansen

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5


I better go check my pants...

Thanks
-jeremy

On Wed, 3 Oct 2001, Linus Torvalds wrote:

> In article <[email protected]>,
> Ingo Molnar <[email protected]> wrote:
> >
> >well, just tested my RAID testsystem as well. I have not tested heavy
> >IO-related IRQ load with the patch before (so it was not tuned for that
> >test in any way), but did so now: an IO test running on 12 disks, (5 IO
> >interfaces: 3 SCSI cards and 2 IDE interfaces) producing 150 MB/sec block
> >IO load and a fair number of SCSI and IDE interrupts, did not trigger the
> >overload code.
>
> Now test it again with the disk interrupt being shared with the network
> card.
>
> Doesn't happen? It sure does. It happens more often especially on
> slightly lower-end machines (on laptops it's downright disgusting how
> often _every_ single PCI device ends up sharing the same interrupt).
>
> And as the lower-end machines are the ones that probably can be forced
> to trigger the whole thing more often, this is a real issue.
>
> And on my "high-end" machine, I actually have USB and ethernet on the
> same interrupt. It would be kind of nasty if heavy network traffic
> makes my camera stop working...
>
> The fact is, there is never any good reason for limiting "trusted"
> interrupts, ie anything that is internal to the box. Things like disks,
> graphics controllers etc.
>
> Which is why I like the NAPI approach. If somebody overloads my network
> card, my USB camera doesn't stop working.
>
> I don't disagree with your patch as a last resort when all else fails,
> but I _do_ disagree with it as a network load limiter.
>
> Linus

--
The trouble with being poor is that it takes up all your time.

2001-10-03 21:06:06

by Robert Olsson

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5


Ingo Molnar writes:

> (i did not criticize the list_add/list_del in any way, it's obviously
> correct to cycle the polled devices. I highlited that code only to show
> that the current patch as-is polls too agressively for generic server
> load.)

Yes I think we need some data here...

> can you really make it 100% successful for rx? Ie. do you only ever call
> the ->poll() function if there is a new packet waiting? How do you know
> with a 100% probability that someone on the network just sent a new packet
> waiting? (without receiving an interrupt to begin with that is.)

Well we need RX interrupts not to spin away the CPU or exhaust the PCI
bus. The NAPI scheme is simple: turn off RX interrupts when the first packet
comes and have the kernel pull packets from the RX ring.

I tried pure polling... it's easy, just have your driver return
"not_done" all the time. Not a good idea. :-) Maybe as a softirq test.

If the device has more packets to deliver than is "allowed" we put it
back on the list and the polling process can give the next device its share
and so on. So we handle screaming network devices and packet flooding
nicely and deliver decent performance even under those circumstances.

As you have seen from some code fragments, we have played with some mechanisms
to delay the transition from polling to irq-enable. I think I accepted
not_done polls within the same jiffy in some of the tests. I agree other
variants are possible and hopefully better.

SMP is another area - robustness and performance of course, but in the case
of SMP we also have to deal with packet reordering, which is something
we really want to minimize. Even here I think the NAPI polling scheme
is interesting. During consecutive polls the device is bound to the same
CPU and no packet reordering should occur.

And from the data we have now we can see packet load is evenly distributed
among different CPUs and should follow the smp_affinity setting.

Cheers.

--ro



2001-10-03 22:25:17

by Andreas Dilger

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

On Oct 03, 2001 23:08 +0200, Robert Olsson wrote:
> Ingo Molnar writes:
> > (i did not criticize the list_add/list_del in any way, it's obviously
> > correct to cycle the polled devices. I highlited that code only to show
> > that the current patch as-is polls too agressively for generic server
> > load.)
>
> Yes I think we need some data here...
>
> > can you really make it 100% successful for rx? Ie. do you only ever call
> > the ->poll() function if there is a new packet waiting? How do you know
> > with a 100% probability that someone on the network just sent a new packet
> > waiting? (without receiving an interrupt to begin with that is.)
>
> Well we need RX-interrupts not to spin away the CPU or exhaust the the PCI-
> bus. The NAPI scheme is simple, turn off RX-interrupts when the first packet
> comes and have the kernel to pull packets from the RX-ring.
>
> I tried have pure polling... it easy do just have your driver return
> "not_done" all the time. Not a good idea. :-) Maybe as sofirq test.

I think it is rather easy to make this self-regulating (I may be wrong).

If you get to the stage where you are turning off IRQs and going to a
polling mode, then don't turn IRQs back on until a poll (or
two, or whatever) finds there is no work to be done. This will at worst
give you 50% polling success, but in practice you wouldn't start polling
until there is lots of work to be done, so the real success rate will
be much higher.

At this point (no work to be done when polling) clearly no interrupts
would be generated (because no packets have arrived), so it should be
reasonable to turn interrupts back on and stop polling (assuming
non-broken hardware). You then go back to interrupt-driven work until
the rate increases again. This means you limit IRQ rates when needed,
but only do one or two excess polls before going back to IRQ-driven work.
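
A small self-contained sketch of that "poll until a couple of empty polls,
then re-enable the interrupt" idea (the packet source is simulated and all
names are invented; this is not driver code):

  #include <stdio.h>
  #include <stdlib.h>

  #define EMPTY_POLLS_BEFORE_IRQ 2

  /* Stand-in for "how many packets were waiting in the rx ring this poll". */
  static int rx_ring_work(void)
  {
      return rand() % 4;          /* 0 == nothing to do */
  }

  int main(void)
  {
      int empty = 0, polls = 0;

      printf("rx interrupt disabled, entering polling mode\n");

      /* Stay in polling mode while work keeps appearing; only leave after
       * a couple of consecutive empty polls. */
      while (empty < EMPTY_POLLS_BEFORE_IRQ) {
          int got = rx_ring_work();
          polls++;
          if (got > 0) {
              printf("poll %d: processed %d packets\n", polls, got);
              empty = 0;
          } else {
              printf("poll %d: ring empty\n", polls);
              empty++;
          }
      }

      /* Nothing arrived during the last polls, so nothing was missed:
       * safe to re-enable the rx interrupt and stop polling. */
      printf("re-enabling rx interrupt after %d polls\n", polls);
      return 0;
  }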

Granted, I don't know what the overhead of turning the IRQs on and off
is, but since we do it all the time already (for each ISR) it can't be
that bad.

If you always have work to do when polling, then interrupts will
never be turned on again, but who cares at that point because the work
is getting done? Similarly, if you have IRQs disabled but are sharing
IRQs, there is nothing wrong in polling all devices sharing that IRQ
(at least conceptually).

I don't know much about IRQ handlers, but I assume that this is already
what happens if you are sharing an IRQ - you don't know which of many
sources it comes from, so you poll all of them to see if they have any
work to be done. If you are polling some of the shared-IRQ devices too
frequently (i.e. they never have work to do), you could have some sort
of progressive backoff, so you skip polling those for a growing number
of polls (this could also be set by the driver if it knows that it could
only generate real work every X ms, so we skip about X/poll_rate polls).
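
A toy sketch of such a progressive backoff for devices sharing one line
(the devices, numbers and names below are invented for illustration only):

  #include <stdio.h>

  #define NDEV        3
  #define MAX_BACKOFF 8

  struct shared_dev {
      const char *name;
      int busy;        /* pretend: does this device actually have work? */
      int backoff;     /* how many polls to skip next time it is idle   */
      int skip;        /* polls left to skip                            */
  };

  int main(void)
  {
      struct shared_dev devs[NDEV] = {
          { "eth0", 1, 1, 0 },   /* the device that really raised the IRQ   */
          { "usb",  0, 1, 0 },   /* idle devices sharing the same IRQ line  */
          { "irda", 0, 1, 0 },
      };

      for (int poll = 1; poll <= 10; poll++) {
          printf("poll %2d:", poll);
          for (int i = 0; i < NDEV; i++) {
              struct shared_dev *d = &devs[i];
              if (d->skip > 0) {            /* backed off: don't touch it */
                  d->skip--;
                  continue;
              }
              if (d->busy) {
                  printf(" %s(work)", d->name);
                  d->backoff = 1;           /* productive poll: reset backoff */
              } else {
                  printf(" %s(idle)", d->name);
                  d->skip = d->backoff;     /* idle: skip it for a while */
                  if (d->backoff < MAX_BACKOFF)
                      d->backoff *= 2;      /* and back off further next time */
              }
          }
          printf("\n");
      }
      return 0;
  }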

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2001-10-04 00:47:15

by jamal

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5



On Wed, 3 Oct 2001, Ingo Molnar wrote:

> i like this approach very much, and indeed this is not polling in any way.
>
> i'm worried by the dev->quota variable a bit. As visible now in the
> 2.4.10-poll.pat and tulip-NAPI-010910.tar.gz code, it keeps calling the
> ->poll() function until dev->quota is gone. I think it should only keep
> calling the function until the rx ring is fully processed - and it should
> re-enable the receiver afterwards, when exiting net_rx_action.

This would result in an unfairness. Think of one device which receives
packets so fast that it takes most of the CPU capacity just to process
them.

cheers,
jamal


2001-10-04 00:49:24

by jamal

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5



On Wed, 3 Oct 2001, Ingo Molnar wrote:

>
> On Wed, 3 Oct 2001, jamal wrote:
>
> (and the only thing i pointed out was that the patch as-is did not limit
> the amount of polling done.)

you mean in the softirq or the one line in the driver?

>
> > > *if* you can make polling a success in ~90% of the time we enter
> > > tulip_poll() under non-specific server load (ie. not routing), then i
> > > think you have really good metrics.
> >
> > we can make it 100% successful; i mentioned that we only do work, if
> > there is work to be done.
>
> can you really make it 100% successful for rx? Ie. do you only ever call
> the ->poll() function if there is a new packet waiting? How do you know
> with a 100% probability that someone on the network just sent a new packet
> waiting? (without receiving an interrupt to begin with that is.)
>

Take a look at what i think is the NAPI state machine pending a nod
from Alexey and Robert:
http://www.cyberus.ca/~hadi/NAPI-SM.ps.gz

cheers,
jamal


2001-10-04 00:56:15

by jamal

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5



On Wed, 3 Oct 2001, Ingo Molnar wrote:

>
> On Wed, 3 Oct 2001, jamal wrote:
>
> > use the netif_rx() return code and hardware flowcontrol to fix it.
>
> i'm using hardware flowcontrol in the patch, but at a different, higher
> level. This part of the do_IRQ() code disables the offending IRQ source:
>
> [...]
> desc->status |= IRQ_MITIGATED|IRQ_PENDING;
> __disable_irq(desc, irq);
>
> which in turn stops that device as well sooner or later. Optionally, in
> the future, this can be made more finegrained for chipsets that support
> device-independent IRQ mitigation features, like the USB 2.0 EHCI feature
> mentioned by David Brownell.
>

I think each subsystem should be in charge of its own fate. USB applies
to whatever subsystem it belongs to. Cooperating subsystems doing what is
best for the system.

> i'd prefer it if all subsystems and drivers in the kernel behaved properly
> and limited their IRQ load - but this does not always happen and users are
> hit by irq overload situations.
>

Your patch with Linus' idea of a "flag mask" would be more acceptable as a
last resort. All subsystems should be cooperative, and we resort to this to
send misbehaving kids to their room.

> Your NAPI patch, or any driver/subsystem that does flowcontrol accurately
> should never be affected by it in any way. No overhead, no performance
> hit.

so far your approach is that of a shotgun, i.e. "let me fire into
that crowd and i'll hit my target but don't care if i take down a few
more"; regardless of how noble the reasoning is, it's as Linus described
it -- a sledgehammer.

cheers,
jamal


2001-10-04 01:07:09

by jamal

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5



On Wed, 3 Oct 2001, Simon Kirby wrote:

> On Wed, Oct 03, 2001 at 09:33:12AM -0700, Linus Torvalds wrote:
>
> Actually, the way I first started looking at this problem is the result
> of a few attacks that have happened on our network. It's not just a
> while(1) sendto(); UDP spamming program that triggers it -- TCP SYN
> floods show the problem as well, and _there is no way_ to protect against
> this without using syncookies or some similar method that can only be
> done on the receiving TCP stack only.
>
> At one point, one of our webservers received 30-40Mbit/sec of SYN packets
> sustained for almost 24 hours. Needless to say, the machine was not
> happy.
>

I think you can save yourself a lot of pain today by going to a "better
driver"/hardware. Switch to a tulip based board; in particular one which
is based on the 21143 chipset. Compile in hardware traffic control and
save yourself some pain.
The interface was published but so far only the tulip conforms to it.
It can sustain up to about 90% of the wire rate before it starts
dropping. And at those rates you still have plenty of CPU available.
The ingress policer in the traffic control code might also be able to
help, however CPU cycles are already wasted by the time that code is hit;
with NAPI you should be able to push the filtering much lower in the
stack.

cheers,
jamal

2001-10-04 01:12:49

by jamal

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5



On Wed, 3 Oct 2001, Benjamin LaHaise wrote:

> On Wed, Oct 03, 2001 at 08:53:58PM +0400, [email protected] wrote:
> > Citing my old explanation:
> >
> > >"Polling" is not a real polling in fact, it just accepts irqs as
> > >events waking rx softirq with blocking subsequent irqs.
> > >Actual receive happens at softirq.
> > >
> > >Seems, this approach solves the worst half of livelock problem completely:
> > >irqs are throttled and tuned to load automatically.
> > >Well, and drivers become cleaner.
>
> Well, this sounds like a 2.5 patch. When do we get to merge it?


It is backward compatible with the 2.4 netif_rx(), which means it can go in now.
The problem is that netdrivers that want to use the interface have to be
morphed.
As a general disclaimer, i really dont mean to put down Ingo's efforts; i
just think the irq mitigation idea as it is now is wrong for both 2.4 and 2.5.

cheers,
jamal

2001-10-04 01:30:14

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

On Wed, Oct 03, 2001 at 09:10:10PM -0400, jamal wrote:
> > Well, this sounds like a 2.5 patch. When do we get to merge it?
>
>
> It is backward compatible to 2.4 netif_rx() which means it can go in now.
> The problem is netdrivers that want to use the interface have to be
> morphed.

I'm alluding to the fact that we need a place to put in-development patches.

> As a general disclaimer, i really dont mean to put down Ingo's efforts i
> just think the irq mitigation idea as is now is wrong for both 2.4 and 2.5

What is your solution to the problem? Leaving it up to the driver authors
doesn't work as they're not perfect. Yes, drivers should attempt to do a
good job at irq mitigation, but sometimes a safety net is needed.

-ben

2001-10-04 01:42:36

by jamal

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5



On Wed, 3 Oct 2001, Benjamin LaHaise wrote:

> On Wed, Oct 03, 2001 at 09:10:10PM -0400, jamal wrote:
> > > Well, this sounds like a 2.5 patch. When do we get to merge it?
> >
> >
> > It is backward compatible to 2.4 netif_rx() which means it can go in now.
> > The problem is netdrivers that want to use the interface have to be
> > morphed.
>
> I'm alluding to the fact that we need a place to put in-development patches.
>

Sorry ;-> Yes, where is 2.5 again? ;->

> > As a general disclaimer, i really dont mean to put down Ingo's efforts i
> > just think the irq mitigation idea as is now is wrong for both 2.4 and 2.5
>
> What is your solution to the problem? Leaving it up to the driver authors
> doesn't work as they're not perfect. Yes, drivers should attempt to do a
> good job at irq mitigation, but sometimes a safety net is needed.
>

To be honest i am getting a little nervous about what i saw in something
that is supposed to be a stable kernel. I was nervous when i saw ksoftirqd, but
it's already in there. I think we can use the ksoftirqd replacement pending
testing to show whether latency is improved. I have time this weekend; if that
patch can be isolated it can be tested with NAPI etc.
As for the irq mitigation, in its current form it is insufficient; but it
would be OK to go into 2.5 with plans to then implement the isolation
feature. I would put NAPI into this same category. We can then backport
both back to 2.4.
With current 2.4, i say yes, we leave it to the drivers (and in fact claim
we have a sustainable solution if it is conformed to).

cheers,
jamal

2001-10-04 02:32:07

by Rob Landley

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

On Wednesday 03 October 2001 21:30, Benjamin LaHaise wrote:
> On Wed, Oct 03, 2001 at 09:10:10PM -0400, jamal wrote:
> > > Well, this sounds like a 2.5 patch. When do we get to merge it?
> >
> > It is backward compatible to 2.4 netif_rx() which means it can go in now.
> > The problem is netdrivers that want to use the interface have to be
> > morphed.
>
> I'm alluding to the fact that we need a place to put in-development
> patches.

Such as a 2.5 kernel tree? :)

Sorry, couldn't resist. It was just hanging there... *Sniff* I tried. I
was weak...!

Rob

2001-10-04 03:49:56

by Bill Davidsen

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

In article <[email protected]>
[email protected] wrote:

>there are *tons* of disadvantages if IRQs are shared. In any
>high-performance environment, not having enough interrupt sources is a
>sizing or hw design mistake. You can have up to 200 interrupts even on a
>PC, using multiple IO-APICs. Any decent server board distributes interrupt
>sources properly. Shared interrupts are a legacy of the PC design, and we
>are moving away from it slowly but surely. Especially under gigabit loads
>there are several PCI busses anyway, so getting non-shared interrupts is
>not only easy but a necessity as well. There is no law in physics that
>somehow mandates or prefers the sharing of interrupt vectors: devices are
>distinct, they use up distinct slots in the board. The PCI bus can get
>multiple IRQ sources out of a single card, so even multi-controller cards
>are covered.

Sharing irq between unrelated devices is probably evil in all cases,
but for identical devices like multiple NICs, the shared irq results in
*one* irq call, followed by polling the devices connected, which can be
lower overhead than servicing N interrupts on a multi-NIC system.

Shared interrupts predate the PC by a decade (or more), so the comment
about the "PC design" is not relevant. In general, polling multiple
devices takes less CPU than servicing the same i/o by a larger number of
entries into the interrupt handler. Polling offers the possibility of
lowering the number of context switches, which are far more expensive than
checking a device.

In serial and network devices the poll is often unavoidable; unless
you use one irq for send and one for receive, you will be doing a bit of
polling in any case.

--
bill davidsen <[email protected]>
"If I were a diplomat, in the best case I'd go hungry. In the worst
case, people would die."
-- Robert Lipe

2001-10-04 04:12:49

by Bill Davidsen

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

In article <[email protected]>
[email protected] wrote:
>Note that the big question here is WHO CARES?
>
>There are two issues, and they are independent:
> (a) handling of network packet flooding nicely
> (b) handling screaming devices nicely.
>
>First off, some comments:
> (a) is not a major security issue. If you allow untrusted users full
> 100/1000Mbps access to your internal network, you have _other_
> security issues, like packet sniffing etc that are much much MUCH
> worse. So the packet flooding thing is very much a corner case, and
> claiming that we have a big problem is silly.

Did something give you the idea that this only happens on internal
networks? Generally we have untrusted users on external networks, and
lots of them. I have seen problems on heavily loaded DNS and news
servers, and can easily imagine that routers would get it as well. It
doesn't take someone running a load generator to generate load! I have a
syslog server which gets packets from the cluster, and the irq rate on
that gets high enough to worry me, although that tends to be spike load.

| HOWEVER, (a) _can_ be a performance issue under benchmark load.
| Benchmarks (unlike real life) are almost always set up to have full
| network bandwidth access, and can show this issue.

| Ingo tries to fix both of these with a sledgehammer. I'd rather use a bit
| more finesse, and as I do not actually agree with the people who seem to
| think that this is a major problem TODAY, I'll be more than happy to have
| people think about it. The NAPI people have thought about it - but it has
| obviously not been discussed _nearly_ widely enough.

It is a problem which happens today, on production servers in use
today, and is currently solved by using more servers than would be
needed if the system didn't fall over under this type of load.

| I personally am very nervous about Ingo's approach. I do not believe that
| it will work well over a wide range of machines, and I suspect that the
| "tunables" have been tuned for one load and one machine. I would not be
| surprised if Ingo finds that trying to put the machine under heavy disk
| load with multiple disk controllers might also cause interrupt mitigation,
| which would be unacceptably BAD.

I will agree that some care is going to be needed to avoid choking the
system, but honestly I doubt that there will be a rush of people going
out and bothering with the feature unless they need it. There is some
rate limiting stuff in iptables, and I would bet a six pack of good beer
that very few people bother to use it at all unless they are having a
problem. I don't recall any posts saying "I shot myself in the foot with
packet rate limiting."

As I understand the patch, it applies to individual irqs and not to the
system as a whole. I admit I read the description and not the source.
But even with multiple SCSI controllers, I can't imagine hitting 20k
irq/sec, which you can with a few NICs. I am amazed that Linux can
function at 70k context switches/sec, but it sure doesn't function well!

I think the potential for harm is pretty small, and generally when you
have the problem you run vmstat (or vmstat2) to see what's happening,
and if the system melts just after irq rate hits N, you might start with
80% of N as a first guess. The performance of a locked-up system is
worse than one dropping packets.

The full fix you want is probably a good thing for 2.5; I think it's
just too radical to drop into a stable series (my opinion only).

--
bill davidsen <[email protected]>
"If I were a diplomat, in the best case I'd go hungry. In the worst
case, people would die."
-- Robert Lipe

2001-10-04 06:31:05

by Ingo Molnar

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5


On Wed, 3 Oct 2001, jamal wrote:

> > which in turn stops that device as well sooner or later. Optionally,
> > in the future, this can be made more finegrained for chipsets that
> > support device-independent IRQ mitigation features, like the USB 2.0
> > EHCI feature mentioned by David Brownell.

> I think each subsystem should be in charge of its own fate. USB applies
> to whatever subsystem it belongs to. Cooperating subsystems doing what
> is best for the system.

this is a claim that is nearly perverse and shows a fundamental
misunderstanding of how Linux handles error situations. Perhaps we should
never check NULL pointer dereference in the networking code? Should the
NMI oopser not debug networking related lockups? Should we never print a
warning message on a double enable_irq() in a bad networking driver?

*of course* if a chipset supports IRQ mitigation then the generic IRQ code
can be enabled to use it. We can have networking devices over USB as well.
USB is a bus protocol that provides access to devices, not just a
'subsystem'. And *of course*, the IRQ code is completely right to do
various sanity checks - as it does today.

Linux has various safety nets in various places - always had. It's always
the history of problems in a certain area, the seriousness and impact of
the problem, and the intrusiveness of the safety approach that decides
whether some safety net is added or not, whether it's put under
CONFIG_KERNEL_DEBUG or not. While everybody is free to disagree about the
importance of this particular safety net, just saying 'do not mess with
*our* interrupts' sounds rather childish. Especially considering that
tools are available to trigger lockups via broadband access. Especially
considering that just a few mails earlier you claimed that such lockups do
not even exist. To quote that paragraph of yours:

# Date: Wed, 3 Oct 2001 08:49:51 -0400 (EDT)
# From: jamal <[email protected]>

[...]
# You dont need the patch for 2.4 to work against any lockups. And
# in fact i am surprised that you observe _any_ lockups on a PIII which
# are not observed on my PII. Linux as is, without any tuneups can
# handle up to about 40000 packets/sec input before you start observing
# user space starvations. This is about 30Mbps at 64 byte packets; it's
# about 60Mbps at 128 byte packets and comfortable at 100Mbps with byte
# size of 256. We really dont have a problem at 100Mbps.

so you should never see any lockups.

> Your patch with Linus' idea of "flag mask" would be more acceptable as
> a last resort. All subsystems should be cooperative and we resort to
> this to send misbehaving kids to their room.

i have nothing against it in 2.5, of course. Until then => my patch adds
an irq.c daddy that sends the bad kids to their room.

> > Your NAPI patch, or any driver/subsystem that does flowcontrol accurately
> > should never be affected by it in any way. No overhead, no performance
> > hit.
>
> > so far your approach is that of a shotgun [...]

i'm not sure what this has to do with your NAPI patch. You should never
see the code trigger. It's an unused sledgehammer (or shotgun) put into
the garage, as far as NAPI is concerned. And besides, there are lots of
people on your continent that believe in spare shotguns ;)

i'd rather compare this approach to an airbag, or perhaps shackles.
Interrupt auto-limiting, despite your absurd and misleading analogy, does
not 'destroy' or 'kill' anything. It merely limits an IRQ source for up to
10 msecs (if HZ is 1000 then it's only 1 msec), if that IRQ source has
been detected to be critically misbehaving.

Ingo

2001-10-04 06:37:14

by Ingo Molnar

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5


On Wed, 3 Oct 2001, jamal wrote:

> > i'm worried by the dev->quota variable a bit. As visible now in the
> > 2.4.10-poll.pat and tulip-NAPI-010910.tar.gz code, it keeps calling the
> > ->poll() function until dev->quota is gone. I think it should only keep
> > calling the function until the rx ring is fully processed - and it should
> > re-enable the receiver afterwards, when exiting net_rx_action.
>
> This would result in an unfairness. Think of one device which receives
> packets really fast that it takes most of the CPU capacity just
> processing it.

no, i asked something else.

i'm asking the following thing. dev->quota, as i read the patch now, can
cause extra calls to ->poll() even though the RX ring of that particular
device is empty and the driver has indicated it's done processing RX
packets. (i'm now assuming that the extra-polling-for-a-jiffy line in the
current patch is removed - that one is a showstopper to begin with.) Is
this claim of mine correct?

Ingo

2001-10-04 06:50:07

by Ben Greear

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

jamal wrote:

> Your patch with Linus' idea of "flag mask" would be more acceptable as a
> last resort. All subsystems should be cooperative and we resort to this to
> send misbehaving kids to their room.

That requires re-writing all the drivers, right? Seems a very bad
thing to do in 2.4.

>
> > Your NAPI patch, or any driver/subsystem that does flowcontrol accurately
> > should never be affected by it in any way. No overhead, no performance
> > hit.
>
> so far your approach is that of a shotgun i.e "let me fire in
> that crowd and i'll hit my target but dont care if i take down a few
> more"; regardless of how noble the reasoning is, it's as Linus described
> it -- a sledge hammer.

Aye, but by shooting this target and getting a few bystanders, you save
everyone else... (And it's only a flesh wound!!)

Ben

--
Ben Greear <[email protected]> <Ben_Greear AT excite.com>
President of Candela Technologies Inc http://www.candelatech.com
ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear

2001-10-04 06:47:07

by Ben Greear

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

jamal wrote:
>
> On Wed, 3 Oct 2001, Simon Kirby wrote:
>
> > On Wed, Oct 03, 2001 at 09:33:12AM -0700, Linus Torvalds wrote:
> >
> > Actually, the way I first started looking at this problem is the result
> > of a few attacks that have happened on our network. It's not just a
> > while(1) sendto(); UDP spamming program that triggers it -- TCP SYN
> > floods show the problem as well, and _there is no way_ to protect against
> > this without using syncookies or some similar method that can only be
> > done on the receiving TCP stack only.
> >
> > At one point, one of our webservers received 30-40Mbit/sec of SYN packets
> > sustained for almost 24 hours. Needless to say, the machine was not
> > happy.
> >
>
> I think you can save yourself a lot of pain today by going to a "better
> driver"/hardware. Switch to a tulip based board; in particular one which
> is based on the 21143 chipset. Compile in hardware traffic control and
> save yourself some pain.

The tulip driver only started working for my DLINK 4-port NIC
after about 2.4.8, and last I checked the ZYNX 4-port still refuses
to work, so I wouldn't consider it a paradigm of
stability and grace quite yet. Regardless of that, it is often
impossible to trade NICs (think built-in 1U servers), and claiming
to only work correctly on certain hardware (and potentially lock up
hard on other hardware) is a pretty sorry state of affairs...

Ben

--
Ben Greear <[email protected]> <Ben_Greear AT excite.com>
President of Candela Technologies Inc http://www.candelatech.com
ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear

2001-10-04 06:52:47

by Ingo Molnar

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5


On Wed, 3 Oct 2001, jamal wrote:

> I think you can save yourself a lot of pain today by going to a
> "better driver"/hardware. Switch to a tulip based board; [...]

This is not an option in many cases. (eg. where a company standardizes on
something non-tulip, or due to simple financial/organizational reasons.)
What you say is the approach i see in the FreeBSD camp frequently: "use
these [limited set of] wonderful cards and drivers, the rest sucks
hardware-design-wise and we dont really care about them", which elitist
attitude i strongly disagree with.

Ingo

2001-10-04 06:55:27

by Jeff Garzik

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

On Wed, 3 Oct 2001, Ben Greear wrote:
> That requires re-writing all the drivers, right?

NAPI? No. You mainly move some existing code into a separate function.

Jeff



2001-10-04 06:55:37

by Ingo Molnar

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5


On Wed, 3 Oct 2001, Ben Greear wrote:

> > so far your approach is that of a shotgun i.e "let me fire in
> > that crowd and i'll hit my target but dont care if i take down a few
> > more"; regardless of how noble the reasoning is, it's as Linus described
> > it -- a sledge hammer.
>
> Aye, but by shooting this target and getting a few bystanders, you save
> everyone else... (And it's only a flesh wound!!)

especially considering that the current code nukes the whole city ;)

Ingo

2001-10-04 06:58:57

by Ingo Molnar

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5


On Thu, 4 Oct 2001, Jeff Garzik wrote:

> On Wed, 3 Oct 2001, Ben Greear wrote:
> > That requires re-writing all the drivers, right?
>
> NAPI? [...]

Ben is talking about the long-planned "irq_action->handler() returns a
code that indicates progress" approach Linus talked about. *that* needs
changes to every driver, since every IRQ handler prototype that is
'void' now needs to be changed to return 'int'. (the change is trivial,
but intrusive.)
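
Purely as an illustration of the prototype change being discussed (the
return-code names below are invented, not any agreed-on interface):

  /* Today's 2.4-style handler: the core learns nothing from the call. */
  void my_interrupt(int irq, void *dev_id, struct pt_regs *regs);

  /* The discussed variant (sketch only): return a progress code so the
   * generic IRQ layer can tell "this handler really had work" apart from
   * "nothing for me here", e.g. for overload accounting on shared lines.
   * The constants are made up here for illustration. */
  #define IRQ_HANDLER_IDLE  0
  #define IRQ_HANDLER_WORK  1

  int my_interrupt(int irq, void *dev_id, struct pt_regs *regs);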

Ingo

Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

Ben Greear <[email protected]> writes:

>jamal wrote:
>>
>> I think you can save yourself a lot of pain today by going to a "better
>> driver"/hardware. Switch to a tulip based board; in particular one which
>> is based on the 21143 chipset. Compile in hardware traffic control and
>> save yourself some pain.

>The tulip driver only started working for my DLINK 4-port NIC
>after about 2.4.8, and last I checked the ZYNX 4-port still refuses
>to work, so I wouldn't consider it a paradigm of
>stability and grace quite yet. Regardless of that, it is often
>impossible to trade NICS (think built-in 1U servers), and claiming
>to only work correctly on certain hardware (and potentially lock up
>hard on other hardware) is a pretty sorry state of affairs...

Does it finally do speed and duplex auto negotiation with Cisco
Catalyst Switches? Something I never ever got to work with various 2.0
and 2.2 drivers, mode settings, Catalyst settings, IOS versions and
almost anything else that I ever tried.

Regards
Henning
--
Dipl.-Inf. (Univ.) Henning P. Schmiedehausen -- Geschaeftsfuehrer
INTERMETA - Gesellschaft fuer Mehrwertdienste mbH [email protected]

Am Schwabachgrund 22 Fon.: 09131 / 50654-0 [email protected]
D-91054 Buckenhof Fax.: 09131 / 50654-20

2001-10-04 08:25:17

by Magnus Redin

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5


Linus writes:
> Note that the big question here is WHO CARES?

Everybody building firewalls, routers, high performance web servers
and broadband content servers with a Linux kernel.
Everybody having a 100 Mbit/s external connection.

100 Mbit/s access is not uncommon for broadband access, at least in
Sweden. There are right now a few hundred thousand twisted pair Cat 5
and 5E installations into people's homes with 100 Mbit/s
equipment. Most of them are right now throttled to 10 Mbit/s to save
upstream bandwidth, but that will change as soon as we get more TV
channels on the broadband nets. Cat 5E cabling is specified to be able
to get gigabit into the homes, to minimise the risk of the cabling
becoming worthless in 10 or 20 years.

A 100 Mbit/s untrusted connection is a reality for quite some people,
and it's not unreasonable for linux users when it costs $20-$30 per
month. The peering connection will probably be too weak at that
price, but you still get thousands of untrusted neighbours with a full
100 Mbit/s to your computer.

Btw, I work with production and customer support at a company building
linux based firewalls. I am unfortunately not a developer, but it is
great fun to read the kernel mailinglist and watch misfeatures and
bugs being discovered, discussed and eradicated. Who needs to watch
football when there is the Linux VM battle of wits and engineering?

Best regards,
---
Magnus Redin <[email protected]> Ingate - Firewall with SIP & NAT
Ingate System AB +46 13 214600 http://www.ingate.com/

2001-10-04 08:45:20

by Simon Kirby

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

On Wed, Oct 03, 2001 at 09:04:22PM -0400, jamal wrote:

> I think you can save yourself a lot of pain today by going to a "better
> driver"/hardware. Switch to a tulip based board; in particular one which
> is based on the 21143 chipset. Compile in hardware traffic control and
> save yourself some pain.

Or an Acenic-based card, but that's more expensive.

The problem we had with Tulip-based cards is that it's hard to find a
good model (variant) that is supported with different kernel versions and
stock drivers, doesn't change internally with time, and is easily
distinguishable by our hardware suppliers. "Intel EtherExpress PRO100+"
is difficult to get wrong, and there are generally fewer issues with
driver compatibility because there are many fewer (no) clones, just a few
different board revisions. The same goes with 3COM 905/980s, etc.

I'm not saying Tulips aren't better (they probably are, competition is
good), but eepro100s are quite simple (and have been reliable for our
servers much more than 3com 905s and other cards have been in the past).

Simon-

[ Stormix Technologies Inc. ][ NetNation Communications Inc. ]
[ [email protected] ][ [email protected] ]
[ Opinions expressed are not necessarily those of my employers. ]

2001-10-04 09:24:48

by Ingo Molnar

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5


On Thu, 4 Oct 2001, BALBIR SINGH wrote:

> Ingo, is it possible to provide an interface (optional interface) to
> drivers, so that they can decide how many interrupts are too much.

well, it existed, and i can add it back - i dont have any strong feelings
either way.

> Drivers who feel that they should go in for interrupt mitigation have
> the option of deciding to go for it.

in those cases the 'irq overload' code should not trigger. It's not the
rate of interrupts that matters, it's the amount of time we spend in irq
contexts. The code counts the number of times we 'interrupt an interrupt
context'. Interrupting an irq-context is a sign of irq overload. The code
goes into 'overload mode' (and disables that particular interrupt source
for the rest of the current timer tick) only if more than 97% of all
interrupts from that source 'interrupt an irq context'. (ie. irq load is
really high.) As with any statistical method it has some inaccuracy, but
'statistically' it gets things right.
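
A user-space sketch of that statistic -- names, the threshold and the
fixed-size window are illustrative only; the actual patch accounts per
timer tick rather than per fixed number of interrupts:

  #include <stdio.h>

  #define SAMPLE     1000   /* interrupts counted per decision window    */
  #define THRESHOLD   970   /* ~97%: almost every irq interrupted an irq */

  struct irq_stats {
      int total;            /* interrupts from this source this window   */
      int nested;           /* ...that arrived while already in irq ctx  */
  };

  /* Called (conceptually) from the irq entry path. */
  static int note_irq(struct irq_stats *s, int already_in_irq_context)
  {
      s->total++;
      if (already_in_irq_context)
          s->nested++;

      if (s->total >= SAMPLE) {
          int overloaded = s->nested * 1000 / s->total >= THRESHOLD;
          s->total = s->nested = 0;   /* start a new window */
          return overloaded;          /* caller would disable the line */
      }
      return 0;
  }

  int main(void)
  {
      struct irq_stats eth0 = { 0, 0 };

      /* Simulate a flood: 99% of interrupts land on top of irq context. */
      for (int i = 0; i < 5000; i++) {
          if (note_irq(&eth0, (i % 100) != 0))
              printf("irq overload detected at interrupt %d: "
                     "disable source until the next timer tick\n", i);
      }
      return 0;
  }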

Ingo

2001-10-04 09:19:27

by BALBIR SINGH

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

Ingo, is it possible to provide an interface (optional interface) to drivers,
so that they can decide how many interrupts are too much. Drivers who feel
that they should go in for interrupt mitigation have the option of deciding
to go for it.

Of course, you could also have a ceiling on the maximum number of interrupts,
but the ceiling should be user configurable (using sysctl or /proc); this
would enable administrators to configure their systems depending on what kind
of devices (with shared interrupts or not) they have.

Just my 2cents,
Balbir


Ingo Molnar wrote:

>On Wed, 3 Oct 2001, Linus Torvalds wrote:
>
>>Now test it again with the disk interrupt being shared with the
>>network card.
>>
>>Doesn't happen? It sure does. [...]
>>
>
>yes, disk IRQs might be delayed in that case. Without this mechanizm there
>is a lockup.
>
>>Which is why I like the NAPI approach. If somebody overloads my
>>network card, my USB camera doesn't stop working.
>>
>
>i agree that NAPI is a better approach. And IRQ overload does not happen
>on cards that have hardware-based irq mitigation support already. (and i
>should note that those cards will likely perform even faster with NAPI.)
>
>>I don't disagree with your patch as a last resort when all else fails,
>>but I _do_ disagree with it as a network load limiter.
>>
>
>okay - i removed those parts already (kpolld) in today's patch. (It
>initially was an experiment to prove that this is the only problem we are
>facing under such loads.)
>
> Ingo
>
>





2001-10-04 09:49:11

by BALBIR SINGH

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

Sorry if I missed something in the patch, but here is a question.

Shouldn't the interrupt mitigation be on a per CPU basis?
What I mean is that if a particular CPU is hogged due to some
interrupt, that interrupt should be mitigated on that particular CPU
and not on all CPUs in the system. So, unless an interrupt ends up
taking a lot of time on all CPUs it should still have a chance to
do something.

This could probably help in distributing the interrupts more evenly and
fairly on an SMP system or vice-versa.

Balbir



Ingo Molnar wrote:

>On Thu, 4 Oct 2001, BALBIR SINGH wrote:
>
>>Ingo, is it possible to provide an interface (optional interface) to
>>drivers, so that they can decide how many interrupts are too much.
>>
>
>well, it existed, and i can add it back - i dont have any strong feelings
>either.
>
>>Drivers who feel that they should go in for interrupt mitigation have
>>the option of deciding to go for it.
>>
>
>in those cases the 'irq overload' code should not trigger. It's not the
>rate of interrupts that matters, it's the amount of time we spend in irq
>contexts. The code counts the number of times we 'interrupt an interrupt
>context'. Interrupting an irq-context is a sign of irq overload. The code
>goes into 'overload mode' (and disables that particular interrupt source
>for the rest of the current timer tick) only if more than 97% of all
>interrupts from that source 'interrupt an irq context'. (ie. irq load is
>really high.) As with any statistical method it has some inaccuracy, but
>'statistically' it gets things right.
>
> Ingo
>
>





2001-10-04 10:27:23

by Ingo Molnar

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5


On Thu, 4 Oct 2001, BALBIR SINGH wrote:

> Shouldn't the interrupt mitigation be on a per CPU basis? [...]

this was done by an earlier version of the patch, but it's wrong. An IRQ
cannot arrive at multiple CPUs at once (well, normal device interrupts at
least) - it will arrive either at some random CPU, or can be bound via
/proc/irq/N/smp_affinity. (there are architectures that do
soft-distribution of interrupts, but that can be considered pseudo-random.)

But in both cases, it's the actual, per-irq IRQ load that matters. If one
CPU is hogged by IRQ handlers, that is not an issue - other CPUs can still
take over the work. If *all* CPUs are hogged then the patch detects the
overload.

Ingo

2001-10-04 11:44:29

by jamal

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5



On Thu, 4 Oct 2001, Ingo Molnar wrote:

> i'm asking the following thing. dev->quota, as i read the patch now, can
> cause extra calls to ->poll() even though the RX ring of that particular
> device is empty and the driver has indicated it's done processing RX
> packets. (i'm now assuming that the extra-polling-for-a-jiffy line in the
> current patch is removed - that one is a showstopper to begin with.) Is
> this claim of mine correct?

There should be no extra calls to ->poll() and if there are we should
fix them. Take a look at the state machine i posted earlier.
The one liner is removed.

cheers,
jamal

2001-10-04 11:37:39

by jamal

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5



On Thu, 4 Oct 2001, Ingo Molnar wrote:

>
> On Wed, 3 Oct 2001, jamal wrote:
>
> > > which in turn stops that device as well sooner or later. Optionally,
> > > in the future, this can be made more finegrained for chipsets that
> > > support device-independent IRQ mitigation features, like the USB 2.0
> > > EHCI feature mentioned by David Brownell.
>
> > I think each subsystem should be in charge of its own fate. USB applies
> > to whatever subsystem it belongs to. Cooperating subsystems doing what
> > is best for the system.
>
> this is a claim that is nearly perverse and shows a fundamental
> misunderstanding of how Linux handles error situations. Perhaps we should
> never check NULL pointer dereference in the networking code? Should the
> NMI oopser not debug networking related lockups? Should we never print a
> warning message on a double enable_irq() in a bad networking driver?
>
> *of course* if a chipset supports IRQ mitigation then the generic IRQ code
> can be enabled to use it. We can have networking devices over USB as well.
> USB is a bus protocol that provides access to devices, not just a
> 'subsystem'. And *of course*, the IRQ code is completely right to do
> various sanity checks - as it does today.
>
> Linux has various safety nets in various places - always had. It's always
> the history of problems in a certain area, the seriousness and impact of
> the problem, and the intrusiveness of the safety approach that decides
> whether some safety net is added or not, whether it's put under
> CONFIG_KERNEL_DEBUG or not. While everybody is free to disagree about the
> importance of this particular safety net, just saying 'do not mess with
> *our* interrupts' sounds rather childish. Especially considering that
> tools are available to trigger lockups via broadband access. Especially
> considering that just a few mails earlier you claimed that such lockups do
> not even exist. To quote that paragraph of yours:
>

Your scheme is definitely a safety net, no doubt. But it is incomplete.
Whatever subsystem/softirq/process is in charge of the USB devices is the next
level of delegation. And of course the driver knows best what is good for
the goose. We delegate at each level of the hierarchy, and the ultimate
authority is your code, when it is done right.
But i think we are deviating. We started this with network drivers, which
is where the real proven issue is.

> # Date: Wed, 3 Oct 2001 08:49:51 -0400 (EDT)
> # From: jamal <[email protected]>
>
> [...]
> # You dont need the patch for 2.4 to work against any lockups. And
> # in fact i am surprised that you observe _any_ lockups on a PIII which
> # are not observed on my PII. Linux as is, without any tuneups can
> # handle up to about 40000 packets/sec input before you start observing
> # user space starvations. This is about 30Mbps at 64 byte packets; it's
> # about 60Mbps at 128 byte packets and comfortable at 100Mbps with byte
> # size of 256. We really dont have a problem at 100Mbps.
>
> so you should never see any lockups.
>

I meant for your P3, since i see none on my P2 at 100Mbps with
256 bytes. But you probably meant user-space starvation at maybe
twice that rate, to which i agree, and i apologize for misunderstanding.

> > Your patch with Linus' idea of "flag mask" would be more acceptable as
> > a last resort. All subsystems should be cooperative and we resort to
> > this to send misbehaving kids to their room.
>
> i have nothing against it in 2.5, of course. Until then => my patch adds
> an irq.c daddy that sends the bad kids to their room.

until then, change the eepro to use at least hardware flow control.
If it has mitigation, use the return codes from netif_rx(). Let's see if
that doesn't help you; yes, it's a pain, but it avoids a lot of unknowns which
your patch introduces.
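
The feedback in question is the return value of netif_rx() in 2.4
(NET_RX_SUCCESS, NET_RX_CN_LOW/MOD/HIGH, NET_RX_DROP). A rough sketch of how
a driver's rx path could react to it -- fetch_next_rx_skb(), stop_rx_irq()
and restart_rx_later() are hypothetical stand-ins for the device-specific
pieces, not real functions:

  /* Sketch only, not the actual eepro100 code. */
  static void my_driver_rx(struct net_device *dev)
  {
          struct sk_buff *skb;

          while ((skb = fetch_next_rx_skb(dev)) != NULL) {
                  switch (netif_rx(skb)) {
                  case NET_RX_CN_HIGH:
                  case NET_RX_DROP:
                          /* The stack is congested: stop taking rx
                           * interrupts for a while instead of burning
                           * CPU on packets that will be dropped anyway. */
                          stop_rx_irq(dev);
                          restart_rx_later(dev);
                          return;
                  default:
                          /* NET_RX_SUCCESS or mild congestion: go on. */
                          break;
                  }
          }
  }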

> > > Your NAPI patch, or any driver/subsystem that does flowcontrol accurately
> > > should never be affected by it in any way. No overhead, no performance
> > > hit.
> >
> > so far your approach is that of a shotgun [...]
>
> i'm not sure what this has to do with your NAPI patch. You should never
> see the code trigger. It's an unused sledgehammer (or shotgun) put into
> the garage, as far as NAPI is concerned. And besides, there are lots of
> people on your continent that believe in spare shotguns ;)
>
> i'd rather compare this approach to an airbag, or perhaps shackles.
> Interrupt auto-limiting, despite your absurd and misleading analogy, does
> not 'destroy' or 'kill' anything. It merely limits an IRQ source for up to
> 10 msecs (if HZ is 1000 then it's only 1 msec), if that IRQ source has
> been detected to be critically misbehaving.

Well, i meant two things:
1) you shut down shared interrupts; take a look at this posting by Marcus
Sundberg <[email protected]>

---------------

  0:    7602983   XT-PIC  timer
  1:      10575   XT-PIC  keyboard
  2:          0   XT-PIC  cascade
  8:          1   XT-PIC  rtc
 11:    1626004   XT-PIC  Toshiba America Info Systems ToPIC95 PCI to
                          Cardbus Bridge with ZV Support, Toshiba America
                          Info Systems ToPIC95 PCI to Cardbus Bridge with
                          ZV Support (#2), usb-uhci, eth0, BreezeCom Card,
                          Intel 440MX, irda0
 12:       1342   XT-PIC  PS/2 Mouse
 14:      23605   XT-PIC  ide0

-----------------------------

Now you go and shut down IRQ 11 and punish all devices there. If you can
avoid that, it is acceptable as a temporary replacement to be upgraded to
a better scheme.

2) By not being granular enough and shutting down sources of noise, you
are actually not being effective in increasing system utilization. We've
beaten this to death.

cheers,
jamal


2001-10-04 11:39:09

by Trever L. Adams

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

On Thu, 2001-10-04 at 04:25, Magnus Redin wrote:
>
> Linus writes:
> > Note that the big question here is WHO CARES?
>
> Everybody building firewalls, routers, high performance web servers
> and broadband content servers with a Linux kernel.
> Everyody having a 100 Mbit/s external connection.
>
> 100 Mbit/s access is not uncommon for broadband access, at least in
> Sweden. There are right now a few hundred thousand twisted pair Cat 5
> and 5E installations into peoples homes with 100 Mbit/s
> equipment. Most of them are right now throttled to 10 Mbit/s to save
> upstream bandwidth but that will change as soon as we get more TV
> channels on the broadband nets. Cat 5E cabling is specified to be able
> to get gigabit into the homes to minimise the risk of the cabling
> becoming worthless in 10 or 20 years.

For businesses in some parts of the country, this is also becoming more
common (though it is usually 10 Mbit/s). I believe that this will become
more and more common.

I do not agree with Linus's concept that you are foolish to allow people
"untrusted direct access", in so far as it applies to "no one would/will
allow high speed connections to their machines." Linus, dial-up
connections may not be a thing of the past for years to come, but what
we call high-speed is indeed changing. Let us not let Linux fall
behind. (AirSwitch in Utah offers 10 Mbit/s to the home in at least
Utah County.)

As for the technical debate of how to do this load limiting or
performance enhancement... I say do what is best on technical grounds...
not on bad assumptions. This may mean that the other set of patches
going around may be best, or it may mean Ingo's is best or maybe
something entirely different. I personally do not know!

Trever Adams

2001-10-04 11:50:19

by jamal

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5



On Wed, 3 Oct 2001, Ben Greear wrote:

> The tulip driver only started working for my DLINK 4-port NIC after
> about 2.4.8, and last I checked the ZYNX 4-port still refuses to work,
> so I wouldn't consider it a paradigm of stability and grace quite yet.

The tests in http://www.cyberus.ca/~hadi/247-res/ were done with 4-port znyx
cards using 2.4.7.
What kind of problems are you having? Maybe i can help.

> Regardless of that, it is often impossible to trade NICS (think
> built-in 1U servers), and claiming to only work correctly on certain
> hardware (and potentially lock up hard on other hardware) is a pretty
> sorry state of affairs...

My point is that the API exists. Driver owners could use it; this
discussion seems to have at least helped to point out the existence of the
API. Alexey has had the hardware flow control in there since 2.1.x; use
that at least. In my opinion, Ingo's patch is too radical to be allowed
in when we are approaching stability. And it is a lazy way of solving the
problem.

cheers,
jamal

2001-10-04 11:53:19

by jamal

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5



On Thu, 4 Oct 2001, Ingo Molnar wrote:

>
> On Wed, 3 Oct 2001, Ben Greear wrote:
>
> > > so far your approach is that of a shotgun i.e "let me fire in
> > > that crowd and i'll hit my target but dont care if i take down a few
> > > more"; regardless of how noble the reasoning is, it's as Linus described
> > > it -- a sledge hammer.
> >
> > Aye, but by shooting this target and getting a few bystanders, you save
> > everyone else... (And it's only a flesh wound!!)
>
> especially considering that the current code nukes the whole city ;)
>

Ingo, cut down on the bad mushrooms ;->

cheers,
jamal

2001-10-04 11:52:29

by jamal

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5



On Thu, 4 Oct 2001, Ingo Molnar wrote:

>
> On Wed, 3 Oct 2001, jamal wrote:
>
> > I think you can save yourself a lot of pain today by going to a
> > "better driver"/hardware. Switch to a tulip based board; [...]
>
> This is not an option in many cases. (eg. where a company standardizes on
> something non-tulip, or due to simple financial/organizational reasons.)
> What you say is the approach i see in the FreeBSD camp frequently: "use
> these [limited set of] wonderful cards and drivers, the rest sucks
> hardware-design-wise and we dont really care about them", which elitist
> attitude i strongly disagree with.
>

It is not elitist. Maybe we can force people to use the API now; it
exists. And hardware flow control does not require special hardware
features. As well, NAPI kills the requirement for mitigation in the future.

cheers,
jamal

2001-10-04 11:57:09

by jamal

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5



On Thu, 4 Oct 2001, Simon Kirby wrote:

> On Wed, Oct 03, 2001 at 09:04:22PM -0400, jamal wrote:
>
> > I think you can save yourself a lot of pain today by going to a "better
> > driver"/hardware. Switch to a tulip based board; in particular one which
> > is based on the 21143 chipset. Compile in hardware traffic control and
> > save yourself some pain.
>
> Or an Acenic-based card, but that's more expensive.
>
> The problem we had with Tulip-based cards is that it's hard to find a
> good model (variant) that is supported with different kernel versions and
> stock drivers, doesn't change internally with time, and is easily
> distinguishable by our hardware suppliers. "Intel EtherExpress PRO100+"
> is difficult to get wrong, and there are generally less issues with
> driver compatibility because there are many fewer (no) clones, just a few
> different board revisions. The same goes with 3COM 905/980s, etc.
>
> I'm not saying Tulips aren't better (they probably are, competition is
> good), but eepro100s are quite simple (and have been reliable for our
> servers much more than 3com 905s and other cards have been in the past).
>

This has nothing to do with specific hardware, although i see your point.
Send me an eepro and i'll at least add hardware flow control for you.
The API is simple; it's up to the driver maintainers to use it. This
discussion is good for making people aware of those drivers.

cheers,
jamal

2001-10-04 13:02:48

by Robert Olsson

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5


Ingo Molnar writes:
>
> i'm asking the following thing. dev->quota, as i read the patch now, can
> cause extra calls to ->poll() even though the RX ring of that particular
> device is empty and the driver has indicated it's done processing RX
> packets. (i'm now assuming that the extra-polling-for-a-jiffy line in the
> current patch is removed - that one is a showstopper to begin with.) Is
> this claim of mine correct?

Hello!

Well I'm the one to blame... :-) This comes from my experiments to delay
the transition from polling back to RX-irq-enable mode. This is one of the
areas to be addressed further with NAPI. And this code was not in any of the
files that I announced, I think..?

Cheers.

--ro

2001-10-04 15:22:34

by Tim Hockin

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

> Has nothing to do with specific hardware although i see your point.
> send me an eepro and i'll at least add hardware flow control for you.
> The API is simple, its up to the driver maintainers to use. This
> discussion is good to make people aware of those drivers.


Is there a place where this is explained? I'd be happy to make drivers on
which I work support this. It's like ethtool - easy to do, but no one has
done it because they didn't know.

2001-10-04 15:56:11

by Ben Greear

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

jamal wrote:
>
> On Wed, 3 Oct 2001, Ben Greear wrote:
>
> > The tulip driver only started working for my DLINK 4-port NIC after
> > about 2.4.8, and last I checked the ZYNX 4-port still refuses to work,
> > so I wouldn't consider it a paradigm of stability and grace quite yet.
>
> The tests in http://www.cyberus.ca/~hadi/247-res/ were done with 4-port znyx
> cards using 2.4.7.
> What kind of problems are you having? Maybe i can help.

Mostly problems with auto-negotiation it seems. Earlier 2.4 kernels
just would never go 100bt/FD. Later (broken) versions would claim to
be 100bt/FD, but they still showed lots of collisions and frame errors.

I'll try the ZYNX on the latest kernel in the next few days and let you
know what I find...

> My point is that the API exists. Driver owners could use it; this
> discussion seems to have at least helped to point out the existence of the
> API. Alexey has had the hardware flow control in there since 2.1.x; use
> that at least. In my opinion, Ingo's patch is too radical to be allowed
> in when we are approaching stability. And it is a lazy way of solving the
> problem.

The API has been there since 2.1.x, and yet few drivers support it? I
can see why Ingo decided to fix the problem generically. I think it would
be great if his code printed a log message upon trigger that basically said:
"You should get yourself a NAPI enabled driver that does flow-control if
possible." That may give the appropriate visibility to the issue and let
the driver writers improve their drivers accordingly...
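
Such a message could be as small as a one-shot printk next to the code that
throttles the line; the following is a sketch only, where 'warned_once' is a
hypothetical per-irq flag and not anything in the actual patch:

  /* Sketch: where the patch currently does
   *
   *     desc->status |= IRQ_MITIGATED|IRQ_PENDING;
   *     __disable_irq(desc, irq);
   *
   * a one-shot warning could be added, e.g.: */
  if (!warned_once[irq]) {
          warned_once[irq] = 1;
          printk(KERN_WARNING "irq %d: interrupt load limited; a driver "
                 "with flow control (or a NAPI-style ->poll()) would "
                 "avoid this.\n", irq);
  }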

Ben

>
> cheers,
> jamal
>

--
Ben Greear <[email protected]> <Ben_Greear AT excite.com>
President of Candela Technologies Inc http://www.candelatech.com
ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear

2001-10-04 16:09:42

by Ben Greear

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

"Henning P. Schmiedehausen" wrote:
>
> Ben Greear <[email protected]> writes:
>
> >jamal wrote:
> >>
> >> I think you can save yourself a lot of pain today by going to a "better
> >> driver"/hardware. Switch to a tulip based board; in particular one which
> >> is based on the 21143 chipset. Compile in hardware traffic control and
> >> save yourself some pain.
>
> >The tulip driver only started working for my DLINK 4-port NIC
> >after about 2.4.8, and last I checked the ZYNX 4-port still refuses
> >to work, so I wouldn't consider it a paradigm of
> >stability and grace quite yet. Regardless of that, it is often
> >impossible to trade NICS (think built-in 1U servers), and claiming
> >to only work correctly on certain hardware (and potentially lock up
> >hard on other hardware) is a pretty sorry state of affairs...
>
> Does it finally do speed and duplex auto negotiation with Cisco
> Catalyst Switches? Something I never ever got to work with various 2.0
> and 2.2 drivers, mode settings, Catalyst settings, IOS versions and
> almost anything else that I ever tried.

Check the latest driver; it works with my IBM switch, and with other
EEPRO and Tulip NICs now, so it may work for you. The DLINK 4-port
is actually the only one I know of that I have ever gotten to fully
function. The ZYNX would kind of work at half-duplex for a while,
and an ancient Adaptec I tried locks the whole computer on insmod
of its driver (IRQ routing issues, someone guessed...). There are
several 2-port EEPRO based NICs out there that work really well
too, but they are expensive...

Ben

--
Ben Greear <[email protected]> <Ben_Greear AT excite.com>
President of Candela Technologies Inc http://www.candelatech.com
ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear

2001-10-04 17:27:23

by Davide Libenzi

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

On Wed, 3 Oct 2001, Andreas Dilger wrote:

> If you get to the stage where you are turning off IRQs and going to a
> polling mode, then don't turn IRQs back on until you have a poll (or
> two or whatever) that there is no work to be done. This will at worst
> give you 50% polling success, but in practise you wouldn't start polling
> until there is lots of work to be done, so the real success rate will
> be much higher.
>
> At this point (no work to be done when polling) there are clearly no
> interrupts would be generated (because no packets have arrived), so it
> should be reasonable to turn interrupts back on and stop polling (assuming
> non-broken hardware). You now go back to interrupt-driven work until
> the rate increases again. This means you limit IRQ rates when needed,
> but only do one or two excess polls before going back to IRQ-driven work.
>
> Granted, I don't know what the overhead of turning the IRQs on and off
> is, but since we do it all the time already (for each ISR) it can't be
> that bad.
>
> If you are always having work to do when polling, then interrupts will
> never be turned on again, but who cares at that point because the work
> is getting done? Similarly, if you have IRQs disabled, but are sharing
> IRQs there is nothing wrong in polling all devices sharing that IRQ
> (at least conceptually).
>
> I don't know much about IRQ handlers, but I assume that this is already
> what happens if you are sharing an IRQ - you don't know which of many
> sources it comes from, so you poll all of them to see if they have any
> work to be done. If you are polling some of the shared-IRQ devices too
> frequently (i.e. they never have work to do), you could have some sort
> of progressive backoff, so you skip polling those for a growing number
> of polls (this could also be set by the driver if it knows that it could
> only generate real work every X ms, so we skip about X/poll_rate polls).

This seems a pretty nice solution that achieves 1) limiting the irq
frequency and 2) avoiding the huge shared-irq latency caused by irq masking.
Having per-irq poll callbacks could give the opportunity to poll the
sharing devices from time to time during the offending device's poll loop.

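To make the scheme quoted above concrete, here is a minimal userspace C
sketch of the "poll until a couple of empty polls, then re-enable the
interrupt" idea. All helper names and numbers are invented purely for
illustration; this is not code from any of the patches discussed.

/* Illustrative sketch: poll while there is work, re-enable the IRQ only
 * after a few consecutive empty polls.  All helpers are fake stubs. */
#include <stdio.h>

static int fake_ring[] = { 3, 5, 0, 2, 0, 0 };   /* packets found per poll */
static unsigned int poll_idx;

static void irq_disable(void) { puts("IRQ off, switching to polling"); }
static void irq_enable(void)  { puts("IRQ back on, polling stopped"); }

/* Returns how many packets this poll processed (0 == ring was empty). */
static int poll_device(void)
{
    int n = poll_idx < sizeof(fake_ring) / sizeof(fake_ring[0])
                ? fake_ring[poll_idx++] : 0;
    printf("poll: %d packets\n", n);
    return n;
}

int main(void)
{
    const int idle_polls_needed = 2;    /* "a poll (or two)" with no work */
    int idle_polls = 0;

    irq_disable();
    while (idle_polls < idle_polls_needed) {
        if (poll_device() > 0)
            idle_polls = 0;             /* still busy, keep polling */
        else
            idle_polls++;               /* one more empty poll seen */
    }
    irq_enable();                       /* quiet again: back to IRQ-driven work */
    return 0;
}

With the sample ring above, the loop keeps polling through the busy polls
and only re-enables the interrupt after the two trailing empty polls.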


- Davide


Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

Ben Greear <[email protected]> writes:

>"Henning P. Schmiedehausen" wrote:
>>
>> Does it finally do speed and duplex auto negotiation with Cisco
>> Catalyst Switches? Something I never ever got to work with various 2.0
>> and 2.2 drivers, mode settings, Catalyst settings, IOS versions and
>> almost anything else that I ever tried.

>Check the latest driver, it works with my IBM switch, and with other
>EEPRO and Tulip NICs now, so it may work for you. The DLINK 4-port

Hi,

thanks for the suggestion, but I'm actually sold on using eepro100 and
3c59x NICs; both flavours never gave me any trouble (yes, I know about
the 3c59x and I was always careful to choose either the "B" with 2.0
and early 2.2 and now the "C" with later 2.2). Call me a snob for going
the "FreeBSD way" and choosing HW that works, and for not taking the
challenge of bringing even the most obscure HW lying in a bin at a
customer to work, instead telling the customer "you can now buy a new,
guaranteed flawlessly performing NIC for $25, or pay me for four hours
trying to get _that_ NIC to work. I charge a little more than $25 per
hour..". Got them every time. ;-)

Basically I burned [1] all my tulip NICs a long time ago.

>several 2-port EEPRO based NICs out there that work really well
>too, but they are expensive...

Hm. If I really need more NICs than PCI slots, I normally use a
Router. And I've even toyed a little with a Gigabit card linked to a
Cisco C3524XL using a certain 802.1q unofficial extension to the Linux
kernel to try and provide 24 100 MBit Ethernet Interfaces from a
single Linux Box [2].

Regards
Henning

[1] Put them in the unavoidable Windows NT and 2000 boxes where most of them
with "vendor supported, MHL approved, certified and signed drivers"
crash and burn as happily as under Linux. But then it is the fault of
"the other consultant". I don't do Windows.

[2] Didn't work, though. Got a C7206 instead. O:-)


--
Dipl.-Inf. (Univ.) Henning P. Schmiedehausen -- Geschaeftsfuehrer
INTERMETA - Gesellschaft fuer Mehrwertdienste mbH [email protected]

Am Schwabachgrund 22 Fon.: 09131 / 50654-0 [email protected]
D-91054 Buckenhof Fax.: 09131 / 50654-20

2001-10-04 17:41:24

by Andreas Dilger

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

On Oct 04, 2001 07:34 -0400, jamal wrote:
> 1) you shut down shared interupts; take a look at this posting by Marcus
> Sundberg <[email protected]>
>
> ---------------
>
>  0:    7602983   XT-PIC  timer
>  1:      10575   XT-PIC  keyboard
>  2:          0   XT-PIC  cascade
>  8:          1   XT-PIC  rtc
> 11:    1626004   XT-PIC  Toshiba America Info Systems ToPIC95 PCI
>                          to Cardbus Bridge with ZV Support,
>                          Toshiba America Info Systems ToPIC95 PCI
>                          to Cardbus Bridge with ZV Support (#2),
>                          usb-uhci, eth0, BreezeCom Card, Intel 440MX, irda0
> 12:       1342   XT-PIC  PS/2 Mouse
> 14:      23605   XT-PIC  ide0
>
> -----------------------------
>
> Now you go and shut down IRQ 11 and punish all devices there. If you can
> avoid that, it is acceptable as a temporary replacement to be upgraded to
> a better scheme.

Well, if we fall back to polling devices if the IRQ is disabled, then the
shared interrupt case can be handled as well. However, there were complaints
about the patch when Ingo had device polling included, as opposed to just
IRQ mitigation.

> 2) By not being granular enough and shutting down sources of noise, you
> are actually not being effective in increasing system utilization.

Well, since the IRQ itself uses system resources, if it is disabled it will
allow those resources to actually do something (i.e. polling instead, when
we know there is a lot of work to do).

Even if it does not have polling in the patch, the choice is to turn off
the IRQ, or have the system hang because it can not make any progress
because of the high number of interrupts. If your patch ensures that the
network IRQ load is kept down, then Ingo's will never be activated.

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2001-10-04 18:03:37

by Ben Greear

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

"Henning P. Schmiedehausen" wrote:

> >several 2-port EEPRO based NICs out there that work really well
> >too, but they are expensive...
>
> Hm. If I really need more NICs than PCI slots, I normally use a
> Router. And I've even toyed a little with a Gigabit card linked to a
> Cisco C3524XL using a certain 802.1q unofficial extension to the Linux
> kernel to try and provide 24 100 MBit Ethernet Interfaces from a
> single Linux Box [2].

I wrote (one of) the VLAN patch, and I've brought up 4k
VLAN interfaces. Let me or the [email protected] mailing list
know if you have trouble with my VLAN patch... My vlan patch
can be found:
http://www.candelatech.com/~greear/vlan.html


Enjoy,
Ben

--
Ben Greear <[email protected]> <Ben_Greear AT excite.com>
President of Candela Technologies Inc http://www.candelatech.com
ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear

2001-10-04 18:12:27

by Alan

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

> (a) is not a major security issue. If you allow untrusted users full
> 100/1000Mbps access to your internal network, you have _other_
> security issues, like packet sniffing etc that are much much MUCH
> worse. So the packet flooding thing is very much a corner case, and
> claiming that we have a big problem is silly.

Not nowadays. 100Mbit pipes to the backbone are routine for web serving in
the real world - at least the paying end (aka porn).

Alan

2001-10-04 18:26:08

by jamal

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5



On Thu, 4 Oct 2001, Ben Greear wrote:

> jamal wrote:
> >
> > On Wed, 3 Oct 2001, Ben Greear wrote:
> >
> > > The tulip driver only started working for my DLINK 4-port NIC after
> > > about 2.4.8, and last I checked the ZYNX 4-port still refuses to work,
> > > so I wouldn't consider it a paradigm of stability and grace quite yet.
> >
> > The tests in http://www.cyberus.ca/~hadi/247-res/ were done with 4-port znyx
> > cards using 2.4.7.
> > What kind of problems are you having? Maybe i can help.
>
> Mostly problems with auto-negotiation it seems. Earlier 2.4 kernels
> just would never go 100bt/FD. Later (broken) versions would claim to
> be 100bt/FD, but they still showed lots of collisions and frame errors.
>
> I'll try the ZYNX on the latest kernel in the next few days and let you
> know what I find...

Please do.

>
> > My point is that the API exists. Driver owners could use it; this
> > discussion seems to have at least helped to point out the existence of the
> > API. Alexey has had the hardware flow control in there since 2.1.x; use
> > that at least. In my opinion, Ingo's patch is radical enough to be allowed
> > in when we are approaching stability. And it is a lazy way of solving the
> > problem
>
> The API has been there since 2.1.x, and yet few drivers support it? I
> can see why Ingo decided to fix the problem generically.

That logic is convoluted.

> > > cat /proc/net/softnet_stat
> > > 2b85c320 0000d374 6524ce48 00000000 00000000 00000000 00000000 00000000 0$
> > > 2b8b5e29 0000d615 653eba32 00000000 00000000 00000000 00000000 00000000 0$
>
> So you're printing out counters in HEX?? This seems one place where a nice
> base-10 number would be appropriate :)

It's mostly for formatting reasons:
2b85c320 is 730186528 (and won't fit on one line)
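For what it's worth, reading those hex counters back as decimal is a
one-liner in standard C (the field values below are just the first few
from the dump quoted above):

/* Convert hex counters as printed by /proc/net/softnet_stat to decimal. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *fields[] = { "2b85c320", "0000d374", "6524ce48" };

    for (unsigned int i = 0; i < sizeof(fields) / sizeof(fields[0]); i++)
        printf("%s = %lu\n", fields[i], strtoul(fields[i], NULL, 16));
    return 0;   /* prints, among others: 2b85c320 = 730186528 */
}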

cheers,
jamal

2001-10-04 18:35:48

by jamal

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5



On Thu, 4 Oct 2001, Andreas Dilger wrote:

> On Oct 04, 2001 07:34 -0400, jamal wrote:
> > 1) you shut down shared interupts; take a look at this posting by Marcus
> > Sundberg <[email protected]>
> >
> > ---------------
> >
> >  0:    7602983   XT-PIC  timer
> >  1:      10575   XT-PIC  keyboard
> >  2:          0   XT-PIC  cascade
> >  8:          1   XT-PIC  rtc
> > 11:    1626004   XT-PIC  Toshiba America Info Systems ToPIC95 PCI
> >                          to Cardbus Bridge with ZV Support,
> >                          Toshiba America Info Systems ToPIC95 PCI
> >                          to Cardbus Bridge with ZV Support (#2),
> >                          usb-uhci, eth0, BreezeCom Card, Intel 440MX, irda0
> > 12:       1342   XT-PIC  PS/2 Mouse
> > 14:      23605   XT-PIC  ide0
> >
> > -----------------------------
> >
> > Now you go and shut down IRQ 11 and punish all devices there. If you can
> > avoid that, it is acceptable as a temporary replacement to be upgraded to
> > a better scheme.
>
> Well, if we fall back to polling devices if the IRQ is disabled, then the
> shared interrupt case can be handled as well. However, there were complaints
> about the patch when Ingo had device polling included, as opposed to just
> IRQ mitigation.
>

I don't think you've followed the discussions too well, and normally i
wouldn't respond, but you addressed me. Ingo's netdevice polling is not the
right approach; please look at NAPI and read the paper. NAPI does
all that you've been suggesting. We are not even discussing that at this
point. We are discussing the sledgehammer effect and how you could break a
finger or two trying to kill that fly with it. The example above
illustrates it.

cheers,
jamal


2001-10-04 18:55:20

by Ion Badulescu

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

On Thu, 4 Oct 2001 07:54:19 -0400 (EDT), jamal <[email protected]> wrote:

> Has nothing to do with specific hardware although i see your point.
> send me an eepro and i'll at least add hardware flow control for you.
> The API is simple, its up to the driver maintainers to use. This
> discussion is good to make people aware of those drivers.

A bit of documentation for the hardware flow control API would help as
well. The API might be fine and dandy, but if all you have is a couple of
modified drivers -- some of which are not even in the standard kernel --
then you can bet not many driver writers are going to even be aware of it,
let alone care to implement it.

For instance: in 2.2.19, the help text for CONFIG_NET_HW_FLOWCONTROL says
only tulip supports it in the standard kernel -- yet I can't find that
support anywhere in drivers/net/*.c, tulip.c included.

In 2.4.10 tulip finally supports it (and I'm definitely going to take a
closer look), but that's about it. And tulip is definitely the wrong
example to pick if you want a nice and clean model for your driver.

Ion

--
It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.

2001-10-04 18:53:50

by Christopher E. Brown

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5



On Thu, 4 Oct 2001, Henning P. Schmiedehausen wrote:
>
> Does it finally do speed and duplex auto negotiation with Cisco
> Catalyst Switches? Something I never ever got to work with various 2.0
> and 2.2 drivers, mode settings, Catalyst settings, IOS versions and
> almost anything else that I ever tried.
>
> Regards
> Henning


Let's be fair here: while there are issues with some brands of
tulip card, Cisco is often to blame as well. There are known issues
with N-WAY autoneg on many Ciscos, switches *and* routers.

2001-10-04 19:02:53

by jamal

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5



On Thu, 4 Oct 2001, Ion Badulescu wrote:

> On Thu, 4 Oct 2001 07:54:19 -0400 (EDT), jamal <[email protected]> wrote:
>
> > Has nothing to do with specific hardware although i see your point.
> > send me an eepro and i'll at least add hardware flow control for you.
> > The API is simple, its up to the driver maintainers to use. This
> > discussion is good to make people aware of those drivers.
>
> A bit of documentation for the hardware flow control API would help as
> well. The API might be fine and dandy, but if all you have is a couple of
> modified drivers -- some of which are not even in the standard kernel --
> then you can bet not many driver writers are going to even be aware of it,
> let alone care to implement it.

I could write a small HOWTO at least for HWFLOWCONTROL, since that doesn't
need anything fancy.

>
> For instance: in 2.2.19, the help text for CONFIG_NET_HW_FLOWCONTROL says
> only tulip supports it in the standard kernel -- yet I can't find that
> support anywhere in drivers/net/*.c, tulip.c included.
>

That's dated. It means a doc is needed.

> In 2.4.10 tulip finally supports it (and I'm definitely going to take a
> closer look), but that's about it. And tulip is definitely the wrong
> example to pick if you want a nice and clean model for your driver.
>

I like the tulip code.

cheers,
jamal

2001-10-04 21:16:04

by Ion Badulescu

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

On Thu, 4 Oct 2001, jamal wrote:

> I could write a small HOWTO at least for HWFLOWCONTROL since that doesnt
> need anything fancy.

That'd be very nice.

Thanks,
Ion

--
It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.

Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

> > The paper is at: http://www.cyberus.ca/~hadi/usenix-paper.tgz
> > Robert can point you to the latest patches.
>
>
> Current code... there are still some parts we like to better.
>
> Available via ftp from robur.slu.se:/pub/Linux/net-development/NAPI/
> 2.4.10-poll.pat

I seem to remember jamal saying the NAPI stuff was available
since 2.(early). Is there a stable 2.2.20 patch?

--
Alex Bligh

Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5



--On Wednesday, 03 October, 2001 4:51 PM +0200 Ingo Molnar <[email protected]>
wrote:

> your refusal to accept this problem as an existing and real problem is
> really puzzling me.

In at least one environment known to me (router), I'd rather it
kept accepting packets, and f/w'ing them, and didn't switch VTs etc.
By dropping down performance, you've made the DoS attack even
more successful than it would otherwise have been (the kiddie
looks at effect on the host at the end).

--
Alex Bligh

2001-10-04 21:49:55

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

On Thu, Oct 04, 2001 at 10:28:17PM +0100, Alex Bligh - linux-kernel wrote:
> In at least one environment known to me (router), I'd rather it
> kept accepting packets, and f/w'ing them, and didn't switch VTs etc.
> By dropping down performance, you've made the DoS attack even
> more successful than it would otherwise have been (the kiddie
> looks at effect on the host at the end).

Then bug the driver author of your ethernet cards or turn the hammer off.
You're the sysadmin, you know that your system is unusual. Deal with it.

-ben

2001-10-04 22:01:15

by Simon Kirby

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

On Thu, Oct 04, 2001 at 10:28:17PM +0100, Alex Bligh - linux-kernel wrote:

> In at least one environment known to me (router), I'd rather it
> kept accepting packets, and f/w'ing them, and didn't switch VTs etc.
> By dropping down performance, you've made the DoS attack even
> more successful than it would otherwise have been (the kiddie
> looks at effect on the host at the end).

No.

Ingo is not limiting interrupts to make it drop packets and forget things
just so that userspace can proceed. Instead, he is postponing servicing
of the interrupts so that the card can batch up more packets and the
interrupt will retrieve more at once rather than continually leaving and
entering the interrupt to just pick up a few packets. Without this, the
interrupt will starve everything else, and nothing will get done.

By postponing servicing of the interrupt (and thus increasing latency
slightly), throughput will actually increase.

Obviously, if the card Rx buffers overflow because the interrupts weren't
serviced quickly enough, then packets will be dropped. This is still
better than the machine not being able to actually do anything with the
received packets (and also not able to do anything else such as allow the
administrator to figure out what is happening).
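A toy model of the batching effect Simon describes (the numbers are made up
purely for illustration, not measurements of any driver): with a fixed
per-interrupt overhead, picking packets up in larger batches spends less
CPU per packet.

/* Toy model: a fixed per-interrupt overhead amortised over more packets. */
#include <stdio.h>

int main(void)
{
    const double irq_overhead_us = 10.0;   /* invented cost of one interrupt */
    const double per_packet_us   = 2.0;    /* invented cost of one packet    */
    const int batch_sizes[] = { 1, 4, 16, 64 };

    for (unsigned int i = 0; i < sizeof(batch_sizes) / sizeof(batch_sizes[0]); i++) {
        int b = batch_sizes[i];
        double us_per_packet = irq_overhead_us / b + per_packet_us;
        printf("batch %2d packets/irq -> %5.2f us per packet\n", b, us_per_packet);
    }
    return 0;
}

At a batch size of 1 every packet pays the full interrupt overhead; at 64
packets per interrupt the overhead almost disappears, which is exactly why
postponing the interrupt slightly can raise throughput.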

Simon-

[ Stormix Technologies Inc. ][ NetNation Communications Inc. ]
[ [email protected] ][ [email protected] ]
[ Opinions expressed are not necessarily those of my employers. ]

2001-10-04 22:05:35

by Alan

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

> In at least one environment known to me (router), I'd rather it
> kept accepting packets, and f/w'ing them, and didn't switch VTs etc.
> By dropping down performance, you've made the DoS attack even
> more successful than it would otherwise have been (the kiddie
> looks at effect on the host at the end).

You only think that. After a few minutes the kiddie pulls down your routing
because your route daemons execute no code. Also during the attack your sshd
won't run, so you can't log in to find out what is up.

Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5



--On Thursday, 04 October, 2001 5:49 PM -0400 Benjamin LaHaise
<[email protected]> wrote:

>> In at least one environment known to me (router), I'd rather it
>> kept accepting packets, and f/w'ing them, and didn't switch VTs etc.
>> By dropping down performance, you've made the DoS attack even
>> more successful than it would otherwise have been (the kiddie
>> looks at effect on the host at the end).
>
> Then bug the driver author of your ethernet cards or turn the hammer off.
> You're the sysadmin, you know that your system is unusual. Deal with it.

The hammer has an average age of 13yrs and is difficult to turn off,
unfortunately.

Rather than bugging the author of the driver card, we've actually
been trying to fix it, down to rewriting the firmware. So for
this purpose I/we am/are the driver maintainer thanks. However,
there are limitations like bus speed which mean that in practice
if we receive a large enough number of small packets each second,
the box will saturate.

My point was merely that some applications (and using a linux
box as a router is not that 'unusual') want to deprioritize
different things under resource starvation. Changing the default,
in an unconfigurable way, isn't a great idea. Sure dealing
with external resource exhaustions for hosts is indeed a good
idea. I was just suggesting that it wasn't always what you
wanted to do.

Not sure this required jumping down my throat.

--
Alex Bligh

Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5



--On Thursday, 04 October, 2001 3:01 PM -0700 Simon Kirby
<[email protected]> wrote:

> Ingo is not limiting interrupts to make it drop packets and forget things
> just so that userspace can proceed. Instead, he is postponing servicing
> of the interrupts so that the card can batch up more packets and the
> interrupt will retrieve more at once rather than continually leaving and
> entering the interrupt to just pick up a few packets. Without this, the
> interrupt will starve everything else, and nothing will get done.

Ah OK. In this case we are already looking at interrupt coalescing at the
firmware level, which mitigates this 'earlier on'; however even this
strategy fails at higher pps levels => i.e. in these circumstances
the card buffer is already full-ish, as the interrupt has already been
postponed, and postponing it further can only cause dropped packets
through buffer overrun.

--
Alex Bligh

2001-10-04 23:26:58

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

On Fri, Oct 05, 2001 at 12:20:34AM +0100, Alex Bligh - linux-kernel wrote:
> Rather than bugging the author of the driver card, we've actually
> been trying to fix it, down to rewriting the firmware. So for
> this purpose I/we am/are the driver maintainer thanks. However,
> there are limitations like bus speed which mean that in practice
> if we receive a large enough number of small packets each second,
> the box will saturate.

Not if the driver has a decent irq mitigation scheme and uses the
hw flow control + NAPI bits.

> Not sure this required jumping down my throat.

Frankly I'm sick of this entire discussion where people claim that no
form of interrupt throttling is ever needed. It's an emergency measure
that is needed under some circumstances as very few drivers properly
protect against this kind of DoS. Drivers that do things correctly will
never trigger the hammer. Plus it's configurable. If you'd bothered to
read and understand the rest of this thread you wouldn't have posted.

-ben

Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5



--On Thursday, 04 October, 2001 11:10 PM +0100 Alan Cox
<[email protected]> wrote:

> You only think that. After a few minutes the kiddie pulls down your
> routing because your route daemons execute no code. Also during the
> attack your sshd wont run so you cant log in to find out what is up

There is truth in this. Which is why doing things like
a crude WRED on the card, in the firmware,
(i.e. before it sends the data into user space) is something
we looked at but never got round to.

--
Alex Bligh

2001-10-04 23:34:50

by Simon Kirby

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

On Fri, Oct 05, 2001 at 12:25:41AM +0100, Alex Bligh - linux-kernel wrote:

> > Ingo is not limiting interrupts to make it drop packets and forget things
> > just so that userspace can proceed. Instead, he is postponing servicing
> > of the interrupts so that the card can batch up more packets and the
> > interrupt will retrieve more at once rather than continually leaving and
> > entering the interrupt to just pick up a few packets. Without this, the
> > interrupt will starve everything else, and nothing will get done.
>
> Ah OK. in this case already looking at interupt coalescing at firmware
> level which mitigates this 'earlier on', however even this
> stratgy fails at higher pps levels => i.e. in these circumstances
> the card buffer is already full-ish, as the interrupt has already been
> postponed, and postponing it further can only cause dropped packets
> through buffer overrun.

Right. But right now, the fact that the packets are so small and are
arriving so fast makes the interrupt handler overhead starve everything
else, and interrupt mitigation can make a box that would otherwise be
dead work properly. If the box gets even more packets and the CPU
saturates, then the box would have been dead even earlier without the patch anyway.

Simon-

[ Stormix Technologies Inc. ][ NetNation Communications Inc. ]
[ [email protected] ][ [email protected] ]
[ Opinions expressed are not necessarily those of my employers. ]

2001-10-04 23:47:12

by Robert Love

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

On Thu, 2001-10-04 at 19:26, Benjamin LaHaise wrote:
> Frankly I'm sick of this entire discussion where people claim that no
> form of interrupt throttling is ever needed. It's an emergency measure
> that is needed under some circumstances as very few drivers properly
> protect against this kind of DoS. Drivers that do things correctly will
> never trigger the hammer. Plus it's configurable. If you'd bothered to
> read and understand the rest of this thread you wouldn't have posted.

Agreed. I am actually amazed that the opposite of what is happening
does not happen -- that more people aren't clamoring for this solution.

Six months ago I was testing some TCP application and by accident placed
a sendto() in an infinite loop. The destination of the packets (on my
LAN) locked up completely! And this was a powerful Pentium III with a
3c905 NIC. Not acceptable.

Robert Love

2001-10-04 23:52:02

by Linus Torvalds

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5


On 4 Oct 2001, Robert Love wrote:
>
> Agreed. I am actually amazed that the opposite of what is happening
> does not happen -- that more people aren't clamoring for this solution.

Ehh.. I think that most people who are against Ingo's patches are so
mainly because there _is_ an alternative that looks nicer.

Linus

2001-10-05 00:00:32

by Ben Greear

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

Linus Torvalds wrote:
>
> On 4 Oct 2001, Robert Love wrote:
> >
> > Agreed. I am actually amazed that the opposite of what is happening
> > does not happen -- that more people aren't clamoring for this solution.
>
> Ehh.. I think that most people who are against Ingo's patches are so
> mainly because there _is_ an alternative that looks nicer.
>
> Linus

The alternative (NAPI) only works with Tulip and Intel NICs, it seems.
When the alternative works for every driver known (including 3rd party
ones, like the e100), then it will truly be an alternative. Until
then, it will be a great feature for those who can use it, and the
rest of the poor folks will need a big generic hammer.

From personal experience, I imagine the problem is also that it was
not invented here, where "here" is wherever each of us sits. And I include
myself in that bias!

Ben

--
Ben Greear <[email protected]> <Ben_Greear AT excite.com>
President of Candela Technologies Inc http://www.candelatech.com
ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear

2001-10-05 00:15:53

by Davide Libenzi

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

On Thu, 4 Oct 2001, Ben Greear wrote:

> Linus Torvalds wrote:
> >
> > On 4 Oct 2001, Robert Love wrote:
> > >
> > > Agreed. I am actually amazed that the opposite of what is happening
> > > does not happen -- that more people aren't clamoring for this solution.
> >
> > Ehh.. I think that most people who are against Ingo's patches are so
> > mainly because there _is_ an alternative that looks nicer.
> >
> > Linus
>
> The alternative (NAPI) only works with Tulip and Intel NICs, it seems.
> When the alternative works for every driver known (including 3rd party
> ones, like the e100), then it will truly be an alternative. Untill
> then, it will be a great feature for those who can use it, and the
> rest of the poor folks will need a big generic hammer.

NAPI needs aware drivers and introduces changes to the queue processing
(packets left in the DMA ring), and it'll be 2.5.x at the earliest.
It's clearly a nicer solution that does not suffer from the drawbacks that
Ingo's code has.
Ingo's patch is more hack-ish but addresses the problem with minimal
changes.




- Davide


2001-10-05 02:04:55

by jamal

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5



On Thu, 4 Oct 2001, Ben Greear wrote:

> Linus Torvalds wrote:
> >
> > On 4 Oct 2001, Robert Love wrote:
> > >
> > > Agreed. I am actually amazed that the opposite of what is happening
> > > does not happen -- that more people aren't clamoring for this solution.
> >
> > Ehh.. I think that most people who are against Ingo's patches are so
> > mainly because there _is_ an alternative that looks nicer.
> >
> > Linus
>
> The alternative (NAPI) only works with Tulip and Intel NICs, it seems.
> When the alternative works for every driver known (including 3rd party
> ones, like the e100), then it will truly be an alternative. Untill
> then, it will be a great feature for those who can use it, and the
> rest of the poor folks will need a big generic hammer.
>

Ben,
Let's put some reality check and history in, just for entertainment value:
it took ten years of Linux existence (and i am just using you
as an example, no pun intended) to realize your life was actually
an emergency that depended on Ingo's patch. Maybe i am being cruel,
so let's backtrack only over the last 4 years since Alexey first had the
HFC in there; i am willing to bet a large amount of money that you didn't
once use it or even care to post a query asking if such a thing existed. Ok, so
let's assume you didn't know it existed ... over a year back i posted widely
on it in conjunction with the return-code extension to netif_rx() ...
and still you didn't care that much, although you seem to be a user of one
of the converted drivers -- the tulip, and in particular the znyx hardware
which was used in the testing. IIRC, you actually said something on that
post .. Then one bright early morn, Eastern time zone, Ingo appears, not
in the form of atoms but rather as electrons masquerading as bits ...

cheers,
jamal

PS:- I am going to try and mitigate myself from this thread now; my
email-sending rate will be drastically reduced.

2001-10-05 14:30:53

by Robert Olsson

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5


Alex Bligh - linux-kernel writes:

> I seem to remember jamal saying the NAPI stuff was available
> since 2.(early). Is there a stable 2.2.20 patch?


Hello!

The current NAPI incarnation came first for 2.4.3 and holds the ANK trademark.
Jamal had pre-NAPI patches long before, and we have been testing/profiling
polling and flow control versions of popular network drivers in the lab and
on highly loaded Internet sites for a long time. I consider the NAPI
work to have been initiated by Jamal at OLS two years ago. No, I don't know
of any usable code for 2.2.*

Cheers.

--ro




2001-10-05 14:50:04

by Robert Olsson

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5


Andreas Dilger writes:

> If you get to the stage where you are turning off IRQs and going to a
> polling mode, then don't turn IRQs back on until you have a poll (or
> two or whatever) that there is no work to be done. This will at worst
> give you 50% polling success, but in practise you wouldn't start polling
> until there is lots of work to be done, so the real success rate will
> be much higher.
>
> At this point (no work to be done when polling) there are clearly no
> interrupts would be generated (because no packets have arrived), so it
> should be reasonable to turn interrupts back on and stop polling (assuming
> non-broken hardware). You now go back to interrupt-driven work until
> the rate increases again. This means you limit IRQ rates when needed,
> but only do one or two excess polls before going back to IRQ-driven work.

Hello!

Yes, this has been considered, and actually I think Jamal did this in one of
the pre-NAPI patches. I tried something similar... but instead of using a number
of excess polls I was doing excess polls for a short time (a jiffie). This
was the showstopper mentioned in the previous mails. :-)

Anyway it is up to the driver to decide this policy. If the driver returns
"not_done" it is simply polled again. So low-rate network drivers can have
a different policy compared to an OC-48 driver. Even continuous polling is
therefore possible, and even showstoppers. :-) There is protection against
polling livelocks.

Cheers.
--ro

2001-10-05 15:20:49

by Robert Olsson

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5


Alan Cox writes:

> You only think that. After a few minutes the kiddie pulls down your routing
> because your route daemons execute no code. Also during the attack your sshd
> wont run so you cant log in to find out what is up

Indeed.

I have a real example from a university core router with BGP and full
Internet routing. I managed to get in via ssh during the DoS attack.
We see that the 5 min dropping rate is about the same as the input
rate. The duration of this attack was more than half an hour, BGP survived,
and the box was pretty manageable. This was with a hacked tulip driver
switching to RX-polling at high loads.

eth2: UP Locked MII Full Duplex Link UP
Admin up 6 day(s) 13 hour(s) 47 min 51 sec
Last input NOW
Last output NOW
5min RX bit/s 23.9 M
5min TX bit/s 1.1 M
5min RX pkts/s 46439
5min TX pkts/s 759
5min TX errors 0
5min RX errors 0
5min RX dropped 47038
5min TX dropped 0
5min collisions 0

Well, this was a router, but I think we will very soon have the same demands
for most Internet servers.

Cheers.
--ro


2001-10-05 16:42:34

by Alexey Kuznetsov

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

Hello!

> i'm asking the following thing. dev->quota, as i read the patch now, can
> cause extra calls to ->poll() even though the RX ring of that particular
> device is empty and the driver has indicated it's done processing RX
> packets. (i'm now assuming that the extra-polling-for-a-jiffy line in the
> current patch is removed - that one is a showstopper to begin with.) Is
> this claim of mine correct?

No.

If the ring is empty, the device is removed from the poll list and dev->poll
is not called any more.

dev->quota is to preempt service when the ring does not want to clear.
In that case work remains for the next round, after all the rest
of the interfaces are served. Well, it is there to give the user control
over the distribution of cpu time between interfaces, when the cpu is 100%
utilized and we have to drop something. Devices with lower weights will get
less service.
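A rough userspace sketch of that round-robin-with-quota idea (struct fields,
names and numbers are all invented for illustration; this is not the actual
dev->poll interface): each device is served at most its weight per round,
devices with empty rings drop off the list, and leftover work waits for the
next round.

/* Sketch of round-robin polling with per-device quotas. */
#include <stdio.h>

struct dev {
    const char *name;
    int ring;     /* packets waiting in the RX ring */
    int weight;   /* quota per round: lower weight -> less service */
};

int main(void)
{
    struct dev devs[] = { { "eth0", 50, 16 }, { "eth1", 12, 4 } };
    int round = 0, pending = 1;

    while (pending) {
        pending = 0;
        round++;
        for (unsigned int i = 0; i < sizeof(devs) / sizeof(devs[0]); i++) {
            struct dev *d = &devs[i];
            if (d->ring == 0)
                continue;                    /* empty ring: off the poll list */
            int served = d->ring < d->weight ? d->ring : d->weight;
            d->ring -= served;               /* quota hit: preempted until next round */
            printf("round %d: %s served %d, %d left\n",
                   round, d->name, served, d->ring);
            if (d->ring)
                pending = 1;
        }
    }
    return 0;
}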


> packets. (i'm now assuming that the extra-polling-for-a-jiffy line in the

It is not so bogus with the current kernel with a working ksoftirqd.

The goal was to check what really happens when we enforce polling
even when the machine is generally happy. For me it is not evident a priori:
is more cpu eaten uselessly, or less due to absent irqs?
Note that on a dedicated router it is pretty normal to spin in the context
of ksoftirqd, switching to control tasks when it is required.
And, actually, it is an amazing feature of the scheme that it is so easy
to add such an option.

Anyway, as far as I remember, the question remained unanswered. :-)
Robert even observed that only 9% of cpu is eaten, which surely
cannot be true. :-)

Alexey

2001-10-05 18:49:57

by Andreas Dilger

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

On Oct 05, 2001 16:52 +0200, Robert Olsson wrote:
> > If you get to the stage where you are turning off IRQs and going to a
> > polling mode, then don't turn IRQs back on until you have a poll (or
> > two or whatever) that there is no work to be done. This will at worst
> > give you 50% polling success, but in practise you wouldn't start polling
> > until there is lots of work to be done, so the real success rate will
> > be much higher.
> >
> > At this point (no work to be done when polling) there are clearly no
> > interrupts would be generated (because no packets have arrived), so it
> > should be reasonable to turn interrupts back on and stop polling (assuming
> > non-broken hardware). You now go back to interrupt-driven work until
> > the rate increases again. This means you limit IRQ rates when needed,
> > but only do one or two excess polls before going back to IRQ-driven work.
>
> Yes this has been considered and actually I think Jamal did this in one of
> the pre NAPI patch. I tried something similar... but instead of using a
> number of excess polls I was doing excess polls for a short time (a
> jiffie). This was the showstopper mentioned the previous mails. :-)

(Note that I hadn't read the NAPI paper until after I posted the above, and
it appears that I was describing pretty much what NAPI already does ;-). In
light of that, I wholly agree that NAPI is a superior solution for handling
IRQ overload from a network device.

> Anyway it up to driver to decide this policy. If the driver returns
> "not_done" it is simply polled again. So low-rate network drivers can have
> a different policy compared to an OC-48 driver. Even continues polling is
> therefore possible and even showstoppers. :-) There are protection for
> polling livelocks.

One question which I have is why would you ever want to continue polling
if there is no work to be done? Is it a tradeoff between the amount of
time to handle an IRQ vs. the time to do a poll? An assumption that if
there was previous network traffic there is likely to be more the next
time the interface is checked (assuming you have other work to do between
the time you last polled the device and the next poll)?

Is enabling/disabling of the RX interrupts on the network card an issue
in the sense of "you need to wait X us after writing to this register
for it to take effect", or is there another issue which makes it preferable
to have some "hysteresis" when changing state from IRQ-driven to polling?

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2001-10-05 19:02:10

by Davide Libenzi

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

On Fri, 5 Oct 2001, Andreas Dilger wrote:

> On Oct 05, 2001 16:52 +0200, Robert Olsson wrote:
> > > If you get to the stage where you are turning off IRQs and going to a
> > > polling mode, then don't turn IRQs back on until you have a poll (or
> > > two or whatever) that there is no work to be done. This will at worst
> > > give you 50% polling success, but in practise you wouldn't start polling
> > > until there is lots of work to be done, so the real success rate will
> > > be much higher.
> > >
> > > At this point (no work to be done when polling) there are clearly no
> > > interrupts would be generated (because no packets have arrived), so it
> > > should be reasonable to turn interrupts back on and stop polling (assuming
> > > non-broken hardware). You now go back to interrupt-driven work until
> > > the rate increases again. This means you limit IRQ rates when needed,
> > > but only do one or two excess polls before going back to IRQ-driven work.
> >
> > Yes this has been considered and actually I think Jamal did this in one of
> > the pre NAPI patch. I tried something similar... but instead of using a
> > number of excess polls I was doing excess polls for a short time (a
> > jiffie). This was the showstopper mentioned the previous mails. :-)
>
> (Note that I hadn't read the NAPI paper until after I posted the above, and
> it appears that I was describing pretty much what NAPI already does ;-). In
> light of that, I wholly agree that NAPI is a superior solution for handling
> IRQ overload from a network device.
>
> > Anyway it up to driver to decide this policy. If the driver returns
> > "not_done" it is simply polled again. So low-rate network drivers can have
> > a different policy compared to an OC-48 driver. Even continues polling is
> > therefore possible and even showstoppers. :-) There are protection for
> > polling livelocks.
>
> One question which I have is why would you ever want to continue polling
> if there is no work to be done?

According to the doc the poll is stopped when 1) there are no more packets
to be fetched from the dma ring or 2) the quota is reached.




- Davide


2001-10-05 19:17:31

by Alexey Kuznetsov

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

Hello!

> One question which I have is why would you ever want to continue polling
> if there is no work to be done? Is it a tradeoff between the amount of
> time to handle an IRQ vs. the time to do a poll?

Yes. IRQs even taken alone eat a non-trivial amount of resources.

Actually, I remember Jamal worked with a machine which had
no io-apic, and the irq ack/mask/unmask alone ate >15% of cpu there. :-)

> An assumption that if
> there was previous network traffic there is likely to be more the next
> time the interface is checked (assuming you have other work to do between
> the time you last polled the device and the next poll)?

Exactly.

Note also that the testing of "goto not_done" was made in a pure environment:
a dedicated router. Continuous polling is an evident advantage in this
situation; only power is eaten. I would not enable this on a notebook. :-)


> Is enabling/disabling of the RX interrupts on the network card an issue
> in the sense of "you need to wait X us after writing to this register
> for it to take effect" or other issue which makes it preferrable to have
> some "hysteresis" between changing state from IRQ-driven to polling?

"some hysteresis" is the right word. This loop is an experiment with a still
unknown result. Originally, Jamal proposed to spin several times.
I killed this. Robert then proposed to try an infinite loop. (Note, the
jiffies check is just a way to get rid of completely idle devices;
one jiffie is a long enough time to be considered infinite).

Alexey

2001-10-07 06:09:43

by Robert Olsson

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5


[email protected] writes:

> "some hysteresis" is right word. This loop is an experiment with still
> unknown result yet. Originally, Jamal proposed to spin several times.
> I killed this. Robert proposed to check inifinite loop yet. (Note,
> jiffies check is just a way to get rid of completely idle devices,
> one jiffie is enough lonf time to be considered infinite).
>

And from our discussion about packet-reordering we get even more motivation
for the "extra polls", not only to save IRQs.

We may expand this to others too...

As polling lists are per CPU and consecutive polls stay within the same
CPU, the device becomes bound to one CPU. We are protected against packet
reordering as long as there are consecutive polls.

I've consulted some CS people who have worked with these issues, and I have
understood that packet reordering is a non-trivial problem, at least with a
general approach.

So to me it seems we do very well with a very simple scheme, and as I
understand it all SMP networking will benefit from this.

Our "field test" indicates that the packet load is still well distributed
among the CPUs.

So maybe the showstopper comes out as a showwinner. :-)

Cheers.

--ro

2001-10-07 20:38:07

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

Ingo, could you explain one basic thing to me?

What the hell has the hardirq rate limit logic to do with softirqs?

(btw, I don't know why you call it irq-rewrite, you didn't rewrite
anything, you just added an irq flood avoidance feature by leaving the irq
disabled when you detect an irq flood coming in)

hardirqs have nothing to do with softirqs. Softirqs, as their name suggests,
are a totally software thing; they're generated by software.
Incidentally, for the network stack they're posted from hard irq handlers
because network cards are irq driven, but that's just a special case (of
course it is the common case), it is not the general case.

Your hardirq rate limit logic that leaves the irq disabled for some time
is certainly needed from a security standpoint to avoid DoS if untrusted
users can generate a flood of irqs using some device, unless the
device provides a way to flow control the irq rate (which I understood
most hardware that can generate a flood of irqs provides anyway).

as far as I can tell any change to the softirq logic is completely
orthogonal to the hardirq changes. Changing both things together or
seeing any connection between the two just shows a very limited network
oriented view of the whole picture about the softirqs.

Now I'm not saying that I don't want to change anything in the softirq
logic; for example the deschedule logic made lots of sense and I can see
the benefit for users like the network stack.

Andrea

2001-10-08 00:32:04

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

[ I hope not to reiterate the obvious, I didn't read every single email
of this thread ]

> > > In a generic computing environment i want to spend cycles doing useful
> > > work, not polling. Even the quick kpolld hack [which i dropped, so please
> > > dont regard it as a 'competitor' patch] i consider superior to this, as i
> > > can renice kpolld to reduce polling. (plus kpolld sucks up available idle
> > > cycles as well.) Unless i royally misunderstand it, i cannot stop the
> > > above code from wasting my cycles, and if that is true i do not want to
> > > see it in the kernel proper in this form.
>
> On Wed, 3 Oct 2001, jamal wrote:
> > The interupt just flags "i, netdev, have work to do"; [...]
On Wed, Oct 03, 2001 at 06:51:55PM +0200, Ingo Molnar wrote:
> (and the only thing i pointed out was that the patch as-is did not limit
> the amount of polling done.)

You're perfectly right that it's not ok for a generic computing
environment to spend lots of cpu in polling, but it is clear that in a
dedicated router/firewall we can just shut down the NIC interrupt forever via
disable_irq (no matter if the nic supports hw flow control or not, and
in turn no matter if the kid tries to spam the machine with small
packets) and dedicate 1 cpu to the polling work, with ksoftirqd polling
the NIC forever to deliver maximal routing performance or something like
that. ksoftirqd will ensure fairness with the userspace load as well.
You probably wouldn't get a benefit with tux because you would
potentially lose way too much cpu with true polling and your traffic
is mostly going from the server to the clients, not the other way around
(plus the clients use delayed acks etc..), but the world isn't just
tux.
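As a userspace analogue of "dedicate one cpu to the polling work" (the
device and all names here are faked, build with -pthread; this is only a
sketch of the idea, not anything from the kernel patches): pin a
busy-polling thread to one CPU and let everything else run elsewhere.

/* Userspace analogue: one CPU busy-polls a (fake) NIC, no interrupts. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static atomic_long packets_seen;
static atomic_int stop;

static int nic_poll(void) { return 1; }      /* pretend a packet is always there */

static void *poll_cpu(void *arg)
{
    (void)arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(1, &set);                        /* bind the poller to CPU 1 */
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    while (!atomic_load(&stop))              /* poll forever, the IRQ stays disabled */
        atomic_fetch_add(&packets_seen, nic_poll());
    return NULL;
}

int main(void)
{
    pthread_t t;

    pthread_create(&t, NULL, poll_cpu, NULL);
    sleep(1);                                /* the "userspace load" runs elsewhere */
    atomic_store(&stop, 1);
    pthread_join(t, NULL);
    printf("polled %ld packets in ~1s\n", atomic_load(&packets_seen));
    return 0;
}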

Of course we agree that such a "polling router/firewall" behaviour must
not be the default but it must be enabled on demand by the admin via
sysctl or whatever else userspace API. And I don't see any problem with
that.

Andrea

2001-10-08 04:58:18

by Bernd Eckenfels

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

In article <[email protected]> you wrote:
> You're perfectly right that it's not ok for a generic computing
> environment to spend lots of cpu in polling, but it is clear that in a
> dedicated router/firewall we can just shutdown the NIC interrupt forever via
> disable_irq (no matter if the nic supports hw flow control or not, and
> in turn no matter if the kid tries to spam the machine with small
> packets) and dedicate 1 cpu to the polling-work with ksoftirqd polling
> forever the NIC to deliver maximal routing performance or something like
> that.

Yes, have a look at the work of the Click Modular Router PPL from MIT,
which has a Polling Router Module implementation that outperforms Linux kernel
routing by far (according to their paper :)

You can find the link to Click somewhere on my page:
http://www.freefire.org/tools/index.en.php3 in the Operating System section
(i think)

I can recommend the click-paper.pdf

Greetings
Bernd

2001-10-08 14:05:17

by jamal

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5



On Fri, 5 Oct 2001 [email protected] wrote:

> Hello!
>
> > One question which I have is why would you ever want to continue polling
> > if there is no work to be done? Is it a tradeoff between the amount of
> > time to handle an IRQ vs. the time to do a poll?
>
> Yes. IRQ even taken alone eat non-trivial amount of resources.
>
> Actually, I remember Jamal worked with machine, which had
> no io-apic and only irq ack/mask/unmask eated >15% of cpu there. :-)
>

This was Robert actually; the conclusion was that interrupts are very expensive.
If we can get rid of as many of them as possible, we are getting a side
benefit. I can't find the old data, but Robert has some data over here:
http://robur.slu.se/Linux/net-development/experiments/010301



> "some hysteresis" is right word. This loop is an experiment with still
> unknown result yet. Originally, Jamal proposed to spin several times.
> I killed this.

It was a good idea that you killed it, now that i think about it in retrospect.
The solution is much cleaner without it.

> Robert proposed to check inifinite loop yet. (Note,
> jiffies check is just a way to get rid of completely idle devices,
> one jiffie is enough lonf time to be considered infinite).
>

In my opinion we really don't need this. I did some quick testing, with and
without it, and i don't see any difference.

cheers,
jamal

2001-10-08 14:47:53

by jamal

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5



>Yes, have a look at the work of the Click Modular Router PPL from MIT,
>having a Polling Router Module Implementatin which outperforms Linux
>Kernel Routing by far (according to their paper :)

I have read the click paper; i also just looked at the code, and it seems
the tulip driver they use has the same roots as ours (based on Alexey's
initial HFC driver).

Several things to note/observe:
- They use a very specialized piece of hardware (with two PCI buses).
- Robert's results on single-PCI-bus hardware showed ~360Kpps
routing vs Click's 435Kpps. This is not "far off" given the differences in
hardware. What would be really interesting is to have the click folks
post their latency results. I am curious as to what the purely polling
scheme they have would achieve (as opposed to NAPI, which is a mixture of
interrupts and polls).
- Linux is already "very modular" as a router with both the traffic
control framework and netfilter. I like their language specification etc;
ours is a little more primitive in comparison.
- Click seems to only run on a system that is designated as a router (as
you seem to point out).

Linux has a few other perks, but the above were to compare the two.

> You can find the Link to Click somewhere on my Page:
> http://www.freefire.org/tools/index.en.php3in the Operating System
> section (i think)

Nice web page and collection, btw. The right web page seems to be:
http://www.freefire.org/tools/index.en.php3

I looked at the latest click paper on SMP. It would help if they were
aware of what's happening on Linux (since it seems to be their primary OS).
softnet does what they are asking for, sans the scheduling (which in Linux
proper is done via the IRQ scheduling). They also have a way for the
admin to specify the scheduling scheme, which is nice, but i am not sure
it is very valuable; I'll read the paper again to avoid hasty judgement.
It would be nice to work with the click people (at least to avoid
redundant work and maybe to get Linux mentioned in their paper -- they
even mention ALTQ but forget Linux, which is more advanced ;->).

cheers,
jamal

2001-10-08 14:55:33

by Alan

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

> Of course we agree that such a "polling router/firewall" behaviour must
> not be the default but it must be enabled on demand by the admin via
> sysctl or whatever else userspace API. And I don't see any problem with
> that.

No, I don't agree. "Stop random end users crashing my machine at will" is not
a magic sysctl option - it's a default.


Alan

2001-10-08 15:04:23

by Jeff Garzik

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

On Mon, 8 Oct 2001, Alan Cox wrote:
> > Of course we agree that such a "polling router/firewall" behaviour must
> > not be the default but it must be enabled on demand by the admin via
> > sysctl or whatever else userspace API. And I don't see any problem with
> > that.
>
> No I don't agree. "Stop random end users crashing my machine at will" is not
> a magic sysctl option - its a default.

I think (Ingo's?) analogy of an airbag was appropriate, if that's indeed
how the code winds up functioning.

Having a mechanism that prevents what would otherwise be a lockup is
useful. NAPI is useful. Having both would be nice :)

Jeff




2001-10-08 15:07:24

by Alan

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

> I think (Ingo's?) analogy of an airbag was appropriate, if that's indeed
> how the code winds up functioning.

Very much so

"Driver killed because the air bag enable is off by default and only
mentioned on page 87 of the handbook in a footnote"

> Having a mechanism that prevents what would otherwise be a lockup is
> useful. NAPI is useful. Having both would be nice :)

NAPI is important - the irq disable tactic is a last resort. If the right
hardware is irq flood aware it should only ever trigger to save us from
irq routing errors (eg cardbus hangs)

2001-10-08 15:09:54

by Bill Davidsen

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

In article <[email protected]> [email protected] wrote:

| You're perfectly right that it's not ok for a generic computing
| environment to spend lots of cpu in polling, but it is clear that in a
| dedicated router/firewall we can just shutdown the NIC interrupt forever via
| disable_irq (no matter if the nic supports hw flow control or not, and
| in turn no matter if the kid tries to spam the machine with small
| packets) and dedicate 1 cpu to the polling-work with ksoftirqd polling
| forever the NIC to deliver maximal routing performance or something like
| that. ksoftirqd will ensure fairness with the userspace load as well.
| You probably wouldn't get a benefit with tux because you would
| potentially lose way too much cpu with true polling and you're traffic
| is mostly going from the server to the clients not the othet way around
| (plus the clients uses delayed acks etc..), but the world isn't just
| tux.
|
| Of course we agree that such a "polling router/firewall" behaviour must
| not be the default but it must be enabled on demand by the admin via
| sysctl or whatever else userspace API. And I don't see any problem with
| that.

Depending on implementation, this may be an acceptable default,
assuming that the code can determine when too many irqs are being
serviced. There are many servers, and even workstations in campus
environments, which would benefit from changing to polling under burst
load. I don't know if even a router need be locked in that state, it
should stay there under normal load.

I'm convinced that polling under heavy load is beneficial or
non-harmful on virtually every type of networked machine. Actually, this
applies to any machine subject to interrupt storms: many device-interface or
control systems can get high rates due to physical events; networking is
just a common case of the problem.

--
bill davidsen <[email protected]>
"If I were a diplomat, in the best case I'd go hungry. In the worst
case, people would die."
-- Robert Lipe

2001-10-08 15:13:05

by jamal

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5



On Mon, 8 Oct 2001, Alan Cox wrote:

> NAPI is important - the irq disable tactic is a last resort. If the right
> hardware is irq flood aware it should only ever trigger to save us from
> irq routing errors (eg cardbus hangs)

Agreed. As long as the IRQ flood protector can do proper isolation.
Here's what i see on my dell latitude laptop with a built-in ethernet (not
cardbus related ;->)

-------------------------------
[root@jzny /root]# cat /proc/interrupts
           CPU0
  0:   29408219   XT-PIC  timer
  1:     332192   XT-PIC  keyboard
  2:          0   XT-PIC  cascade
 10:     643040   XT-PIC  Texas Instruments PCI1410 PC card Cardbus Controller, eth0
 11:         17   XT-PIC  usb-uhci
 12:    2207062   XT-PIC  PS/2 Mouse
 14:     307504   XT-PIC  ide0
NMI: 0
LOC: 0
ERR: 0
MIS: 0
-----------------------------

cheers,
jamal

2001-10-08 15:17:05

by Alan

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

> On Mon, 8 Oct 2001, Alan Cox wrote:
>
> > NAPI is important - the irq disable tactic is a last resort. If the right
> > hardware is irq flood aware it should only ever trigger to save us from
> > irq routing errors (eg cardbus hangs)
>
> Agreed. As long as the IRQ flood protector can do proper isolation.
> Here's hat i see on my dell latitude laptop with a built in ethernet (not
> cardbus related ;->)

It doesnt save you from horrible performance. NAPI is there to do that, it
saves you from a dead box. You can at least rmmod the cardbus controller
with protection in place (or go looking for the problem with a debugger)

2001-10-08 15:21:06

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

On Mon, Oct 08, 2001 at 04:00:36PM +0100, Alan Cox wrote:
> > Of course we agree that such a "polling router/firewall" behaviour must
> > not be the default but it must be enabled on demand by the admin via
> > sysctl or whatever else userspace API. And I don't see any problem with
> > that.
>
> No I don't agree. "Stop random end users crashing my machine at will" is not
> a magic sysctl option - its a default.

The "random user hanging my machine" issue has nothing to do with "it is ok in
a router to dedicate one cpu to polling".

The whole email was about "in a router it is ok to poll"; I'm not saying "to
solve the flood problem you should be forced to turn on polling".

I also said that if you turn on polling you also solve the DoS, yes, but
that was just a side note. My only implicit thought about the side note
was that most machines sensitive to the DoS are routers where people
want the max performance and where they can dedicate one cpu (even on
UP) to polling. So the only argument I can make is that the amount of
the userbase concerned about the "current" hardirq DoS would decrease
significantly if the polling method became available in linux.

I'm certainly not saying that the "stop random user crashing my machine
at will" should be a sysctl option and not the default.

Andrea

2001-10-08 15:23:25

by jamal

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5



On Mon, 8 Oct 2001, Alan Cox wrote:

> It doesnt save you from horrible performance. NAPI is there to do that, it
> saves you from a dead box. You can at least rmmod the cardbus controller
> with protection in place (or go looking for the problem with a debugger)

I hear you, but I think isolation is important;
if I am telnetted (literal example here) onto that machine (note eth0 is
not cardbus based) and the cardbus is causing the loops, then I am screwed.
[The same applies to everything that shares interrupts]

cheers,
jamal

2001-10-08 15:25:35

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

On Mon, Oct 08, 2001 at 04:12:53PM +0100, Alan Cox wrote:
> "Driver killed because the air bag enable is off by default and only
> mentioned on page 87 of the handbook in a footnote"

Nobody suggested not adding "an airbag" by default.

In fact, the polling isn't an airbag at all: when you poll you're flying,
so you never need an airbag; only when you're on the ground might you
need one.

Another thing I said recently is that the hardirq airbag has nothing to
do with softirqs, and that's right. Patches messing with the softirq logic
as a function of the hardirq airbag are just totally broken, or at least
confusing, because the two were incidentally merged together by mistake.

Andrea

2001-10-08 15:29:55

by Alan

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

> I hear you, but I think isolation is important;
> if I am telnetted (literal example here) onto that machine (note eth0 is
> not cardbus based) and the cardbus is causing the loops, then I am screwed.
> [The same applies to everything that shares interrupts]

Worst case it sucks, but it isn't dead.

Once you disable the IRQ and kick over to polling, the cardbus and the
ethernet both still get regular service. OK, so your pps rate and your
latency are unpleasant, but you are not dead.

For a shared IRQ we know we can safely switch to a 200Hz poll of shared
irq lines marked 'stuck'. The problem ones are non-shared ISA devices going
mad - there you have to be careful not to fake more irqs than real ones
are delivered, since some ISA device drivers "know" the IRQ is for them.

Even at 200Hz polling, a typical cardbus card with say 32 ring buffer slots
can process 6000pps.
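
For reference, the arithmetic behind that figure, using the numbers quoted
above:

/* Back-of-envelope check: one full rx ring drained per poll. */
#define POLL_HZ		200	/* timer polls per second  */
#define RING_SLOTS	32	/* rx descriptors per poll */

/* 200 * 32 = 6400 packets/sec, i.e. roughly the 6000pps quoted. */
#define POLLED_PPS	(POLL_HZ * RING_SLOTS)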

Alan

2001-10-08 15:30:35

by Alan

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

> Another thing I said recently is that the hardirq airbag has nothing to
> do with softirqs, and that's right. Patches messing with the softirq logic
> as a function of the hardirq airbag are just totally broken, or at least
> confusing, because the two were incidentally merged together by mistake.

Agreed

2001-10-08 16:00:06

by jamal

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5



On Mon, 8 Oct 2001, Alan Cox wrote:

> > I hear you, but I think isolation is important;
> > if I am telnetted (literal example here) onto that machine (note eth0 is
> > not cardbus based) and the cardbus is causing the loops, then I am screwed.
> > [The same applies to everything that shares interrupts]
>
> Worst case it sucks, but it isn't dead.
>
> Once you disable the IRQ and kick over to polling, the cardbus and the
> ethernet both still get regular service. OK, so your pps rate and your
> latency are unpleasant, but you are not dead.
>

Agreed if you add the polling cardbus bit.
Note polling cardbus would require more changes than the above.
My concern was more the following: this is a temporary solution, a
quickie if you will. The proper solution is to have the isolation part. If
we push this in, doesn't it result in procrastination, a "we'll do it later"?
Why not do it properly, since this was never a show stopper to begin with?
[The show stopper was networking]

cheers,
jamal

2001-10-08 16:05:36

by Alan

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

> Agreed if you add the polling cardbus bit.
> Note polling cardbus would require more changes than the above.

I don't think it does. There are two pieces to the problem:

a) Not dying horribly
b) Handling it elegantly

b) is driver specific (NAPI etc.) and I think well understood, to the point
that it's already being used for performance reasons.

a) is as simple as

if (stuck_in_irq(foo) && irq_shared(foo))
{
	disable_real_irq(foo);
	timer_fake_irq_foo();
}

We know spoofing a shared irq is safe.
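
A minimal C sketch of what (a) could look like, assuming hypothetical
helpers (stuck_in_irq(), irq_shared(), disable_real_irq(),
run_irq_handlers()) that stand in for whatever the real patch provides;
this illustrates the idea, not code from Ingo's patch:

/*
 * Sketch only: take over a "stuck" shared IRQ line and drive its
 * handlers from a kernel timer instead of the (disabled) hardware line.
 * All of the *_irq() helpers below are hypothetical placeholders.
 */
#include <linux/timer.h>
#include <linux/sched.h>

static struct timer_list fake_irq_timer;

/* Timer callback: behave as if the hardware had raised the interrupt. */
static void timer_fake_irq(unsigned long irq)
{
	run_irq_handlers((int)irq);	/* hypothetical: walk the irqaction list */

	/* Re-arm: one poll per tick, i.e. HZ polls/sec (100 by default);
	 * Alan's 200Hz figure assumes a higher HZ or a finer timer. */
	fake_irq_timer.expires = jiffies + 1;
	add_timer(&fake_irq_timer);
}

/* Called when the flood detector decides a line is stuck. */
static void handle_stuck_irq(int irq)
{
	if (!(stuck_in_irq(irq) && irq_shared(irq)))	/* hypothetical tests */
		return;

	disable_real_irq(irq);		/* hypothetical: mask the line at the PIC */

	init_timer(&fake_irq_timer);
	fake_irq_timer.function = timer_fake_irq;
	fake_irq_timer.data = (unsigned long)irq;
	fake_irq_timer.expires = jiffies + 1;
	add_timer(&fake_irq_timer);
}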

Alan

2001-10-08 16:14:06

by jamal

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5



On Mon, 8 Oct 2001, Alan Cox wrote:

> > Agreed if you add the polling cardbus bit.
> > Note polling cardbus would require more changes than the above.
>
> I don't think it does.

I was responding to your earlier comment that:

> Once you disable the IRQ and kick over to polling the cardbus and the
> ethernet both still get regular service. Ok so your pps rate and your
> latency are unpleasant, but you are not dead.

basically pointing out that more work would be needed to get Ingo's patch
to poll the cardbus and eth0 in the example I gave.
Those changes will have to be per-driver. Did I miss something?
I agree on your other points there.

cheers,
jamal

2001-10-08 17:39:50

by Robert Olsson

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5


jamal writes:
>
> This was Robert actually; the conclusion was that interrupts are very
> expensive. If we can get rid of as many of them as possible, we get a side
> benefit. I can't find the old data, but Robert has some data over here:
> http://robur.slu.se/Linux/net-development/experiments/010301


Jamal!

I think you meant:
http://robur.slu.se/Linux/net-development/experiments/010313

That was an MB with a PIC irq controller; IO-APIC boards do a lot better.

Cheers.

--ro

2001-10-08 17:42:20

by jamal

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5



On Mon, 8 Oct 2001, Robert Olsson wrote:

> I think you meant:
> http://robur.slu.se/Linux/net-development/experiments/010313
>
> That was an MB with a PIC irq controller; IO-APIC boards do a lot better.
>

Oops, yes, I am sorry (only 12 days' difference ;->)

cheers,
jamal

2001-10-09 00:36:19

by Scott Laird

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5



On Mon, 8 Oct 2001, jamal wrote:
>
> Several things to note/observe:
> - They use some very specialized piece of hardware (with two PCI buses).

Huh? It was just an L440GX, which was probably the single most common PC
server board for a while in 1999-2000. Most of VA Linux's systems used
them. I wouldn't call them "very specialized."

> - Robert's results on single PCI bus hardware were showing ~360Kpps
> routing vs Click's 435Kpps. This is not "far off" given the differences in
> hardware. What would be really interesting is to have the Click folks
> post their latency results. I am curious as to what the purely polling
> scheme they have would achieve (as opposed to NAPI, which is a mixture of
> interrupts and polls).

Their 'TOCS00' paper lists a 29us one-way latency on page 22.

Click looks interesting, much more so than most academic network projects,
but I'm still not sure if it'd really be useful in most "real"
environments. It looks too flexible for most people to manage. It'd be
an interesting addition to my test lab, though :-).


Scott

2001-10-09 03:19:38

by jamal

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5



On Mon, 8 Oct 2001, Scott Laird wrote:

>
>
> On Mon, 8 Oct 2001, jamal wrote:
> >
> > Several things to note/observe:
> > - They use some very specialized piece of hardware (with two PCI buses).
>
> Huh? It was just an L440GX, which was probably the single most common PC
> server board for a while in 1999-2000. Most of VA Linux's systems used
> them. I wouldn't call them "very specialized."
>

OK, sorry, you are right: not very high end, but not exactly cheap even at
the time to have a motherboard with two PCI buses (I for one would have
been delighted to have had access to one, even today).
Nevertheless, impressive numbers still.
I could achieve an MLFFR of ~200Kpps on an el cheapo PII with 4-port ZNYX
cards on an ASUS that has a single PCI bus; and from what Donald Becker
was saying we could probably do better with 4 interface cards rather than
a single 4-port card, due to bus mastership issues.
I suppose that's why Robert can pull more packets on only two GigE NICs on
a single bus. He's more than likely hitting PCI bottlenecks at this point.
A second PCI bus with a second set of cards should help (dis)prove this
theory.

> > - Robert's results on single PCI bus hardware were showing ~360Kpps
> > routing vs Click's 435Kpps. This is not "far off" given the differences in
> > hardware. What would be really interesting is to have the Click folks
> > post their latency results. I am curious as to what the purely polling
> > scheme they have would achieve (as opposed to NAPI, which is a mixture of
> > interrupts and polls).
>
> Their 'TOCS00' paper lists a 29us one-way latency on page 22.
>

That's a very good number. I wonder what it means, though, and at what
rates those numbers were taken. For example, in some of the tests I ran on
the ZNYX card with only two ports generating traffic, you can observe a
rough latency of around 33us up to about the MLFFR, and then the latency
jumps sharply to hundreds of us. In fact, at 147Kpps input you observe
anywhere up to 800us, although we are clearly flat at the MLFFR throughput
on the output. These numbers might also be affected by the latency
measurement scheme used.

> Click looks interesting, much more so then most academic network projects,
> but I'm still not sure if it'd really be useful in most "real"

Agreed, although I think we need more research of the type that
Click is bringing ...

> environments. It looks too flexible for most people to manage. It'd be
> an interesting addition to my test lab, though :-).

Indeed.

cheers,
jamal

2001-10-09 04:04:08

by Werner Almesberger

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

jamal wrote:
> - Linux is already "very modular" as a router with both the traffic
> control framework and netfilter. I like their language specification etc;
> ours is a little more primitive in comparison.

I guess you're talking about iproute2/tc ;-) Things are better with tcng:
http://tcng.sourceforge.net/

Click covers more areas than just Traffic Control, though.

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Lausanne, CH [email protected] /
/_http://icawww.epfl.ch/almesberger/_____________________________________/

2001-10-13 19:35:52

by Pavel Machek

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

Hi!

> Even at 200Hz polling a typical cardbus card with say 32 ring buffer slots
> can process 6000pps.

On my Velo, I have PCMCIA but don't quite know how to drive it properly.
I have not figured out interrupts, so I ran the ne2000 in polling mode.
0.5MB/sec is not bad for hardware as slow as the Velo (see sig). The next
experiment was ATA flash. I had to bump HZ to 1000 for it, and I'm getting
spurious "unexpected interrupt" messages, but it works surprisingly well.
Pavel
--
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.

2001-10-13 19:36:20

by Pavel Machek

[permalink] [raw]
Subject: Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

Hi!

> > Agreed if you add the polling cardbus bit.
> > Note polling cardbus would require more changes than the above.
>
> I don't think it does. There are two pieces to the problem
>
> a) Not dying horribly
> b) Handling it elegantly
>
> b) is driver specific (NAPI etc) and I think well understood to the point
> its being used already for performance reasons
>
> a) is as simple as
>
> if(stuck_in_irq(foo) && irq_shared(foo))
> {
> disable_real_irq(foo);
> timer_fake_irq_foo();
> }

I'd kill the irq_shared() test and add a printk :-).
Pavel
--
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.