2006-05-02 07:41:57

by Brown, Len

[permalink] [raw]
Subject: RE: [(repost) git Patch 1/1] avoid IRQ0 ioapic pin collision

>IRQ is an interrupt request, from a device.
>Currently they can come in over a pin, or over a packet.
>IRQs at the source can be shared.

Well stated.

>For an IRQ number it is just an enumeration of interrupt sources.
>Ideally in a stable manner so that the assigned number does not
>depend on compile options, or which version of the kernel you
>are running.

While desirable, the "stable" part is not guaranteed.
On an ACPI system with PCI Interrupt Link devices,
request_irq() goes into the pci-enable-device path,
which ends up asking ACPI what interrupt pin this device uses.
But the pin is programmable, so we pick one from the list
of possible, program the router accordingly, and this is what
gets returned to request_irq(). ACPI PCI Interrupt Link Devices,
of course, are just a wrapper for what x86 legacy calls PIRQ routers.

In practice, if you boot the same or similar kernel on the
same configuration, you get 'stable', but if something changes
such as enabling/disabling a device or loading an additional device
driver
or otherwise tinkering with probe order, then all bets are off,
since the sequence of selecting and
balancing possible device interrupt lines between available IRQs has
changed.

>> Re: irq making sense
>> Depends on the defintion, I suppose. I'm still looking for one
>> that matches how Linux uses the term.
>
>The global irq_desc array is the core of any definition.
>That is how linux knows about irqs. Beyond that we ideally
>have stable numbers across reboots, and recompiles, and hardware
>addition and removal. Stable numbers are not really possible
>but we can come quite close.

Okay, I'm good with irq_desc, Linux's generic IRQ definition,
being the center of the universe. I'm also good with the index
into irq_desc[] being the "IRQ number" show in /proc/interrupts.
I guess I was thinking from a HW-specific point of view this
afternoon...

I don't like vector numbers being exposed in /proc/interrupts
on x86. On an ACPI-enabled system, I think that they should be
the GSI. This is consistent with the 0-15 legacy definitions
being pin numbers (which I think we must keep), and consistent
with the 16-n numbers being IOAPIC pin numbers (+offset),
(which I think we should have kept).

Of course, to do this simply, we'd have to select the irq_desc[i]
with i = gsi, and not i = vector.

>> We should never have had a problem with un-connected interrupt lines
>> consuming vectors, as the vectors are handed out at run-time
>> only when interrupts are requested by drivers.
>
>Incorrect. By being requested by drivers I assume you mean by
>request_irq. assign_irq_vector is called long before request_irq
>and in fact there is currently not a call from request_irq to
>assign_irq_vector. Look where assign_irq_vector is called in io_apic.c

You are right. This code is wrong.
It makes absolutely no sense to reserve vectors in advance
for every RTE in the IOAPIC when we don't even know if they
are going to be used.

This is clearly a holdover from the early IOAPIC/MPS days
when we were talking about 4 to 8 non-legacy RTEs.

This is where the big system vector shortage problem
should be addressed.

>I think there is a legitimate case for legacy edge triggered
>interrupts to request a vector sooner, as there are so many races in
>the enable/disable paths.

I don't follow you on this part.

>>>Problem 2.
>>> Some systems have more than 224 GSIs that are actually
>>> connected to devices.
>>> There are three possible ways to handle this case.
>>> - Fail after we run out of vectors.
>>> - Share a vector.
>>>
>>> - Allow vectors of different cpus to handle different irqs.
>>> The is the most elegant and scalable, and Natalie's suggestion.
>>
>> So here we allow the same vector to be used by different IOAPICS,
>> or IOAPIC pins, but have them it directed to different CPUs
>> who have per-cpu tables to vector to different devices?
>>
>> A practical workaround? Certainly.
>> An elegant solution? No, that would require better hardware;-)
>
>Agreed. But we lost elegant at the point of on demand assigning of
>vectors to irqs.
>
>> The problem with this workaround is going to be choose a policy
>> of where to direct what, and how to move things if interrupts
>> become un-balanced.
>
>The directing problem already exists. In general an I/O apic can
>only direct an irq at a specific cpu, and linux already supports
>cpusets on which an irq may be delivered.
>
>But yes on systems with 8 or fewer cpus the I/O apics can do the
>directing themselves and it is likely we still want to handle that
>case.

I don't think it is important to optimize for <= 8 CPUs
having hundreds of unique interrupts. Systems with huge
numbers of interrutps will have more than 8 CPUs. If they
don't, then the should function, but being optimal isn't
what the administrator had in mind. With multiple cores
being added to systems, this becomes even more true
over time.

>>>So what would be a path to get there from here?
>>>- Fix the irq set_affinity code so that it makes the changes to
>>> irqs when the interrupt is actually disabled.
>>> Calling desc->handler->disable, desc->handler->enable
>>> does not work immediately so the current code is racy.
>>> Which shows up very badly if you attempt to change the irq vector,
>>> and may cause rare problems today
>>>
>>>- Modify the MSI code to allocate irqs and not vectors.
>>> So we don't have two paths through the code, for no good reason.
>>
>> Modify the MSI code to allocate "something to do
>> with a specific interrupt". Hmmm, but it does this already:-)
>>
>> Seriously, aren't the interrupt vectors, or perhaps the
>> vector/cpu pair, the fundamental resource being allocated here?
>
>So there are two pieces. struct irq_desc that describes
>an interrupt source, and the vector that catches the interrupt
>source and tells the cpu about it.
>
>To be able to manage an irq source we need to enumerate and give it a
>struct irq_desc entry in the global irq_desc array. One of the
>operations we perform after that point is to allocate it a vector so
>we can catch that irq source. But other operations such as reporting,
>and the assignment of cpu affinity also happen based on the entry
>in the irq_desc array.
>
>Except for architecture implementations of irq routing the vectors
>that we catch irqs with should be completely invisible. The ideal
>architecture would not even have a separate concept and except for
>the msi code linux has not exported the concept of vectors outside
>of the architecture implementations.

Agreed.

>> what about per/irq kstats?
>> would you keep stats on a per-vector basis
>> and translate back to irqs for /proc/interrupts?
>
>No. What we have on a per irq basis is what makes sense
>and I would leave that untouched.

Agreed.

>> what about when the same vector is independently
>> used by multiple CPUS?
>
>Accounting by irq makes sense.

Agreed.

>>>- Remove the current irq compression.
>>
>> maybe this can be moved up in the to-do list?
>
>Not much.
>- The current set_affinity code has a real bug.
>- The MSI mismatch is worse and fixing it reduces testing later.
>- We need to raise the number of irqs we can support. Either
> by vastly increasing the number of stubs, whose number looks
> hard to control at compile time, or by turning this into
> a data problem.
>
>>>- Move irq vector allocation/deallocation into request_irq,
>>>and free_irq.
>>>
>>>- Make vector_irq per_cpu.
>>>
>>>- Modify assign_irq to allocate different vectors on different cpus,
>>> to fail if we cannot find a free irq vector somewhere in the
>>> irqs cpu set, and call assign_irq from set_affinity so
>when we change
>>> cpus we can allocate a different irq.
>>>
>>>I have proof of concept level patches for everything thing
>>>except making MSI actually allocate irqs, but that should be straight
>> forward.
>>
>> except that irqs should be allocating vectors,
>> the fundamental currency here, and MSI should
>> be allocating the same, no?
>
>Not quite. The MSI code should first allocate
>an irq_desc entry get an IRQ number.

You're right. I agree.

>Then when the driver calls request_irq the architecture specific
>code should arrange for __do_IRQ to be called with the appropriate
>irq number when the interrupt happens. The architecture specific
>code needs to do this by setting the MSI address and the MSI data to
>an architecture specific values, that on x86 will involve the a
>vector.
>

>It is reassuring to see that the way I want to make the code work tends
>to be they way you already think the code works. I can't be too far
>off in reducing the complexity.

Yeah, thanks for pointing out that the vectors are assigned at
IOAPIC init-time. I hadn't noticed that, and it needs to be addressed.

-Len


2006-05-02 07:47:07

by Andi Kleen

[permalink] [raw]
Subject: Re: [(repost) git Patch 1/1] avoid IRQ0 ioapic pin collision

On Tuesday 02 May 2006 09:41, Brown, Len wrote:

> You are right. This code is wrong.
> It makes absolutely no sense to reserve vectors in advance
> for every RTE in the IOAPIC when we don't even know if they
> are going to be used.
>
> This is clearly a holdover from the early IOAPIC/MPS days
> when we were talking about 4 to 8 non-legacy RTEs.

Yes I agree. A lot of the IO-APIC code could probably
need some renovation.

> This is where the big system vector shortage problem
> should be addressed.

If we go to per CPU IDTs it will be much less pressing, but
still a good idea.

-Andi

P.S.: There seems to be a lot of confusion about all this.
Maybe it would make sense to do a write up defining all the terms
and stick it into Documentation/* ?

2006-05-02 08:33:45

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [(repost) git Patch 1/1] avoid IRQ0 ioapic pin collision

"Brown, Len" <[email protected]> writes:

>>IRQ is an interrupt request, from a device.
>>Currently they can come in over a pin, or over a packet.
>>IRQs at the source can be shared.
>
> Well stated.
>
>>For an IRQ number it is just an enumeration of interrupt sources.
>>Ideally in a stable manner so that the assigned number does not
>>depend on compile options, or which version of the kernel you
>>are running.
>
> While desirable, the "stable" part is not guaranteed.
> On an ACPI system with PCI Interrupt Link devices,
> request_irq() goes into the pci-enable-device path,
> which ends up asking ACPI what interrupt pin this device uses.
> But the pin is programmable, so we pick one from the list
> of possible, program the router accordingly, and this is what
> gets returned to request_irq(). ACPI PCI Interrupt Link Devices,
> of course, are just a wrapper for what x86 legacy calls PIRQ routers.
>
> In practice, if you boot the same or similar kernel on the
> same configuration, you get 'stable', but if something changes
> such as enabling/disabling a device or loading an additional device
> driver
> or otherwise tinkering with probe order, then all bets are off,
> since the sequence of selecting and
> balancing possible device interrupt lines between available IRQs has
> changed.

Agreed in the general case there is nothing we can do. But
with a reasonable degree of stability things will stay put
long enough for people to reason about them.

Although I would actually expect us to use the normal twiddle of
irq lines on a bus. But if we can have less contention that is certainly
better.

>>The global irq_desc array is the core of any definition.
>>That is how linux knows about irqs. Beyond that we ideally
>>have stable numbers across reboots, and recompiles, and hardware
>>addition and removal. Stable numbers are not really possible
>>but we can come quite close.
>
> Okay, I'm good with irq_desc, Linux's generic IRQ definition,
> being the center of the universe. I'm also good with the index
> into irq_desc[] being the "IRQ number" show in /proc/interrupts.
> I guess I was thinking from a HW-specific point of view this
> afternoon...

I had to stare at the code for a while to see this as well.

> I don't like vector numbers being exposed in /proc/interrupts
> on x86. On an ACPI-enabled system, I think that they should be
> the GSI. This is consistent with the 0-15 legacy definitions
> being pin numbers (which I think we must keep), and consistent
> with the 16-n numbers being IOAPIC pin numbers (+offset),
> (which I think we should have kept).

Agreed.

> Of course, to do this simply, we'd have to select the irq_desc[i]
> with i = gsi, and not i = vector.

Exactly.

>>> We should never have had a problem with un-connected interrupt lines
>>> consuming vectors, as the vectors are handed out at run-time
>>> only when interrupts are requested by drivers.
>>
>>Incorrect. By being requested by drivers I assume you mean by
>>request_irq. assign_irq_vector is called long before request_irq
>>and in fact there is currently not a call from request_irq to
>>assign_irq_vector. Look where assign_irq_vector is called in io_apic.c
>
> You are right. This code is wrong.
> It makes absolutely no sense to reserve vectors in advance
> for every RTE in the IOAPIC when we don't even know if they
> are going to be used.
>
> This is clearly a holdover from the early IOAPIC/MPS days
> when we were talking about 4 to 8 non-legacy RTEs.
>
> This is where the big system vector shortage problem
> should be addressed.

Yes. Natalie mentioned that there are some IBM systems
that still run out of processor vectors with our current
irq collapsing.

>>I think there is a legitimate case for legacy edge triggered
>>interrupts to request a vector sooner, as there are so many races in
>>the enable/disable paths.
>
> I don't follow you on this part.

For edge triggered legacy irqs coming off the i8259 it probably makes
sense to assign them vectors early like we do currently.

I believe although I am not certain that with edge triggered irqs
actually disabling them makes it easy to miss edges.

>>
>>> The problem with this workaround is going to be choose a policy
>>> of where to direct what, and how to move things if interrupts
>>> become un-balanced.
>>
>>The directing problem already exists. In general an I/O apic can
>>only direct an irq at a specific cpu, and linux already supports
>>cpusets on which an irq may be delivered.
>>
>>But yes on systems with 8 or fewer cpus the I/O apics can do the
>>directing themselves and it is likely we still want to handle that
>>case.
>
> I don't think it is important to optimize for <= 8 CPUs
> having hundreds of unique interrupts. Systems with huge
> numbers of interrutps will have more than 8 CPUs. If they
> don't, then the should function, but being optimal isn't
> what the administrator had in mind. With multiple cores
> being added to systems, this becomes even more true
> over time.

Sorry. To be clear for systems with <= 8 CPUs we need to allocate
a common vector for all 8 cpus at once if we want to use
flat mode and this makes a version of assign_irq_vector that can work
assign per cpu vectors more complicated, since it has to handle both
cases. Hmm, should assign_irq_vector become a ioapic method?

I don't think we want to loose the optimization of letting the
hardware select select the cpu on small systems.

>>It is reassuring to see that the way I want to make the code work tends
>>to be they way you already think the code works. I can't be too far
>>off in reducing the complexity.
>
> Yeah, thanks for pointing out that the vectors are assigned at
> IOAPIC init-time. I hadn't noticed that, and it needs to be addressed.

Exactly. Once that is fixed setting gsi == irq should be trivial.
Then we only have to worry about big systems that actually use
more than 244 IRQs. Which is where the per cpu vector assignment
comes in.

Eric


2006-05-02 13:53:14

by Eric W. Biederman

[permalink] [raw]
Subject: [RFC][PATCH] Document what in IRQ is.

Andi Kleen <[email protected]> writes:

> P.S.: There seems to be a lot of confusion about all this.
> Maybe it would make sense to do a write up defining all the terms
> and stick it into Documentation/* ?

How does this look?

I am pretty horrible when it comes to Documentation,
but this seems to be the essence of what I was saying earlier.

Eric


diff --git a/Documentation/IRQ.txt b/Documentation/IRQ.txt
new file mode 100644
index 0000000..5340369
--- /dev/null
+++ b/Documentation/IRQ.txt
@@ -0,0 +1,22 @@
+What is an IRQ?
+
+An IRQ is an interrupt request from a device.
+Currently they can come in over a pin, or over a packet.
+IRQs at the source can be shared.
+
+An IRQ number is a kernel identifier used to talk about a hardware
+interrupt source. Typically this is an index into the global irq_desc
+array, but except for what linux/interrupt.h implements the details
+are architecture specific.
+
+An IRQ number is an enumeration of the possible interrupt sources on a
+machine. Typically what is enumerated is the number of input pins on
+all of the interrupt controller in the system. In the case of ISA
+what is enumerated are the 16 input pins to the pair of i8259
+interrupt controllers.
+
+Architectures can assign additional meaning to the IRQ numbers, and
+are encouraged to in the case where there is any manual configuration
+of the hardware involved. The ISA IRQ case on x86 where anyone who
+has been around a while can tell you how the first 16 IRQs map to the
+input pins on a pair of i8259s is the classic example.

2006-05-03 23:59:32

by Randy Dunlap

[permalink] [raw]
Subject: Re: [RFC][PATCH] Document what in IRQ is.

On Tue, 02 May 2006 07:52:22 -0600 Eric W. Biederman wrote:

> Andi Kleen <[email protected]> writes:
>
> > P.S.: There seems to be a lot of confusion about all this.
> > Maybe it would make sense to do a write up defining all the terms
> > and stick it into Documentation/* ?
>
> How does this look?
>
> I am pretty horrible when it comes to Documentation,
> but this seems to be the essence of what I was saying earlier.
>
> Eric
>
>
> diff --git a/Documentation/IRQ.txt b/Documentation/IRQ.txt
> new file mode 100644
> index 0000000..5340369
> --- /dev/null
> +++ b/Documentation/IRQ.txt
> @@ -0,0 +1,22 @@
> +What is an IRQ?
> +
> +An IRQ is an interrupt request from a device.
> +Currently they can come in over a pin, or over a packet.

No comma. Change packet to message?

> +IRQs at the source can be shared.

Huh? That simple sentence confuses me. Should "source" really be
"sink" or "destination"? Or maybe say "IRQs at an interrupt controller
can be shared." Or is that too hardware-specific?
Anyway, what source is meant here? It doesn't mean that IRQs
at the producer device can be shared, right? It's more at the
consumer device where they can be shared.

> +An IRQ number is a kernel identifier used to talk about a hardware
> +interrupt source. Typically this is an index into the global irq_desc
> +array, but except for what linux/interrupt.h implements the details
> +are architecture specific.
> +
> +An IRQ number is an enumeration of the possible interrupt sources on a
> +machine. Typically what is enumerated is the number of input pins on
> +all of the interrupt controller in the system. In the case of ISA
controllers
> +what is enumerated are the 16 input pins to the pair of i8259
is
> +interrupt controllers.
> +
> +Architectures can assign additional meaning to the IRQ numbers, and
> +are encouraged to in the case where there is any manual configuration
> +of the hardware involved. The ISA IRQ case on x86 where anyone who
> +has been around a while can tell you how the first 16 IRQs map to the
awhile
> +input pins on a pair of i8259s is the classic example.


---
~Randy

2006-05-04 02:49:20

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][PATCH] Document what in IRQ is.

"Randy.Dunlap" <[email protected]> writes:

> On Tue, 02 May 2006 07:52:22 -0600 Eric W. Biederman wrote:
>
>> Andi Kleen <[email protected]> writes:
>>
>> > P.S.: There seems to be a lot of confusion about all this.
>> > Maybe it would make sense to do a write up defining all the terms
>> > and stick it into Documentation/* ?
>>
>> How does this look?
>>
>> I am pretty horrible when it comes to Documentation,
>> but this seems to be the essence of what I was saying earlier.
>>
>> Eric
>>
>>
>> diff --git a/Documentation/IRQ.txt b/Documentation/IRQ.txt
>> new file mode 100644
>> index 0000000..5340369
>> --- /dev/null
>> +++ b/Documentation/IRQ.txt
>> @@ -0,0 +1,22 @@
>> +What is an IRQ?
>> +
>> +An IRQ is an interrupt request from a device.
>> +Currently they can come in over a pin, or over a packet.
>
> No comma. Change packet to message?

Sounds good.

>> +IRQs at the source can be shared.
>
> Huh? That simple sentence confuses me. Should "source" really be
> "sink" or "destination"? Or maybe say "IRQs at an interrupt controller
> can be shared." Or is that too hardware-specific?
> Anyway, what source is meant here? It doesn't mean that IRQs
> at the producer device can be shared, right? It's more at the
> consumer device where they can be shared.

By source I was thinking at the irq controller pin.

Interrupts are usually thrown from interrupt controllers to
something in the chipset that interrupts the cpu, giving the
cpu a token (ie an interrupt vector) that uniquely identifies
which interrupt source threw the interrupt.

Linux does not have generic infrastructure to allow two interrupt
sources to share the same token passed to the kernel.

In addition there are good reasons on some systems to change the
token dynamically, (say to point the IRQ at a different CPU). So
no generic code in the code should know about the token the cpu
receives. The fact that msi.c actually knows about that token
today makes is inflexible and maintenance problem.

I guess I need to figure out how to work this additional information
into my documentation then.

>> +An IRQ number is a kernel identifier used to talk about a hardware
>> +interrupt source. Typically this is an index into the global irq_desc
>> +array, but except for what linux/interrupt.h implements the details
>> +are architecture specific.
>> +
>> +An IRQ number is an enumeration of the possible interrupt sources on a
>> +machine. Typically what is enumerated is the number of input pins on
>> +all of the interrupt controller in the system. In the case of ISA
> controllers
>> +what is enumerated are the 16 input pins to the pair of i8259
> is
>> +interrupt controllers.
>> +
>> +Architectures can assign additional meaning to the IRQ numbers, and
>> +are encouraged to in the case where there is any manual configuration
>> +of the hardware involved. The ISA IRQ case on x86 where anyone who
>> +has been around a while can tell you how the first 16 IRQs map to the
> awhile
>> +input pins on a pair of i8259s is the classic example.

Eric

2006-05-04 05:42:35

by Brown, Len

[permalink] [raw]
Subject: RE: [RFC][PATCH] Document what in IRQ is.


>Linux does not have generic infrastructure to allow two interrupt
>sources to share the same token passed to the kernel.

On i386 and x86_64 io_apic.c, see irq_pin_list
This advertises to support multiple pins per IRQ.

I suspect this code never runs, and simply adds
complexity to code that has no shortage of complexity.

If somebody can explain to me what a "shared ISA-space IRQ"
is supposed to be, I'm all ears. I've had a BUG() in this
code for a while waiting for it to be used, and never seen it fire.

If we can get rid of that concept, then we have a 1:1 mapping
between irqs and apic:pin. Possibly this simplification would
be helpful as we re-think how the mapping from cpu:vector -> irq works.

-Len