2007-08-06 15:18:41

by Martin Wilck

[permalink] [raw]
Subject: PATCH/RFC: [kdump] fix APIC shutdown sequence

PATCH/RFC: [kdump] fix APIC shutdown sequence

This patch fixes a problem that we have encountered
with kdump under high I/O load on some machines.
The machines showing the errors have an Intel ICH7
chip set with a 6702PXH PCI Express-to-PCI Bridge
(8086:032c) containing an IO-APIC.

The bug symptom is that certain controllers connected
to the 6702PXH bridge wouldn't receive any IRQs in the
kdump kernel. In the error case (which is about 20% of
all cases) the IRR bit of the IO-APIC pin for that
controller is always set after the start of the kdump
kernel, indicating an IRQ in progress. We haven't found
a way to recover from this situation when it has once
occured, except for a system reset.

The error is caused by IRQs arriving while the APIC
subsystem is deactivated in machine_crash_shutdown().

Apparently, the IO-APIC gets stuck if it sends an IRQ
message to a Local APIC and never receives an EOI for that
message. This can have several possible reasons:

1. If, under SMP, the IO-APIC logical destination field is
set by the IRQ balancing code to one of the "other"
CPUs (i.e. not the crashing_cpu), and an IRQ arrives
on the respective pin after that CPU has shut down
its local APIC (but before the IO-APIC pin is masked)
the IRQ message can't be delivered.

2. The crashing CPU itself disables its local APIC
before the IO-APIC, leaving a short time window
where the IOAPIC can receive IRQs, but not
deliver them.

3. An IRQ is received and delivered to a local APIC, but
no CPU ever executes the IRQ handler and therefore no
EOI is sent.

After a lot of failed attempts, i have come up with the
following patch, which fixes the problem.

The patch first masks all IO-Apic pins to avoid a sitation
where the IO-Apic can receive, but not deliver, the IRQs.
Moreover, it enables interrupts for a short period before
eventually starting the kdump kernel, so that EOIs can be
sent to the APICs as necessary.

Notes:
a) Simply calling disable_IO_APIC() early doesn't
work, probably because that also clears the IRQ vector
information, so that arriving EOI messages can't be
associated with pins by the IO-APIC.
b) We have tried patches that avoid re-enabling interrupts,
but so far without success. Re-enabling IRQs is of course
dangerous while dumping, and I'd rather find a way to avoid it.
c) There are indications that besides the EOI, it's also
necessary that the PCI IRQ pin is deasserted at least for
a short time. That usually requires that the driver IRQ
handler is called and tells the FW that the IRQ was received.
Whether or not this is a requirement hasn't been finally
clarified yet.
d) The problem is only seen with the IO-APIC in the 6702PXH
PCI bridge, which is the system's secondary IO-APIC. On the
system's main IO-APIC, we see other IRQs (timer etc) arrive
and never get an EOI, but we see no errors.

The patch below is against 2.6.23-rc1. The problem was
originally analyzed and the patch developed against the
Red Hat EL5 kernel (2.6.18-8.el5). I verified that the
problem still occurs with 2.6.23-rc1, and that the patch
below fixes the problem.

Regards
Martin

PS: patch attached ain MIME format because it'd be mangled
quoted-printable by my Mail relay.

--
Martin Wilck
PRIMERGY System Software Engineer
FSC IP ESP DE6

Fujitsu Siemens Computers GmbH
Heinz-Nixdorf-Ring 1
33106 Paderborn
Germany

Tel: ++49 5251 8 15113
Fax: ++49 5251 8 20409
Email: mailto:[email protected]
Internet: http://www.fujitsu-siemens.com
Company Details: http://www.fujitsu-siemens.com/imprint.html


Attachments:
kdump.diff (8.33 kB)

2007-08-07 14:30:10

by Vivek Goyal

[permalink] [raw]
Subject: Re: PATCH/RFC: [kdump] fix APIC shutdown sequence

On Mon, Aug 06, 2007 at 05:08:05PM +0200, Martin Wilck wrote:
> PATCH/RFC: [kdump] fix APIC shutdown sequence
>
> This patch fixes a problem that we have encountered
> with kdump under high I/O load on some machines.
> The machines showing the errors have an Intel ICH7
> chip set with a 6702PXH PCI Express-to-PCI Bridge
> (8086:032c) containing an IO-APIC.
>

I quickly went through the problem description and the
patch. I think currently problem is not fully understood
and we are trying to put a patch. I think we need to
do little more study of the problem and then think of
a solution.

> The bug symptom is that certain controllers connected
> to the 6702PXH bridge wouldn't receive any IRQs in the
> kdump kernel. In the error case (which is about 20% of
> all cases) the IRR bit of the IO-APIC pin for that
> controller is always set after the start of the kdump
> kernel, indicating an IRQ in progress. We haven't found
> a way to recover from this situation when it has once
> occured, except for a system reset.
>
> The error is caused by IRQs arriving while the APIC
> subsystem is deactivated in machine_crash_shutdown().
>
> Apparently, the IO-APIC gets stuck if it sends an IRQ
> message to a Local APIC and never receives an EOI for that
> message. This can have several possible reasons:
>

We need to zoom onto one precise reason to solve the issue
Speculation will not help.

> 1. If, under SMP, the IO-APIC logical destination field is
> set by the IRQ balancing code to one of the "other"
> CPUs (i.e. not the crashing_cpu), and an IRQ arrives
> on the respective pin after that CPU has shut down
> its local APIC (but before the IO-APIC pin is masked)
> the IRQ message can't be delivered.

Point 1 and Point 2 seems to be same.

>
> 2. The crashing CPU itself disables its local APIC
> before the IO-APIC, leaving a short time window
> where the IOAPIC can receive IRQs, but not
> deliver them.
>

I doubut that it would be the issue. Looking at intel IOAPIC (82093AA)
documentation, it says that IRR bit of IOAPIC will be set only if
destination CPU has accepted the interrupt. So if we have disabled
the LAPIC, it will not accept the interrupt and IRR bit of IOAPIC
should not be set.

> 3. An IRQ is received and delivered to a local APIC, but
> no CPU ever executes the IRQ handler and therefore no
> EOI is sent.
>

We do issue EOI for all the pending interrupts in second
kernel. Look at setup_local_APIC(). Once the second is booting, it
checks if there are any pending interrupts (ISR bit is set). If yes,
it goes ahead and issues an extra EOI. This should also clear the
IRR register of IOAPIC.


> After a lot of failed attempts, i have come up with the
> following patch, which fixes the problem.
>
> The patch first masks all IO-Apic pins to avoid a sitation
> where the IO-Apic can receive, but not deliver, the IRQs.
> Moreover, it enables interrupts for a short period before
> eventually starting the kdump kernel, so that EOIs can be
> sent to the APICs as necessary.
>
> Notes:
> a) Simply calling disable_IO_APIC() early doesn't
> work, probably because that also clears the IRQ vector
> information, so that arriving EOI messages can't be
> associated with pins by the IO-APIC.

disable_IO_APIC() code does not clear the vector information
in routing table. It just masks the interrupt. So even if
an EOI is issued later in second kernel, it should clear the
IRR bit at IOAPIC.

> b) We have tried patches that avoid re-enabling interrupts,
> but so far without success. Re-enabling IRQs is of course
> dangerous while dumping, and I'd rather find a way to avoid it.
> c) There are indications that besides the EOI, it's also
> necessary that the PCI IRQ pin is deasserted at least for
> a short time. That usually requires that the driver IRQ
> handler is called and tells the FW that the IRQ was received.
> Whether or not this is a requirement hasn't been finally
> clarified yet.

I doubt this. There are situations when there is no device
driver for the device and device pushes the interrupt (frequently
observed in the case of kdump). Kernel still keeps on receiving
the interrupt without driver telling device to lower the interrupt
line.

> d) The problem is only seen with the IO-APIC in the 6702PXH
> PCI bridge, which is the system's secondary IO-APIC. On the
> system's main IO-APIC, we see other IRQs (timer etc) arrive
> and never get an EOI, but we see no errors.
>
> The patch below is against 2.6.23-rc1. The problem was
> originally analyzed and the patch developed against the
> Red Hat EL5 kernel (2.6.18-8.el5). I verified that the
> problem still occurs with 2.6.23-rc1, and that the patch
> below fixes the problem.
>

I can imagine one possibility. There might be pending interrupts
on a non-crashing cpu. When second kernel boots, we initialize only
one cpu and issue EOI for pending interrupts only on that CPU. So
if an interrupt is pending on other CPU, then IRR bit for that interrupt
on IOAPIC will remain set and one would not get further interrupts from
that device.

- Can you please see if you can reproduce same problem with a
single processor (maxcpus=1)

- Can you please print local apic (print_local_APIC) and
ioapic registers (print_IO_APIC) and verify above theory?

Thanks
Vivek

2007-08-07 17:41:45

by Martin Wilck

[permalink] [raw]
Subject: Re: PATCH/RFC: [kdump] fix APIC shutdown sequence

Hello Vivek,

thank you very much for looking at this problem, and for your
comments.

>> The error is caused by IRQs arriving while the APIC
>> subsystem is deactivated in machine_crash_shutdown().
>>
>> Apparently, the IO-APIC gets stuck if it sends an IRQ
>> message to a Local APIC and never receives an EOI for that
>> message. This can have several possible reasons:
>>
> We need to zoom onto one precise reason to solve the issue
> Speculation will not help.

We have made a number of logical analyzer traces. In all cases
where the error occurred, there was a INT message from the IO-APIC
that never received an EOI. These INT messages always occurred shortly
before or during machine_crash_shutdown(); unfortunately it is
impossible from analyzer traces to tell exactly when
machine_crash_shutdown() was entered.

Such a situation has never been observed in the "good" case.
So, we do have some evidence, not just bare speculation.

>> 2. The crashing CPU itself disables its local APIC
>> before the IO-APIC, leaving a short time window
>> where the IOAPIC can receive IRQs, but not
>> deliver them.
>>
>
> I doubut that it would be the issue. Looking at intel IOAPIC (82093AA)
> documentation, it says that IRR bit of IOAPIC will be set only if
> destination CPU has accepted the interrupt. So if we have disabled
> the LAPIC, it will not accept the interrupt and IRR bit of IOAPIC
> should not be set.

Can you explain how, on the front side bus, the IO-APIC knows whether
a CPU has accepted the INT message? There is no response
to the INT message on the bus, except for the EOI which comes much later.
I'm not saying that you're wrong, I just really don't understand this
point.

In the logical analyzer, we can't see when exactly the local APICs are
disabled. But we see that IRQs arriving after the IO APIC pin is
masked never do any harm, while IRQs arriving "during the shutdown
sequence" (we can see e.g. the 2nd CPU taking the bus after the NMI
IPI) cause the error situation.

>> 3. An IRQ is received and delivered to a local APIC, but
>> no CPU ever executes the IRQ handler and therefore no
>> EOI is sent.
>>
>
> We do issue EOI for all the pending interrupts in second
> kernel. Look at setup_local_APIC(). Once the second is booting, it
> checks if there are any pending interrupts (ISR bit is set). If yes,
> it goes ahead and issues an extra EOI. This should also clear the
> IRR register of IOAPIC.

In an earlier patch, I tried to add that same code in
machine_crash_shutdown() and crash_nmi_callback(), in order
to send EOIs for pending IRQs on all CPUs. Unfortunately,
that had no effect.

> disable_IO_APIC() code does not clear the vector information
> in routing table. It just masks the interrupt. So even if
> an EOI is issued later in second kernel, it should clear the
> IRR bit at IOAPIC.

Hmm... ioapic_mask_entry() writes
"union entry_union eu = { .entry.mask = 1 }" to the LVT register.
That clears all bits except the mask bit, so that the vector information
is lost. Please correct me if I'm mistaken.

>> c) There are indications that besides the EOI, it's also
>> necessary that the PCI IRQ pin is deasserted at least for
>> a short time.

> I doubt this. There are situations when there is no device
> driver for the device and device pushes the interrupt (frequently
> observed in the case of kdump). Kernel still keeps on receiving
> the interrupt without driver telling device to lower the interrupt
> line.

So far I haven't come up with a patch that just sends EOI without
actually calling any HW IRQ handler. That would clarify this question.
It's on my todo list.

> I can imagine one possibility. There might be pending interrupts
> on a non-crashing cpu. When second kernel boots, we initialize only
> one cpu and issue EOI for pending interrupts only on that CPU. So
> if an interrupt is pending on other CPU, then IRR bit for that interrupt
> on IOAPIC will remain set and one would not get further interrupts from
> that device.

> - Can you please see if you can reproduce same problem with a
> single processor (maxcpus=1)

That has been done in the past. The error occurs, too, although not
quite as often.

> - Can you please print local apic (print_local_APIC) and
> ioapic registers (print_IO_APIC) and verify above theory?

We always see the IO-APIC IRR bit in the error situation, before and
after the start of the kdump kernel.

*Before* the kdump kernel starts (more precisely: before the call
to disable_IO_APIC()), the IO-APIC "delivery status" bit is also set.

I checked local APIC ISR and IRR bits in an earlier version of my patch
(see above). They were sometimes set, and sometimes not (unlike the IO-APIC
IRR/Delivery Status which behave always the same).

The patch was very different at that time, e.g. it didn't call the
crash_mask_IO_APIC() function yet. So perhaps, it's worth a try to
modify it and just send EOI on all CPUs instead of enabling interrupts.
I'll put it on the toto list, too.

Regards and thanks
Martin

--
Martin Wilck
PRIMERGY System Software Engineer
FSC IP ESP DE6

Fujitsu Siemens Computers GmbH
Heinz-Nixdorf-Ring 1
33106 Paderborn
Germany

Tel: ++49 5251 8 15113
Fax: ++49 5251 8 20409
Email: mailto:[email protected]
Internet: http://www.fujitsu-siemens.com
Company Details: http://www.fujitsu-siemens.com/imprint.html

2007-08-07 19:45:09

by Chip Coldwell

[permalink] [raw]
Subject: Re: PATCH/RFC: [kdump] fix APIC shutdown sequence

On Tue, 7 Aug 2007, Vivek Goyal wrote:

> On Mon, Aug 06, 2007 at 05:08:05PM +0200, Martin Wilck wrote:
>
> > 1. If, under SMP, the IO-APIC logical destination field is
> > set by the IRQ balancing code to one of the "other"
> > CPUs (i.e. not the crashing_cpu), and an IRQ arrives
> > on the respective pin after that CPU has shut down
> > its local APIC (but before the IO-APIC pin is masked)
> > the IRQ message can't be delivered.
>
> Point 1 and Point 2 seems to be same.
>
> >
> > 2. The crashing CPU itself disables its local APIC
> > before the IO-APIC, leaving a short time window
> > where the IOAPIC can receive IRQs, but not
> > deliver them.
> >
>
> I doubut that it would be the issue. Looking at intel IOAPIC (82093AA)
> documentation, it says that IRR bit of IOAPIC will be set only if
> destination CPU has accepted the interrupt.

I think you mean "destination Local APIC has accepted the interrupt"
above. The Intel documentation cited above contains this text on page
12:

For level triggered interrupts, this bit is set to 1 when local
APIC(s) accept the level interrupt sent by the IOAPIC. The Remote
IRR bit is set to 0 when an EOI message with a matching interrupt
vector is received from a local APIC.

The following text is from the IA-64 documentation ...

Any interrupt that is received by the processor is kept pending
and recorded in the Interrupt Request Register (IRR). If the
processor is not servicing an interrupt, then the contents of the
TPR determines whether the processor will accept a pending
interrupt depending on the priority of the interrupt compared to
the current TPR. If the interrupt has a higher priority, then the
processor is interrupted, otherwise the interrupt is kept pending.

So, I think if the CPU has interrupts disabled, but the Local APIC
does not, the IRR could get set. I guess we need to be sure to turn
off the Local APIC first before disabling interrupts in the CPU.

Chip

--
Charles M. "Chip" Coldwell
Senior Software Engineer
Red Hat, Inc
978-392-2426

2007-08-08 00:30:19

by Andrew Morton

[permalink] [raw]
Subject: Re: PATCH/RFC: [kdump] fix APIC shutdown sequence

On Mon, 06 Aug 2007 17:08:05 +0200
Martin Wilck <[email protected]> wrote:

> PATCH/RFC: [kdump] fix APIC shutdown sequence
>
> This patch fixes a problem that we have encountered
> with kdump under high I/O load on some machines.
> The machines showing the errors have an Intel ICH7
> chip set with a 6702PXH PCI Express-to-PCI Bridge
> (8086:032c) containing an IO-APIC.
>
> The bug symptom is that certain controllers connected
> to the 6702PXH bridge wouldn't receive any IRQs in the
> kdump kernel. In the error case (which is about 20% of
> all cases) the IRR bit of the IO-APIC pin for that
> controller is always set after the start of the kdump
> kernel, indicating an IRQ in progress. We haven't found
> a way to recover from this situation when it has once
> occured, except for a system reset.
>
> The error is caused by IRQs arriving while the APIC
> subsystem is deactivated in machine_crash_shutdown().
>
> Apparently, the IO-APIC gets stuck if it sends an IRQ
> message to a Local APIC and never receives an EOI for that
> message. This can have several possible reasons:
>
> 1. If, under SMP, the IO-APIC logical destination field is
> set by the IRQ balancing code to one of the "other"
> CPUs (i.e. not the crashing_cpu), and an IRQ arrives
> on the respective pin after that CPU has shut down
> its local APIC (but before the IO-APIC pin is masked)
> the IRQ message can't be delivered.
>
> 2. The crashing CPU itself disables its local APIC
> before the IO-APIC, leaving a short time window
> where the IOAPIC can receive IRQs, but not
> deliver them.
>
> 3. An IRQ is received and delivered to a local APIC, but
> no CPU ever executes the IRQ handler and therefore no
> EOI is sent.
>
> After a lot of failed attempts, i have come up with the
> following patch, which fixes the problem.
>
> The patch first masks all IO-Apic pins to avoid a sitation
> where the IO-Apic can receive, but not deliver, the IRQs.
> Moreover, it enables interrupts for a short period before
> eventually starting the kdump kernel, so that EOIs can be
> sent to the APICs as necessary.
>
> Notes:
> a) Simply calling disable_IO_APIC() early doesn't
> work, probably because that also clears the IRQ vector
> information, so that arriving EOI messages can't be
> associated with pins by the IO-APIC.
> b) We have tried patches that avoid re-enabling interrupts,
> but so far without success. Re-enabling IRQs is of course
> dangerous while dumping, and I'd rather find a way to avoid it.
> c) There are indications that besides the EOI, it's also
> necessary that the PCI IRQ pin is deasserted at least for
> a short time. That usually requires that the driver IRQ
> handler is called and tells the FW that the IRQ was received.
> Whether or not this is a requirement hasn't been finally
> clarified yet.
> d) The problem is only seen with the IO-APIC in the 6702PXH
> PCI bridge, which is the system's secondary IO-APIC. On the
> system's main IO-APIC, we see other IRQs (timer etc) arrive
> and never get an EOI, but we see no errors.
>
> The patch below is against 2.6.23-rc1. The problem was
> originally analyzed and the patch developed against the
> Red Hat EL5 kernel (2.6.18-8.el5). I verified that the
> problem still occurs with 2.6.23-rc1, and that the patch
> below fixes the problem.
>

Some little things..

Please feed the diff through scripts/checkpatch.pl. It finds a lot of issues.

For some reason you added a lot of lines which start with space-tab rather
than tab. checkpatch does detect this.


> diff -puN arch/x86_64/kernel/crash.c~kdump-fix-apic-shutdown-sequence arch/x86_64/kernel/crash.c
> --- a/arch/x86_64/kernel/crash.c~kdump-fix-apic-shutdown-sequence
> +++ a/arch/x86_64/kernel/crash.c
> @@ -18,6 +18,7 @@
> #include <linux/elf.h>
> #include <linux/elfcore.h>
> #include <linux/kdebug.h>
> +#include <linux/interrupt.h>
>
> #include <asm/processor.h>
> #include <asm/hardirq.h>
> @@ -30,6 +31,11 @@ static int crashing_cpu;
>
> #ifdef CONFIG_SMP
> static atomic_t waiting_for_crash_ipi;
> +static atomic_t crash_ipi_stage2 = ATOMIC_INIT(0);
> +
> +extern void remove_siblinginfo(int);
> +extern void remove_cpu_from_maps(void);
> +extern void crash_mask_IO_APIC(int);

please neverevereverever put extern declarations in C files. Find a
suitable header for it, ensure that the header is included by the
definition site and by all callers.

> static int crash_nmi_callback(struct notifier_block *self,
> unsigned long val, void *data)
> @@ -53,13 +59,30 @@ static int crash_nmi_callback(struct not
> local_irq_disable();
>
> crash_save_cpu(regs, cpu);
> - disable_local_APIC();
> - atomic_dec(&waiting_for_crash_ipi);
> + disable_APIC_timer();
> + atomic_dec(&waiting_for_crash_ipi);
> +
> + while(atomic_read(&crash_ipi_stage2) == 0)

Need a space here. Everything I've mentioned thus far should have been
picked up by checkpatch, so I'll shut up now.

> +
> +static void nmi_shootdown_cpus_stage2(void)
> +{
> +
> + unsigned long msecs;
> + atomic_set(&crash_ipi_stage2, num_online_cpus());
> +
> msecs = 1000; /* Wait at most a second for the other cpus to stop */
> - while ((atomic_read(&waiting_for_crash_ipi) > 0) && msecs) {
> + while ((atomic_read(&crash_ipi_stage2) > 1) && msecs) {
> mdelay(1);
> msecs--;
> }
> - /* Leave the nmi callback set */
> - disable_local_APIC();
> }

This is tricky code which you're adding. Empirically-based stuff derived
from observation and experimentation. It is quite impossible for readers
of this code to work out what the code is doing and why it is doing it
purely by reading the C statements.

Hence for any form of maintainability you really do need to spell
*everything* out in code comments. All those little discoveries and
observations. Otherwise it'll be lost, unless the fraught reader decides
to go hunting in the git repository.

> #else
> static void nmi_shootdown_cpus(void)
> {
> /* There are no cpus to shootdown */
> }
> +static void nmi_shootdown_cpus_stage2(void)
> +{
> +}
> #endif

Although the compiler will inline this function for you, we'd
conventionally declare it static inline. We might as well do this here as
well, and fix up nmi_shootdown_cpus() too.

> diff -puN arch/x86_64/kernel/io_apic.c~kdump-fix-apic-shutdown-sequence arch/x86_64/kernel/io_apic.c
> --- a/arch/x86_64/kernel/io_apic.c~kdump-fix-apic-shutdown-sequence
> +++ a/arch/x86_64/kernel/io_apic.c
> @@ -371,6 +371,50 @@ static void unmask_IO_APIC_irq (unsigned
> spin_unlock_irqrestore(&ioapic_lock, flags);
> }
>
> +static void crash_mask_IO_APIC_pin(unsigned int apic, unsigned int pin, int reg1)
> +{
> + struct IO_APIC_route_entry entry;
> + unsigned long flags;
> +
> + /* Check delivery_mode to be sure we're not clearing an SMI pin
> + * Don't bother disabling masked pins
> + */
> + spin_lock_irqsave(&ioapic_lock, flags);
> + *(((int*)&entry) + 0) = io_apic_read(apic, 0x10 + 2 * pin);
> + spin_unlock_irqrestore(&ioapic_lock, flags);
> + if (entry.delivery_mode == dest_SMI || entry.mask)
> + return;
> +
> + entry.mask = 1;
> + *(((int*)&entry) + 1) = reg1;
> +
> + spin_lock_irqsave(&ioapic_lock, flags);
> + io_apic_write(apic, 0x10 + 2 * pin, *(((int *)&entry) + 0));
> + io_apic_write(apic, 0x11 + 2 * pin, *(((int *)&entry) + 1));
> + io_apic_sync(apic);
> + spin_unlock_irqrestore(&ioapic_lock, flags);
> +}
> +
> +/*
> + * This function is used for kdump to mask all IO-APIC IRQs, and reset
> + * the destination apic ID.
> + */
> +void crash_mask_IO_APIC(int cpu)
> +{
> + int apic, pin;
> + int reg1;
> + cpumask_t mask;
> +
> + cpus_clear(mask);
> + cpu_set(cpu, mask);
> + reg1 = SET_APIC_LOGICAL_ID(cpu_mask_to_apicid(mask));
> +
> + for (apic = 0; apic < nr_ioapics; apic++) {
> + for (pin = 0; pin < nr_ioapic_registers[apic]; pin++)
> + crash_mask_IO_APIC_pin(apic, pin, reg1);
> + }
> +}

The above two functions are unneeded if CONFIG_CRASH_DUMP=n, hence they
should have the appropriate ifdef wrapping.


2007-08-08 01:08:24

by Eric W. Biederman

[permalink] [raw]
Subject: Re: PATCH/RFC: [kdump] fix APIC shutdown sequence


A couple of questions.

How bad is it if you just run with irqpoll in the kdump kernel?
If running with irqpoll is usable that is probably preferable
to putting in a hardware work around we can survive without.


Have you done any looking at moving where the kernel initalizes
io_apics? One of the todo items on the path is to leave
io_apic mode enabled and just startup the kernel in io_apic
mode.

When doing that a version of your work around (if needed)
would certainly be reasonable.

My general concern is that the trickier we get in the code
running in the crashed kernel the less reliable the entire
process will be.

Eric

2007-08-08 08:33:08

by Martin Wilck

[permalink] [raw]
Subject: Re: PATCH/RFC: [kdump] fix APIC shutdown sequence

Hello Andrew,

thanks a lot for looking at this patch.

> Please feed the diff through scripts/checkpatch.pl. It finds a lot of issues.

I should have read the latest version of SubmittingPatches :-(
I didn't expect you to pick up this patch so quickly, if I did
I'd have cleaned it more thoroughly. I apologize.

I expect this patch to go through a few more iterations in discussions
with Vivek and other people on this list.

> This is tricky code which you're adding. Empirically-based stuff derived
> from observation and experimentation. It is quite impossible for readers
> of this code to work out what the code is doing and why it is doing it
> purely by reading the C statements.

OK. I thought this might make the patch itself too long, that's why I put
it into the header.

Thanks, Martin

--
Martin Wilck
PRIMERGY System Software Engineer
FSC IP ESP DE6

Fujitsu Siemens Computers GmbH
Heinz-Nixdorf-Ring 1
33106 Paderborn
Germany

Tel: ++49 5251 8 15113
Fax: ++49 5251 8 20409
Email: mailto:[email protected]
Internet: http://www.fujitsu-siemens.com
Company Details: http://www.fujitsu-siemens.com/imprint.html

2007-08-08 09:03:36

by Martin Wilck

[permalink] [raw]
Subject: Re: PATCH/RFC: [kdump] fix APIC shutdown sequence

Hello Eric,

> How bad is it if you just run with irqpoll in the kdump kernel?
> If running with irqpoll is usable that is probably preferable
> to putting in a hardware work around we can survive without.

Yes, I tried that. No effect.

> Have you done any looking at moving where the kernel initalizes
> io_apics? One of the todo items on the path is to leave
> io_apic mode enabled and just startup the kernel in io_apic
> mode.

I have tried to recover from the "IRR set" situation in several ways by
changing setup_IO_APIC_irq(). But I haven't found a way to recover from
this situation once disable_IO_APIC() had been called.
I concluded thatthe sequence of events
"send INT message - never receive EOI - disable IO-APIC pin"
messes up the IO-APIC (at least this specific one in the
PCIEx-PCI bridge of the ICH7).

The current state of affairs is that the only thing that "worked" was to
re-enable irqs.

> When doing that a version of your work around (if needed)
> would certainly be reasonable.

> My general concern is that the trickier we get in the code
> running in the crashed kernel the less reliable the entire
> process will be.

That's certainly correct. I am grateful for all hints that simplify my patch.

Thanks,
Martin

--
Martin Wilck
PRIMERGY System Software Engineer
FSC IP ESP DE6

Fujitsu Siemens Computers GmbH
Heinz-Nixdorf-Ring 1
33106 Paderborn
Germany

Tel: ++49 5251 8 15113
Fax: ++49 5251 8 20409
Email: mailto:[email protected]
Internet: http://www.fujitsu-siemens.com
Company Details: http://www.fujitsu-siemens.com/imprint.html

2007-08-08 09:33:39

by Vivek Goyal

[permalink] [raw]
Subject: Re: PATCH/RFC: [kdump] fix APIC shutdown sequence

On Wed, Aug 08, 2007 at 11:03:17AM +0200, Martin Wilck wrote:
> Hello Eric,
>
> > How bad is it if you just run with irqpoll in the kdump kernel?
> > If running with irqpoll is usable that is probably preferable
> > to putting in a hardware work around we can survive without.
>
> Yes, I tried that. No effect.
>

Martin, at least irpoll should have worked. I am assuming your timer
interrupts are coming in second kernel. In that case we are not
dependent at all on actually receiving device interrupt. Polling should
take care of it.

What is that device which is not working? What is the success criterion?
I mean, did you check that the irq handler of your driver is now invoked
with irqpoll and now it can drive the device without worrying about
LAPIC and IOAPIC settings?

Thanks
Vivek

2007-08-08 10:35:58

by Vivek Goyal

[permalink] [raw]
Subject: Re: PATCH/RFC: [kdump] fix APIC shutdown sequence

On Tue, Aug 07, 2007 at 07:41:30PM +0200, Martin Wilck wrote:
[..]
>
> Such a situation has never been observed in the "good" case.
> So, we do have some evidence, not just bare speculation.
>
> >> 2. The crashing CPU itself disables its local APIC
> >> before the IO-APIC, leaving a short time window
> >> where the IOAPIC can receive IRQs, but not
> >> deliver them.
> >>
> >
> > I doubut that it would be the issue. Looking at intel IOAPIC (82093AA)
> > documentation, it says that IRR bit of IOAPIC will be set only if
> > destination CPU has accepted the interrupt. So if we have disabled
> > the LAPIC, it will not accept the interrupt and IRR bit of IOAPIC
> > should not be set.
>
> Can you explain how, on the front side bus, the IO-APIC knows whether
> a CPU has accepted the INT message? There is no response
> to the INT message on the bus, except for the EOI which comes much later.
> I'm not saying that you're wrong, I just really don't understand this
> point.
>

I don't know what is exactly hardware protocol. I am just going by
intel documentation.

Remote IRR—RO. This bit is used for level triggered interrupts. Its meaning
is undefined for edge triggered interrupts. For level triggered interrupts,
this bit is set to 1 when local APIC(s) accept the level interrupt sent by
the IOAPIC. The Remote IRR bit is set to 0 when an EOI message with a matching
interrupt vector is received from a local APIC.

According to this, IRR bit is set to one only when Local APIC accepts the
level trigger interrupt. This implies there should be a way of IOAPIC
knowing that destination local apic has accepted the interrupt.

Anyway, I think after disabling the LAPIC, it should not accept more
interrupt from IOAPIC. It will only deliver pending interrupts to CPU
if CPU has not masked it. So in your case it looks like we have got
pending interrupts at LAPIC which never received EOI in first kernel.
Second kernel will field these interrupts and will mark as spurious
interrupt and issue EOI. But by that time IOAPIC does not have vector
info so will not clear IRR bit.

> In the logical analyzer, we can't see when exactly the local APICs are
> disabled. But we see that IRQs arriving after the IO APIC pin is
> masked never do any harm, while IRQs arriving "during the shutdown
> sequence" (we can see e.g. the 2nd CPU taking the bus after the NMI
> IPI) cause the error situation.
>
> >> 3. An IRQ is received and delivered to a local APIC, but
> >> no CPU ever executes the IRQ handler and therefore no
> >> EOI is sent.
> >>
> >
> > We do issue EOI for all the pending interrupts in second
> > kernel. Look at setup_local_APIC(). Once the second is booting, it
> > checks if there are any pending interrupts (ISR bit is set). If yes,
> > it goes ahead and issues an extra EOI. This should also clear the
> > IRR register of IOAPIC.
>
> In an earlier patch, I tried to add that same code in
> machine_crash_shutdown() and crash_nmi_callback(), in order
> to send EOIs for pending IRQs on all CPUs. Unfortunately,
> that had no effect.
>

So you tried issuing EOI to all the pending interrupts in first
kernel. Did you check that at the time of issuing EOI, corresponding
vector bits were set in ISR/IRR at LAPIC and IRR was set at IOAPIC?
If yes, then only some hardware guy can tell us that why issuing an EOI
did not clear the IRR bit at IOAPIC.

I have never used a trace analyzer. Can you really parse the EOI message
and see what are the contents? I mean in terms of making sure right
vector info is there so that IOAPIC can reset IRR?


> > disable_IO_APIC() code does not clear the vector information
> > in routing table. It just masks the interrupt. So even if
> > an EOI is issued later in second kernel, it should clear the
> > IRR bit at IOAPIC.
>
> Hmm... ioapic_mask_entry() writes
> "union entry_union eu = { .entry.mask = 1 }" to the LVT register.
> That clears all bits except the mask bit, so that the vector information
> is lost. Please correct me if I'm mistaken.
>

Sorry, you are right. I read the code too fast. IOAPIC entries will be
cleared and precisely that seems to be the reason why issuing an EOI
in second kernel will not help this situation in its current form.

> >> c) There are indications that besides the EOI, it's also
> >> necessary that the PCI IRQ pin is deasserted at least for
> >> a short time.
>
> > I doubt this. There are situations when there is no device
> > driver for the device and device pushes the interrupt (frequently
> > observed in the case of kdump). Kernel still keeps on receiving
> > the interrupt without driver telling device to lower the interrupt
> > line.
>
> So far I haven't come up with a patch that just sends EOI without
> actually calling any HW IRQ handler. That would clarify this question.
> It's on my todo list.
>

Isn't the "nobody cared for irq x" situation similar. Some device has
asserted a level triggered interrupt and there is no respective device
driver. So kernel sees a flurry of interrupts. (issues and EOI but
immediately sees next interrupt as device has not de-asserted the interrupt
line)?

[..]
> > - Can you please print local apic (print_local_APIC) and
> > ioapic registers (print_IO_APIC) and verify above theory?
>
> We always see the IO-APIC IRR bit in the error situation, before and
> after the start of the kdump kernel.
>
> *Before* the kdump kernel starts (more precisely: before the call
> to disable_IO_APIC()), the IO-APIC "delivery status" bit is also set.
>
> I checked local APIC ISR and IRR bits in an earlier version of my patch
> (see above). They were sometimes set, and sometimes not (unlike the IO-APIC
> IRR/Delivery Status which behave always the same).
>

This will be interesting. So you are saying that there are cases where
IRR bit is set at IOAPIC but corresponding IRR/ISR bit is not set at
LAPIC? If that is the case then even enabling the interrupts will not
help? Because IRR/ISR bit is not set, CPU will never receive an interrupt
and it will never issue an EOI for that vector? I am not sure in such cases
how will your patch solve the issue.

I think we should not be enabling the interrupt after the crash. If needed,
we should just issue the EOI and it should work. Otherwise we need to get
in touch with hardware folks to tell us what is going on.

Thanks
Vivek

2007-08-08 11:38:35

by Vivek Goyal

[permalink] [raw]
Subject: Re: PATCH/RFC: [kdump] fix APIC shutdown sequence

On Mon, Aug 06, 2007 at 05:08:05PM +0200, Martin Wilck wrote:
> PATCH/RFC: [kdump] fix APIC shutdown sequence
>
> This patch fixes a problem that we have encountered
> with kdump under high I/O load on some machines.
> The machines showing the errors have an Intel ICH7
> chip set with a 6702PXH PCI Express-to-PCI Bridge
> (8086:032c) containing an IO-APIC.
>
> The bug symptom is that certain controllers connected
> to the 6702PXH bridge wouldn't receive any IRQs in the
> kdump kernel. In the error case (which is about 20% of
> all cases) the IRR bit of the IO-APIC pin for that
> controller is always set after the start of the kdump
> kernel, indicating an IRQ in progress. We haven't found
> a way to recover from this situation when it has once
> occured, except for a system reset.
>
> The error is caused by IRQs arriving while the APIC
> subsystem is deactivated in machine_crash_shutdown().
>
> Apparently, the IO-APIC gets stuck if it sends an IRQ
> message to a Local APIC and never receives an EOI for that
> message. This can have several possible reasons:
>

Got this oops while testing your patch when I did
"echo c > /proc/sysrq-trigger"

[root@llm37 ~]# SysRq : Trigger a crashdump
Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
[<0000000000000000>]
PGD 22229b067 PUD 0
Oops: 0000 [1] SMP
CPU 0
Modules linked in:
Pid: 2947, comm: klogd Not tainted 2.6.23-rc1-apic-issue #2
RIP: 0010:[<0000000000000000>] [<0000000000000000>]
RSP: 0018:ffffffff807daf50 EFLAGS: 00010046
RAX: 0000000000000000 RBX: ffffffff8073f480 RCX: ffffffff807ddea0
RDX: ffffffff807717c0 RSI: ffffffff8073f480 RDI: 0000000000000000
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: ffffffff807ddd28 R11: 0000000000000150 R12: 0000000000000000
R13: ffffffff8073f4d0 R14: 000000000000000c R15: 0000000000000000
FS: 00002b204ea686f0(0000) GS:ffffffff8073c000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000223e9e000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process klogd (pid: 2947, threadinfo ffff81021b7b8000, task ffff8102251a27b0)
Stack: ffffffff8025afd0 0000000000000046 ffffffff807dddf8 0000000000000000
0000000000000030 0000000000000000 ffffffff8020e1ae ffffffff807d15c8
0000000000000000 ffffffff807dde20 0000000000000000 ffffffff807ddee8
Call Trace:
<IRQ> [<ffffffff8025afd0>] handle_edge_irq+0x5c/0x127
[<ffffffff8020e1ae>] do_IRQ+0xf1/0x15f
[<ffffffff8020c191>] ret_from_intr+0x0/0xa
<EOI> <NMI> [<ffffffff80351ae3>] __delay+0x6/0x10
[<ffffffff8021e328>] crash_nmi_callback+0x4b/0x77
[<ffffffff80557075>] notifier_call_chain+0x29/0x4c
[<ffffffff80249a11>] notify_die+0x2d/0x34
[<ffffffff805555a3>] default_do_nmi+0x55/0x197
[<ffffffff80555f0e>] do_nmi+0x2e/0x44
[<ffffffff805553af>] nmi+0x7f/0x90
[<ffffffff8022e00a>] find_busiest_group+0x20e/0x6b3
<<EOE>> [<ffffffff80552d13>] __sched_text_start+0x1cb/0x5ba
[<ffffffff80282294>] do_sync_write+0xc9/0x10c
[<ffffffff80235c32>] do_syslog+0x11d/0x3a9
[<ffffffff80246a03>] autoremove_wake_function+0x0/0x2e
[<ffffffff802731c6>] free_pages_and_swap_cache+0x73/0x8f
[<ffffffff802bbeac>] kmsg_read+0x3a/0x44
[<ffffffff802b5245>] proc_reg_read+0x7e/0x99
[<ffffffff80282b08>] vfs_read+0xaa/0x132
[<ffffffff80282ea4>] sys_read+0x45/0x6e
[<ffffffff8020bc7e>] system_call+0x7e/0x83


Code: Bad RIP value.
RIP [<0000000000000000>]
RSP <ffffffff807daf50>
CR2: 0000000000000000

Thanks
Vivek

2007-08-08 12:04:55

by Martin Wilck

[permalink] [raw]
Subject: Re: PATCH/RFC: [kdump] fix APIC shutdown sequence

Hi Vivek,

>>> How bad is it if you just run with irqpoll in the kdump kernel?
>>> If running with irqpoll is usable that is probably preferable
>>> to putting in a hardware work around we can survive without.
>> Yes, I tried that. No effect.
>>
>
> Martin, at least irpoll should have worked. I am assuming your timer
> interrupts are coming in second kernel. In that case we are not
> dependent at all on actually receiving device interrupt. Polling should
> take care of it.

You are right. I just tested irqpoll again , and it does works even if the error
(detected by the IRR bit set in the IO-APIC) occurs.

I have no idea what went wrong when I tried "irqpoll" last time. But I was
using a different kernel, controller firmware, driver, and HW configuration,
so it can probably be explained somehow. Unfortunately, the unsuccessful early
attempts caused me to conclude prematurely that "irqpoll" didn't help. I admit
I didn't understand "irqpoll" fully until just now.

> What is that device which is not working? What is the success criterion?

It's a LSI megaraid_sas "zero channel RAID" (ZCR) controller. The system has an
on-board LSI 1068 (mptsas). If you put the ZCR in a certain PCI slot, the
1068 is hidden from the system, which sees the megaraid_sas controller
(1000:0413) instead of the 1068. The ZCR internally uses the 1068 as low-level
controller.

The success criterion was that the disks on the ZCR were successfully detected
and the dump was written.

Martin

--
Martin Wilck
PRIMERGY System Software Engineer
FSC IP ESP DE6

Fujitsu Siemens Computers GmbH
Heinz-Nixdorf-Ring 1
33106 Paderborn
Germany

Tel: ++49 5251 8 15113
Fax: ++49 5251 8 20409
Email: mailto:[email protected]
Internet: http://www.fujitsu-siemens.com
Company Details: http://www.fujitsu-siemens.com/imprint.html

2007-08-08 14:07:30

by Chip Coldwell

[permalink] [raw]
Subject: Re: PATCH/RFC: [kdump] fix APIC shutdown sequence

On Wed, 8 Aug 2007, Vivek Goyal wrote:

> On Tue, Aug 07, 2007 at 07:41:30PM +0200, Martin Wilck wrote:
> >
> > Can you explain how, on the front side bus, the IO-APIC knows whether
> > a CPU has accepted the INT message? There is no response
> > to the INT message on the bus, except for the EOI which comes much later.
> > I'm not saying that you're wrong, I just really don't understand this
> > point.
> >
>
> I don't know what is exactly hardware protocol. I am just going by
> intel documentation.

I think it's important to distinguish between the LAPIC receiving an
interrupt and the CPU receiving an interrupt. The former could happen
without the latter if the CPU has set the TPR above the priority of
the interrupt received by the LAPIC. In that case, the interrupt is
kept pending in the LAPIC and recorded in the IRR if I understand the
Intel documentation correctly.

So I think the scenario which leaves IRR set when the kdump kernel
starts is possible.

Chip

--
Charles M. "Chip" Coldwell
Senior Software Engineer
Red Hat, Inc
978-392-2426

2007-08-08 14:43:24

by Vivek Goyal

[permalink] [raw]
Subject: Re: PATCH/RFC: [kdump] fix APIC shutdown sequence

On Wed, Aug 08, 2007 at 10:06:13AM -0400, Chip Coldwell wrote:
> On Wed, 8 Aug 2007, Vivek Goyal wrote:
>
> > On Tue, Aug 07, 2007 at 07:41:30PM +0200, Martin Wilck wrote:
> > >
> > > Can you explain how, on the front side bus, the IO-APIC knows whether
> > > a CPU has accepted the INT message? There is no response
> > > to the INT message on the bus, except for the EOI which comes much later.
> > > I'm not saying that you're wrong, I just really don't understand this
> > > point.
> > >
> >
> > I don't know what is exactly hardware protocol. I am just going by
> > intel documentation.
>
> I think it's important to distinguish between the LAPIC receiving an
> interrupt and the CPU receiving an interrupt. The former could happen
> without the latter if the CPU has set the TPR above the priority of
> the interrupt received by the LAPIC. In that case, the interrupt is
> kept pending in the LAPIC and recorded in the IRR if I understand the
> Intel documentation correctly.
>
> So I think the scenario which leaves IRR set when the kdump kernel
> starts is possible.

Hi Chip,

That's true. I agree that once kdump kernel starts we very well can be
in a situation where ISR/IRR bits of LAPIC are set and IRR bit of IOAPIC
is set. These are pending interrupt which will be delivered in the
second kernel and that kernel will reject these pending interrupts as spurious
interrupt and send an EOI. This will clear the LAPIC state.

But the issue here seems to be that LAPIC state got clear but IRR bit
at IOAPIC bit is not cleared because IOAPIC vector information was deleted
in first kernel and now upon receiving EOI, it does not know this EOI belongs
to which vector.

Given the fact that "irqpoll" makes things work, I think we should not
complicate the logic and leave it like that. Anyway, our goal is to capture
the dump and not make sure in second kernel interrupt from all the devices
are coming.

Otherwise we shall have to resort to techniques like first masking all
the interrupts at IOAPIC level, then issuing EOI on all the cpus in the
first kernel itself. It makes the logic little twisted.

Thanks
Vivek

2007-08-08 15:25:29

by Eric W. Biederman

[permalink] [raw]
Subject: Re: PATCH/RFC: [kdump] fix APIC shutdown sequence

Martin Wilck <[email protected]> writes:

> Hello Eric,
>
>> How bad is it if you just run with irqpoll in the kdump kernel?
>> If running with irqpoll is usable that is probably preferable
>> to putting in a hardware work around we can survive without.
>
> Yes, I tried that. No effect.

Ok. Later in the thread it sounds like you have retried this and
irqpoll is working now.

>> Have you done any looking at moving where the kernel initalizes
>> io_apics? One of the todo items on the path is to leave
>> io_apic mode enabled and just startup the kernel in io_apic
>> mode.
>
> I have tried to recover from the "IRR set" situation in several ways by
> changing setup_IO_APIC_irq(). But I haven't found a way to recover from
> this situation once disable_IO_APIC() had been called.

Yes. The long term goal is to remove the need for calling
disable_IO_APIC(). Because that makes the code simpler etc.

Once we get the kernel to the point where it can start in
ioapic mode (and not in i8259 mode) we can remove the
disabled code from the kexec on panic path.

> I concluded thatthe sequence of events
> "send INT message - never receive EOI - disable IO-APIC pin"
> messes up the IO-APIC (at least this specific one in the
> PCIEx-PCI bridge of the ICH7).

It is quite possible. I have observed a lot of obscure bugs in the
corner cases of the state machines, although it is possible
this is correct behavior and it is just specific to level
triggered interrupts which are almost exclusively not on
the first ioapic in a system like you describe.

I suspect the issue is that we never send the EOI message from
the local apic, and so it waits forever. Or that we have
reprogrammed the vectors by the time we send the EOI message
so that the EOI and the ioapic don't agree on the vector
number when the EOI message is sent. Grumble silly level
triggered interrupts grumble.

Eric

2007-08-08 17:35:25

by Martin Wilck

[permalink] [raw]
Subject: Re: PATCH/RFC: [kdump] fix APIC shutdown sequence

Eric W. Biederman wrote:

> Ok. Later in the thread it sounds like you have retried this and
> irqpoll is working now.

Yes. I'd give a lot to know what went wrong when I tried that in April.
It'd have saved me many hours of work if I had discovered this workaround
before.

>>> Have you done any looking at moving where the kernel initalizes
>>> io_apics? One of the todo items on the path is to leave
>>> io_apic mode enabled and just startup the kernel in io_apic
>>> mode.
>> I have tried to recover from the "IRR set" situation in several ways by
>> changing setup_IO_APIC_irq(). But I haven't found a way to recover from
>> this situation once disable_IO_APIC() had been called.
>
> Yes. The long term goal is to remove the need for calling
> disable_IO_APIC(). Because that makes the code simpler etc.

I think a lot would be gained if disable_IO_APIC() would just mask the IRQs
(like the function in my patch does), and perhaps fix the dest ID, instead of
totally clearing the registers.

Moreover, it'd be reasonable to separate out the code that restores virtual
wire mode from disable_IO_APIC().

> It is quite possible. I have observed a lot of obscure bugs in the
> corner cases of the state machines, although it is possible
> this is correct behavior and it is just specific to level
> triggered interrupts which are almost exclusively not on
> the first ioapic in a system like you describe.

Even if my patch in the form in which I submitted it is unusable,
I think the basic idea that IRQs should be masked bottom-up
(IO-APIC first, then local APIC, then CPU) is correct.

Or is there any specific reason why the current code does it vice-versa?

Martin


--
Martin Wilck
PRIMERGY System Software Engineer
FSC IP ESP DE6

Fujitsu Siemens Computers GmbH
Heinz-Nixdorf-Ring 1
33106 Paderborn
Germany

Tel: ++49 5251 8 15113
Fax: ++49 5251 8 20409
Email: mailto:[email protected]
Internet: http://www.fujitsu-siemens.com
Company Details: http://www.fujitsu-siemens.com/imprint.html

2007-08-08 18:01:48

by Eric W. Biederman

[permalink] [raw]
Subject: Re: PATCH/RFC: [kdump] fix APIC shutdown sequence

Martin Wilck <[email protected]> writes:

> Eric W. Biederman wrote:
>
>> Ok. Later in the thread it sounds like you have retried this and
>> irqpoll is working now.
>
> Yes. I'd give a lot to know what went wrong when I tried that in April.
> It'd have saved me many hours of work if I had discovered this workaround
> before.

Yes that is odd.

>>>> Have you done any looking at moving where the kernel initalizes
>>>> io_apics? One of the todo items on the path is to leave
>>>> io_apic mode enabled and just startup the kernel in io_apic
>>>> mode.
>>> I have tried to recover from the "IRR set" situation in several ways by
>>> changing setup_IO_APIC_irq(). But I haven't found a way to recover from
>>> this situation once disable_IO_APIC() had been called.
>>
>> Yes. The long term goal is to remove the need for calling
>> disable_IO_APIC(). Because that makes the code simpler etc.
>
> I think a lot would be gained if disable_IO_APIC() would just mask the IRQs
> (like the function in my patch does), and perhaps fix the dest ID, instead of
> totally clearing the registers.

Even masked we still won't see the EOI, because then we are in i8259
mode. So masked versus totally cleared really should make no
difference.

> Moreover, it'd be reasonable to separate out the code that restores virtual
> wire mode from disable_IO_APIC().

Possibly. If we were to change the order probably.

>> It is quite possible. I have observed a lot of obscure bugs in the
>> corner cases of the state machines, although it is possible
>> this is correct behavior and it is just specific to level
>> triggered interrupts which are almost exclusively not on
>> the first ioapic in a system like you describe.
>
> Even if my patch in the form in which I submitted it is unusable,
> I think the basic idea that IRQs should be masked bottom-up
> (IO-APIC first, then local APIC, then CPU) is correct.
>
> Or is there any specific reason why the current code does it vice-versa?

I haven't looked. My guess is that in the normal case it doesn't matter
because no irqs are alive and the kexec on panic case just copied that.

I guess right now I can see work put into cleaning things up a little
or not having to call disable_IO_APIC() at all. that was supposed
to be a short term hack, until we fixed things properly.

So far no one has been brave enough to mess with the kernel apic
initialization order so we can remove disable_IO_APIC. If our
goal isn't to fix things properly by removing the need for
disable_IO_APIC I don't feel like putting much effort into this
code path.

Eric

2007-08-08 18:07:28

by Martin Wilck

[permalink] [raw]
Subject: Re: PATCH/RFC: [kdump] fix APIC shutdown sequence

Vivek Goyal wrote:

> Got this oops while testing your patch when I did
> "echo c > /proc/sysrq-trigger"

That's bad :-(

...
> Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
[<0000000000000000>]
> <IRQ> [<ffffffff8025afd0>] handle_edge_irq+0x5c/0x127
> [<ffffffff8020e1ae>] do_IRQ+0xf1/0x15f
> [<ffffffff8020c191>] ret_from_intr+0x0/0xa
> <EOI> <NMI> [<ffffffff80351ae3>] __delay+0x6/0x10
> [<ffffffff8021e328>] crash_nmi_callback+0x4b/0x77

Obviously, it the oops occurs after the IRQs are re-enabled in crash_nmi_callback().
It appears that desc->chip->mask (in mask_ack_irq(), kernel/irq/chip.c:462) was NULL.
I have no idea how my patch could cause that.

This proves that re-enabling IRQs in this situation is too dangerous :-(.
I really did hundreds of dumps in different ways (echo c>/proc/sysrq-trigger,
forcing Oops and panic(), actually hititng Alt-Sysrq) and this never happened to
me.

Martin

--
Martin Wilck
PRIMERGY System Software Engineer
FSC IP ESP DE6

Fujitsu Siemens Computers GmbH
Heinz-Nixdorf-Ring 1
33106 Paderborn
Germany

Tel: ++49 5251 8 15113
Fax: ++49 5251 8 20409
Email: mailto:[email protected]
Internet: http://www.fujitsu-siemens.com
Company Details: http://www.fujitsu-siemens.com/imprint.html

2007-08-08 18:15:47

by Martin Wilck

[permalink] [raw]
Subject: Re: PATCH/RFC: [kdump] fix APIC shutdown sequence

Vivek Goyal wrote:

> But the issue here seems to be that LAPIC state got clear but IRR bit
> at IOAPIC bit is not cleared because IOAPIC vector information was deleted
> in first kernel and now upon receiving EOI, it does not know this EOI belongs
> to which vector.

I am making another experiment right now. I check the LAPIC ISR flags in
machine_crash_shutdown() before disabling the LAPIC, trying to send EOI if
I find one ISR bit set. So far, I haven't seen a single case where an ISR bit
was set, but several (9) where the IO-APIC IRR bit was set. OTOH, I know that if
I'd re-enable IRQs, IRQ 225 would be raised. The whole thing is done before
disable_IO_APIC().

Don't ask me for an explanation why I don't see the ISR bits.

The arch/x86_64/kernel/crash.c portion of the patch I am using is attached for
your reference. The rest is the same as for the original patch.

Martin

--
Martin Wilck
PRIMERGY System Software Engineer
FSC IP ESP DE6

Fujitsu Siemens Computers GmbH
Heinz-Nixdorf-Ring 1
33106 Paderborn
Germany

Tel: ++49 5251 8 15113
Fax: ++49 5251 8 20409
Email: mailto:[email protected]
Internet: http://www.fujitsu-siemens.com
Company Details: http://www.fujitsu-siemens.com/imprint.html


Attachments:
kdump_23.diff (3.83 kB)

2007-08-08 18:23:17

by Martin Wilck

[permalink] [raw]
Subject: Re: PATCH/RFC: [kdump] fix APIC shutdown sequence

Eric W. Biederman wrote:

>> I think a lot would be gained if disable_IO_APIC() would just mask the IRQs
>> (like the function in my patch does), and perhaps fix the dest ID, instead of
>> totally clearing the registers.
>
> Even masked we still won't see the EOI, because then we are in i8259
> mode. So masked versus totally cleared really should make no
> difference.

You are talking about EOI sent in the kdump kernel, correct? With my patch
the EOI (sent in the "old" kernel) is seen.

> I guess right now I can see work put into cleaning things up a little
> or not having to call disable_IO_APIC() at all. that was supposed
> to be a short term hack, until we fixed things properly.
>
> So far no one has been brave enough to mess with the kernel apic
> initialization order so we can remove disable_IO_APIC. If our
> goal isn't to fix things properly by removing the need for
> disable_IO_APIC I don't feel like putting much effort into this
> code path.

Who is working on this? Can you give a reference to a list of things that
would need to be done to do this right?

Martin

--
Martin Wilck
PRIMERGY System Software Engineer
FSC IP ESP DE6

Fujitsu Siemens Computers GmbH
Heinz-Nixdorf-Ring 1
33106 Paderborn
Germany

Tel: ++49 5251 8 15113
Fax: ++49 5251 8 20409
Email: mailto:[email protected]
Internet: http://www.fujitsu-siemens.com
Company Details: http://www.fujitsu-siemens.com/imprint.html

2007-08-08 18:38:51

by Martin Wilck

[permalink] [raw]
Subject: Re: PATCH/RFC: [kdump] fix APIC shutdown sequence

Eric W. Biederman wrote:

>>> irqpoll is working now.
>> Yes. I'd give a lot to know what went wrong when I tried that in April.
>> It'd have saved me many hours of work if I had discovered this workaround
>> before.
>
> Yes that is odd.

Got it. At that time I was using "noirqdebug" because we had a FW problem that
caused a lot of misrouted IRQs. With "noirqdebug", "irqpoll" has no effect.
Bummer.

Martin

--
Martin Wilck
PRIMERGY System Software Engineer
FSC IP ESP DE6

Fujitsu Siemens Computers GmbH
Heinz-Nixdorf-Ring 1
33106 Paderborn
Germany

Tel: ++49 5251 8 15113
Fax: ++49 5251 8 20409
Email: mailto:[email protected]
Internet: http://www.fujitsu-siemens.com
Company Details: http://www.fujitsu-siemens.com/imprint.html

2007-08-08 21:28:41

by Eric W. Biederman

[permalink] [raw]
Subject: Re: PATCH/RFC: [kdump] fix APIC shutdown sequence

Martin Wilck <[email protected]> writes:

> Vivek Goyal wrote:
>
>> Got this oops while testing your patch when I did
>> "echo c > /proc/sysrq-trigger"
>
> That's bad :-(
>
> ...
>> Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
> [<0000000000000000>]
>> <IRQ> [<ffffffff8025afd0>] handle_edge_irq+0x5c/0x127
>> [<ffffffff8020e1ae>] do_IRQ+0xf1/0x15f
>> [<ffffffff8020c191>] ret_from_intr+0x0/0xa
>> <EOI> <NMI> [<ffffffff80351ae3>] __delay+0x6/0x10
>> [<ffffffff8021e328>] crash_nmi_callback+0x4b/0x77
>
> Obviously, it the oops occurs after the IRQs are re-enabled in
> crash_nmi_callback().
> It appears that desc->chip->mask (in mask_ack_irq(), kernel/irq/chip.c:462) was
> NULL.
> I have no idea how my patch could cause that.
>
> This proves that re-enabling IRQs in this situation is too dangerous :-(.
> I really did hundreds of dumps in different ways (echo c>/proc/sysrq-trigger,
> forcing Oops and panic(), actually hititng Alt-Sysrq) and this never happened to
> me.

This reminds me.

There is currently an issue that no has finished converting
the lapic to the new generic irq way. Which can have this effect.

Eric

2007-08-09 10:11:31

by Vivek Goyal

[permalink] [raw]
Subject: Re: PATCH/RFC: [kdump] fix APIC shutdown sequence

On Wed, Aug 08, 2007 at 08:15:32PM +0200, Martin Wilck wrote:
> Vivek Goyal wrote:
>
> > But the issue here seems to be that LAPIC state got clear but IRR bit
> > at IOAPIC bit is not cleared because IOAPIC vector information was deleted
> > in first kernel and now upon receiving EOI, it does not know this EOI belongs
> > to which vector.
>
> I am making another experiment right now. I check the LAPIC ISR flags in
> machine_crash_shutdown() before disabling the LAPIC, trying to send EOI if
> I find one ISR bit set. So far, I haven't seen a single case where an ISR bit
> was set, but several (9) where the IO-APIC IRR bit was set. OTOH, I know that if
> I'd re-enable IRQs, IRQ 225 would be raised. The whole thing is done before
> disable_IO_APIC().
>
> Don't ask me for an explanation why I don't see the ISR bits.
>

Did you also check IRR bits on LAPIC. May be some interrupt is already
being served and your new interrupts has been queued on LAPIC and IRR bit
on LAPIC is set?

Following is the text from intel system programming manual.

The IRR contains the active interrupt requests that have been accepted, but
not yet dispatched to the processor for servicing. When the local APIC accepts
an interrupt, it sets the bit in the IRR that corresponds the vector of the
accepted interrupt. When the processor core is ready to handle the next
interrupt, the local APIC clears the highest priority IRR bit that is set and
sets the corresponding ISR bit. The vector for the highest priority bit set in
the ISR is then dispatched to the processor core for servicing.

In this case, I suspect that your interrupt gets queued at LAPIC (IRR is set)
but it does not make into ISR (may be because interrupts are disabled and
some other vector is already being served. Can you please paste your LAPIC
and IOAPIC state here.

Thanks
Vivek

2007-08-09 17:35:27

by Martin Wilck

[permalink] [raw]
Subject: Re: PATCH/RFC: [kdump] fix APIC shutdown sequence

Vivek Goyal wrote:

> Did you also check IRR bits on LAPIC. May be some interrupt is already
> being served and your new interrupts has been queued on LAPIC and IRR bit
> on LAPIC is set?

That's it. Whenever the IO-APIC IRR bit is set, I see the LAPIC IRR bit set, too.
I never see any ISR bits set. Very rarely I see that IRQs are IRQ_PENDING or
IRQ_INPROGRESS, but that's apparently unrelated to the problem.

Actually, I often see APIC IRR bits set, but it only seems to matter for
IRQs coming from the secondary IO-APIC.

Unfortunately, just writng APIC_EOI in this situation (when an IRR bit is set)
has no effect.

I have tried to set the IRQ_DISABLED flag for all IRQs. From my understanding,
if I re-enable IRQs, that disables the actual IRQ handlers, but the ack()
function of the IRQ chip (which, in our case, sends EOI) is called anyway.
*That appears to work*, in the kdump kernel the IRR is cleared.

I'll run an overnight test with that technique tonight.

Martin

--
Martin Wilck
PRIMERGY System Software Engineer
FSC IP ESP DE6

Fujitsu Siemens Computers GmbH
Heinz-Nixdorf-Ring 1
33106 Paderborn
Germany

Tel: ++49 5251 8 15113
Fax: ++49 5251 8 20409
Email: mailto:[email protected]
Internet: http://www.fujitsu-siemens.com
Company Details: http://www.fujitsu-siemens.com/imprint.html