Hi!
I want to discuss possible race condition in __setup_irq() which is called from request_irq() function.
Unfortunately, we faced real crash with kernel 2.6.32.58 which is too far from current, but potential problem still persist in current linux-next.
This is what happens in 2.6.32.58 on our custom MIPS board:
-----------------------------------------------------------------
CPU 1 Unable to handle kernel paging request at virtual address 0000000000000008, epc == ffffffff801cb9f8, ra == ffffffff801cdef4
Oops[#1]:
Cpu 1
$ 0 : 0000000000000000 ffffffff80660008 0000000000000000 ffffffff805e5570
$ 4 : 000000000000002b 0000000000000000 0000000000080000 ffffffffffff00fe
$ 8 : 0000200300080000 0000000000000000 0000000000000000 a800000001000400
$12 : 000000001000cce0 000000001000001f a800000009400000 0000000000000000
$16 : ffffffff805db500 000000000000002b 8001070000000010 8001070000000238
$20 : 8001070000000220 0000000000000001 0000000000080000 000000000000002b
$24 : 0000000000000353 ffffffff80349ed8..................................
$28 : a8000000007ac000 a8000000007afc10 ffffffff80650000 ffffffff801cdef4
Hi : 0000000000000000
Lo : 0000000000000000
epc : ffffffff801cb9f8 handle_IRQ_event+0x30/0x190
Not tainted
ra : ffffffff801cdef4 handle_percpu_irq+0x54/0xc0
Status: 1000cce2 KX SX UX KERNEL EXL.
Cause : 00800008
BadVA : 0000000000000008
PrId : 000d900a (Cavium Octeon II)
Modules linked in: fsmddg_sfn srio_i2c eeprom bmu_ctrl sfp gpio_bmu leds_bmu led fsmddg_system_driver_mfd_dt fsmddg_paniclog ipv6 cramfs phram [last unloaded: fsmddg_sfn]
Process swapper (pid: 0, threadinfo=a8000000007ac000, task=a8000000007aae38, tls=0000000000000000)
Stack : ffffffff805db500 000000000000002b 8001070000000010 8001070000000238
8001070000000220 0000000000000001 0000000000080000 8001070000000108
ffffffff80650000 ffffffff801cdef4 0000000000000001 000000000000002b
0000000000000001 ffffffff8015f094 000000000000002b ffffffff8011a594
0000000000000000 ffffffff8066c290 ffffffff8066c290 0000000000000002
ffffffff8065a758 ffffffff80650000 ffffffff80650000 800000000fd00000
f5b3fdf8ceff7fff ffffffff80100880 0000000000000000 ffffffff80660008
ffffffff80100a80 0000000000000000 0000000000100000 a80000004e5b7038
000000001000cce0 ffffffffffff00fe 0000000000000001 0000000000000000
0000000000000000 a800000001000400 0000000000000000 000000000000cc00
...
Call Trace:
[<ffffffff801cb9f8>] handle_IRQ_event+0x30/0x190
[<ffffffff801cdef4>] handle_percpu_irq+0x54/0xc0
[<ffffffff8015f094>] do_IRQ+0x2c/0x40
[<ffffffff8011a594>] plat_irq_dispatch+0x10c/0x1e8
[<ffffffff80100880>] ret_from_irq+0x0/0x4
[<ffffffff80100aa0>] r4k_wait+0x20/0x40
[<ffffffff8015fcc4>] cpu_idle+0x9c/0x108
Code: ffb30018 ffb20010 ffb10008 <dca20008> e8450002 00a0802d 41606020 3c028057 02e0982d.
Disabling lock debugging due to kernel taint
Kernel panic - not syncing: Fatal exception in interrupt
Rebooting in 3 seconds..octeon_restart SMP
-----------------------------------------------------------------
The reason for this is that IRQ is enabled before IRQ handler is actually configured. Please take a look at kernel/irq/manage.c of the linux-next, __setup_irq():
1053 if (!shared) {
1054 init_waitqueue_head(&desc->wait_for_threads);
1055
1056 /* Setup the type (level, edge polarity) if configured: */
1057 if (new->flags & IRQF_TRIGGER_MASK) {
1058 ret = __irq_set_trigger(desc, irq,
1059 new->flags & IRQF_TRIGGER_MASK);
1060
1061 if (ret)
1062 goto out_mask;
1063 }
1064
1065 desc->istate &= ~(IRQS_AUTODETECT | IRQS_SPURIOUS_DISABLED | \
1066 IRQS_ONESHOT | IRQS_WAITING);
1067 irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS);
1068
1069 if (new->flags & IRQF_PERCPU) {
1070 irqd_set(&desc->irq_data, IRQD_PER_CPU);
1071 irq_settings_set_per_cpu(desc);
1072 }
1073
1074 if (new->flags & IRQF_ONESHOT)
1075 desc->istate |= IRQS_ONESHOT;
1076
1077 if (irq_settings_can_autoenable(desc))
1078 irq_startup(desc, true);
So in case no IRQ_NOAUTOEN flag was passed to request_irq() function, IRQ will be enabled here.
1079 else
1080 /* Undo nested disables: */
1081 desc->depth = 1;
1082
1083 /* Exclude IRQ from balancing if requested */
1084 if (new->flags & IRQF_NOBALANCING) {
1085 irq_settings_set_no_balancing(desc);
1086 irqd_set(&desc->irq_data, IRQD_NO_BALANCING);
1087 }
1088
1089 /* Set default affinity mask once everything is setup */
1090 setup_affinity(irq, desc, mask);
1091
1092 } else if (new->flags & IRQF_TRIGGER_MASK) {
1093 unsigned int nmsk = new->flags & IRQF_TRIGGER_MASK;
1094 unsigned int omsk = irq_settings_get_trigger_mask(desc);
1095
1096 if (nmsk != omsk)
1097 /* hope the handler works with current trigger mode */
1098 pr_warning("irq %d uses trigger mode %u; requested %u\n",
1099 irq, nmsk, omsk);
1100 }
1101
1102 new->irq = irq;
1103 *old_ptr = new;
Descriptor table for this IRQ will be filled with handler only here!
This code is inside raw_spin_lock_irqsave() protected region, but actually IRQ could be triggered on another core where IRQs are not disabled!
So if IRQ affinity is set up in the way that IRQ itself and request_irq() happen on different cores, IRQ that is already pending in hardware will occur
before it's handler is actually set up.
And this actually happens on our boards. The only reason the topic of the message contains "Possible" is that this race present in kernel for quite a long time and I have not found
any occurrences on other SMP systems than our Octeon. Other possible cause could be wrong usage of request_irq(), but the whole configuration seems to be legal:
IRQ affinity is set to 1 (core 0 processes IRQ).
request_irq() happens during kernel init on core 5.
IRQ is already pending (but not enabled) before request_irq() happens.
IRQ is not shared and should be enabled by request_irq() automatically.
The fix could be like following. It also takes into account, that "desc" structure must be propagated to other cores on SMP, before
first IRQ occurs. Similar fix works fine with 2.6.32.58, as I'm currently unable to test linux-next on Octeon MIPS.
This modification ensures that IRQs are only enabled when their
handler has actually been set up and propagated to other cores on SMP.
Signed-off-by: Alexander Sverdlin <[email protected]>
diff -uprN linux.old/kernel/irq/manage.c linux/kernel/irq/manage.c
--- linux.old/kernel/irq/manage.c 2012-05-16 14:08:38.000000000 +0200
+++ linux/kernel/irq/manage.c 2012-05-16 14:19:16.000000000 +0200
@@ -1074,12 +1074,6 @@ __setup_irq(unsigned int irq, struct irq
if (new->flags & IRQF_ONESHOT)
desc->istate |= IRQS_ONESHOT;
- if (irq_settings_can_autoenable(desc))
- irq_startup(desc, true);
- else
- /* Undo nested disables: */
- desc->depth = 1;
-
/* Exclude IRQ from balancing if requested */
if (new->flags & IRQF_NOBALANCING) {
irq_settings_set_no_balancing(desc);
@@ -1107,11 +1101,35 @@ __setup_irq(unsigned int irq, struct irq
desc->irqs_unhandled = 0;
/*
+ * Enable IRQs ONLY after handler has been already written to the
+ * descriptor, as IRQ can happen immediately on another core
+ */
+ if (!shared) {
+ if (irq_settings_can_autoenable(desc)) {
+ /*
+ * IRQ could immediately fire on another core, so make
+ * sure, that changes to the descriptor are propagated
+ * to other cores
+ */
+ smp_mb();
+ irq_startup(desc, true);
+ } else {
+ /* Undo nested disables: */
+ desc->depth = 1;
+ }
+ }
+
+ /*
* Check whether we disabled the irq via the spurious handler
* before. Reenable it and give it another chance.
*/
if (shared && (desc->istate & IRQS_SPURIOUS_DISABLED)) {
desc->istate &= ~IRQS_SPURIOUS_DISABLED;
+ /*
+ * IRQ could immediately fire on another core, so make sure,
+ * that changes to the descriptor are propagated to other cores
+ */
+ smp_mb();
__enable_irq(desc, irq, false);
}
--
Best regards,
Alexander Sverdlin.
On Wed, 16 May 2012, Alexander Sverdlin wrote:
> [<ffffffff801cb9f8>] handle_IRQ_event+0x30/0x190
> [<ffffffff801cdef4>] handle_percpu_irq+0x54/0xc0
> [<ffffffff8015f094>] do_IRQ+0x2c/0x40
> [<ffffffff8011a594>] plat_irq_dispatch+0x10c/0x1e8
> [<ffffffff80100880>] ret_from_irq+0x0/0x4
> [<ffffffff80100aa0>] r4k_wait+0x20/0x40
> [<ffffffff8015fcc4>] cpu_idle+0x9c/0x108
>
> This code is inside raw_spin_lock_irqsave() protected region, but
> actually IRQ could be triggered on another core where IRQs are not
> disabled!
And that interrupt will spin on desc->lock until this code has set up
the action. So nothing happens at all.
Except for per_cpu interrupts. Now, that's a different issue and you
are doing something completely wrong here.
> So if IRQ affinity is set up in the way that IRQ itself and
> request_irq() happen on different cores, IRQ that is already pending
> in hardware will occur before it's handler is actually set up.
per_cpu interrupts are special.
> And this actually happens on our boards. The only reason the topic
> of the message contains "Possible" is that this race present in
> kernel for quite a long time and I have not found any occurrences on
> other SMP systems than our Octeon. Other possible cause could be
> wrong usage of request_irq(), but the whole configuration seems to
> be legal:
Well, there is no law which forbids doing that.
> IRQ affinity is set to 1 (core 0 processes IRQ).
> request_irq() happens during kernel init on core 5.
> IRQ is already pending (but not enabled) before request_irq() happens.
> IRQ is not shared and should be enabled by request_irq() automatically.
But it's wrong nevertheless.
Your irq is using handle_percpu_irq() as the flow handler.
handle_percpu_irq() is a special flow handler which does not take the
irq descriptor lock for performance reasons. It's a single interrupt
number which has a percpu dev_id and can be handled on all cores in
parallel.
The interrupts need to be marked as such and requested with
request_percpu_irq(). Those interrupts are either marked as
NOAUTOENABLE or set up by the low level setup code, which runs on the
boot cpu with interrupt enabled.
Those interrupts are marked as percpu and can only be requested with
request_percpu_irq().
>From your description it looks like you are using a regular interrupt,
because interrupt affinities of per cpu interrupts cannot be set. They
are hardwired.
I don't know what your archaeologic kernel version is doing there, but
the current cavium code only uses handle_percpu_irq flow handler for a
handful special interrupts which are handled and setup by the cavium
core code correctly.
So nothing to fix here.
Thanks,
tglx
Hello Thomas,
On 05/16/2012 03:44 PM, Thomas Gleixner wrote:
> I don't know what your archaeologic kernel version is doing there, but
> the current cavium code only uses handle_percpu_irq flow handler for a
> handful special interrupts which are handled and setup by the cavium
> core code correctly.
>
> So nothing to fix here.
Thanks for clarification!
I'll clean-up all our IRQ code, remove unnecessary PER_CPU flags and drop a note, how it helped.
--
Best regards,
Alexander Sverdlin.
Hello Thomas, David, Venkat,
On 05/16/2012 03:44 PM, Thomas Gleixner wrote:
> Your irq is using handle_percpu_irq() as the flow handler.
>
> handle_percpu_irq() is a special flow handler which does not take the
> irq descriptor lock for performance reasons. It's a single interrupt
> number which has a percpu dev_id and can be handled on all cores in
> parallel.
>
> The interrupts need to be marked as such and requested with
> request_percpu_irq(). Those interrupts are either marked as
> NOAUTOENABLE or set up by the low level setup code, which runs on the
> boot cpu with interrupt enabled.
>
> Those interrupts are marked as percpu and can only be requested with
> request_percpu_irq().
>
Could someone comment please, why exactly this happens in current linux-next for Octeon:
In arch/mips/cavium-octeon/octeon-irq.c MBOX IRQs are set up to be handled by handle_percpu_irq():
static void __init octeon_irq_init_ciu(void)
{
...
octeon_irq_set_ciu_mapping(OCTEON_IRQ_MBOX0, 0, 32, chip_mbox, handle_percpu_irq);
octeon_irq_set_ciu_mapping(OCTEON_IRQ_MBOX1, 0, 33, chip_mbox, handle_percpu_irq);
But in arch/mips/cavium-octeon/smp.c it's requested as normal IRQ:
void octeon_prepare_cpus(unsigned int max_cpus)
{
...
if (request_irq(OCTEON_IRQ_MBOX0, mailbox_interrupt,
IRQF_PERCPU | IRQF_NO_THREAD, "SMP-IPI",
mailbox_interrupt)) {
panic("Cannot request_irq(OCTEON_IRQ_MBOX0)");
}
Is it a bug, or some kind of special case?
--
Best regards,
Alexander Sverdlin.
On Fri, 25 May 2012, Alexander Sverdlin wrote:
> Hello Thomas, David, Venkat,
>
> On 05/16/2012 03:44 PM, Thomas Gleixner wrote:
> > Your irq is using handle_percpu_irq() as the flow handler.
> >
> > handle_percpu_irq() is a special flow handler which does not take the
> > irq descriptor lock for performance reasons. It's a single interrupt
> > number which has a percpu dev_id and can be handled on all cores in
> > parallel.
> >
> > The interrupts need to be marked as such and requested with
> > request_percpu_irq(). Those interrupts are either marked as
> > NOAUTOENABLE or set up by the low level setup code, which runs on the
> > boot cpu with interrupt enabled.
> >
> > Those interrupts are marked as percpu and can only be requested with
> > request_percpu_irq().
> >
>
> Could someone comment please, why exactly this happens in current linux-next for Octeon:
>
> In arch/mips/cavium-octeon/octeon-irq.c MBOX IRQs are set up to be handled by handle_percpu_irq():
>
> static void __init octeon_irq_init_ciu(void)
> {
> ...
> octeon_irq_set_ciu_mapping(OCTEON_IRQ_MBOX0, 0, 32, chip_mbox, handle_percpu_irq);
> octeon_irq_set_ciu_mapping(OCTEON_IRQ_MBOX1, 0, 33, chip_mbox, handle_percpu_irq);
>
> But in arch/mips/cavium-octeon/smp.c it's requested as normal IRQ:
>
> void octeon_prepare_cpus(unsigned int max_cpus)
> {
> ...
> if (request_irq(OCTEON_IRQ_MBOX0, mailbox_interrupt,
> IRQF_PERCPU | IRQF_NO_THREAD, "SMP-IPI",
> mailbox_interrupt)) {
> panic("Cannot request_irq(OCTEON_IRQ_MBOX0)");
> }
>
> Is it a bug, or some kind of special case?
I forgot the other case where a simple dev_id is used. That one is a
simple per cpu interrupt, which has a regular dev_id. Note the flag:
IRQF_PERCPU. So yes we have two variants, but both are dealing with
per cpu interrupts. The subtle difference is how the dev_id is handled
and how the enable/disable mechanism works.
What are you trying to do/solve ?
Thanks,
tglx
On 05/25/2012 07:01 AM, Alexander Sverdlin wrote:
> Hello Thomas, David, Venkat,
>
> On 05/16/2012 03:44 PM, Thomas Gleixner wrote:
>> Your irq is using handle_percpu_irq() as the flow handler.
>>
>> handle_percpu_irq() is a special flow handler which does not take the
>> irq descriptor lock for performance reasons. It's a single interrupt
>> number which has a percpu dev_id and can be handled on all cores in
>> parallel.
>>
>> The interrupts need to be marked as such and requested with
>> request_percpu_irq(). Those interrupts are either marked as
>> NOAUTOENABLE or set up by the low level setup code, which runs on the
>> boot cpu with interrupt enabled.
>>
>> Those interrupts are marked as percpu and can only be requested with
>> request_percpu_irq().
>>
>
> Could someone comment please, why exactly this happens in current linux-next for Octeon:
>
> In arch/mips/cavium-octeon/octeon-irq.c MBOX IRQs are set up to be handled by handle_percpu_irq():
>
> static void __init octeon_irq_init_ciu(void)
> {
> ...
> octeon_irq_set_ciu_mapping(OCTEON_IRQ_MBOX0, 0, 32, chip_mbox, handle_percpu_irq);
> octeon_irq_set_ciu_mapping(OCTEON_IRQ_MBOX1, 0, 33, chip_mbox, handle_percpu_irq);
>
> But in arch/mips/cavium-octeon/smp.c it's requested as normal IRQ:
>
> void octeon_prepare_cpus(unsigned int max_cpus)
> {
> ...
> if (request_irq(OCTEON_IRQ_MBOX0, mailbox_interrupt,
> IRQF_PERCPU | IRQF_NO_THREAD, "SMP-IPI",
> mailbox_interrupt)) {
> panic("Cannot request_irq(OCTEON_IRQ_MBOX0)");
> }
>
> Is it a bug, or some kind of special case?
>
Probably a bug. But it works, so it has never been an issue.
Patches to bring this code more in line with current Linux irq orthodoxy
would receive prompt review. Especially if there were an explanation of
some problem they were fixing.
David Daney