2009-09-21 18:42:24

by Remy Bohmer

Subject: 2.6.31-rt11 freeze on userland start on ARM

Hi all,

I am integrating the 2.6.31-rt11 kernel on our ARM9-based (Atmel
at91sam9261) board.
The kernel boots fine, but when userland starts the linuxrc process and
the first 'echo' from the /etc/init.d/rcS script is printed to the
serial console (DBGU), the system locks up completely; no character
from userland ever makes it to the terminal.

I found the reason for the lockup and know a workaround, but I could use
some good suggestions to solve it the correct way.

What happens is that the kernel continuously schedules an IRQ thread,
namely IRQ1-atmel_serial, and this IRQ thread keeps getting scheduled
forever...

Looking more closely, I noticed that starting an IRQ thread for this
driver is new compared to 2.6.24/26-RT.
Notice that the DBGU interrupt is called the system interrupt and it
is shared with the timer interrupt. The timer interrupt has IRQF_TIMER
set, which incorporates IRQF_NODELAY. This is different compared to
2.6.24/26, where sharing with an IRQF_NODELAY interrupt would make all
shared handlers also run in IRQF_NODELAY context.
As such, we have here an interrupt handler running as a NODELAY handler
that is shared with an interrupt handler that runs in thread context.

So, as workaround/test I made this change:

Index: linux-2.6.31/drivers/serial/atmel_serial.c
===================================================================
--- linux-2.6.31.orig/drivers/serial/atmel_serial.c	2009-09-21 19:44:48.000000000 +0200
+++ linux-2.6.31/drivers/serial/atmel_serial.c	2009-09-21 19:45:15.000000000 +0200
@@ -808,7 +808,8 @@ static int atmel_startup(struct uart_por
 	/*
 	 * Allocate the IRQ
 	 */
-	retval = request_irq(port->irq, atmel_interrupt, IRQF_SHARED,
+	retval = request_irq(port->irq, atmel_interrupt,
+			IRQF_SHARED | IRQF_NODELAY,
 			tty ? tty->name : "atmel_serial", port);
 	if (retval) {
 		printk("atmel_serial: atmel_startup - Can't get irq\n");
---

This change makes the atmel-serial driver interrupt handler run as an
IRQF_NODELAY handler again, just as on 2.6.24/26, and the board
boots properly again with 2.6.31.
Does anyone have ideas on how to fix it properly, or is anyone
interested in more debugging information? (I have an ETM tracer hooked up...)

Notice that this driver actually needs the NODELAY flag set on
preempt-RT to prevent missing characters with its 1 byte FIFO-hardware
without flow-control ;-) (I will provide a clean patch later)
For now, at least it shows a bug in the new irq-threading mechanisms...

I also have a few related questions, besides investigating the
root cause of this bug:
What is the rationale behind the per-driver IRQ thread? What is the
gain here for RT? My first impression is that this would increase the
latencies when interrupts are shared with NODELAY interrupts. All
handlers need to run, so the master interrupt cannot be enabled again
until all IRQ threads have run, so the NODELAY handler must wait until
all IRQ threads have run. So, giving different priorities to the
IRQ threads that share the same source would increase the latencies
even more.
If different drivers share the same interrupt line, additional
scheduling overhead can be added to the latencies as well...
At first glance the former implementation seems more efficient. I
guess it was changed for a good reason, so I must be missing something
here... I hope someone can explain...

Kind regards,

Remy


2009-09-24 09:35:18

by yi li

Subject: Re: 2.6.31-rt11 freeze on userland start on ARM

I hit a similar problem on Blackfin (BF537) using 2.6.31-rt10 (I made
some local changes to make 2.6.31-rt10 build for Blackfin).
The "init" process tries to print on the serial console, but it can't.

But in my case, I do NOT think the reason is that "kernel continuously
schedules a IRQ-thread, namely IRQ1-atmel_serial".
Instead, the serial TX irq handler thread never gets scheduled, so this
irq handler has no chance to run.

Setting the serial TX/RX irqs to "IRQF_NODELAY" lets the kernel boot. But
this should not be the correct fix.

So this looks like a common issue. Is there any way to debug or fix this?

Regards,
-Yi

On Tue, Sep 22, 2009 at 2:36 AM, Remy Bohmer <[email protected]> wrote:
> Hi all,
>
> I am integrating the 2.6.31-rt11 kernel on our ARM9 based (Atmel
> at91sam9261) board.
> Kernel boots fine but when userland starts the linuxrc process, and
> the first 'echo' from the /etc/init.d/rcS script is printed to the
> serial console (DBGU) the system locks up completely, from userland no
> character ever makes it to the terminal.
>
> I found the reason of the lockup and know a workaround, but I can use
> some good suggestions to solve it the correct way.
>
> What happens is that the kernel continuously schedules a IRQ-thread;
> namely IRQ1-atmel_serial. And this IRQ thread keeps getting scheduled
> forever...
>
> Looking more closely I noticed that it is new compared to 2.6.24/26-RT
> that a IRQ thread is started for this driver.
> Notice that the DBGU interrupt is called the system-interrupt and it
> is shared with the timer interrupt. The timer interrupt has IRQF_TIMER
> set which incorporates IRQF_NODELAY. This is different compared to
> 2.6.24/26 where a sharing with a IRQF_NODELAY interrupt would make all
> shared handlers also run in IRQF_NODELAY context.
> As such we have here a interrupt handler running as NODELAY handler,
> that is shared with a interrupt handler that runs in thread context.
>
> So, as workaround/test I made this change:
>
> Index: linux-2.6.31/drivers/serial/atmel_serial.c
> ===================================================================
> --- linux-2.6.31.orig/drivers/serial/atmel_serial.c	2009-09-21 19:44:48.000000000 +0200
> +++ linux-2.6.31/drivers/serial/atmel_serial.c	2009-09-21 19:45:15.000000000 +0200
> @@ -808,7 +808,8 @@ static int atmel_startup(struct uart_por
>  	/*
>  	 * Allocate the IRQ
>  	 */
> -	retval = request_irq(port->irq, atmel_interrupt, IRQF_SHARED,
> +	retval = request_irq(port->irq, atmel_interrupt,
> +			IRQF_SHARED | IRQF_NODELAY,
>  			tty ? tty->name : "atmel_serial", port);
>  	if (retval) {
>  		printk("atmel_serial: atmel_startup - Can't get irq\n");
> ---
>
> This change makes the atmel-serial driver interrupt handler run as
> IRQF_NODELAY handler again, just as on 2.6.24/26, and the board is
> booting properly again with 2.6.31.
> Anyone any ideas how to fix it properly? Or interested in more
> debugging information. (I have an ETM tracer hooked up...)
>
> Notice that this driver actually needs the NODELAY flag set on
> preempt-RT to prevent missing characters with its 1 byte FIFO-hardware
> without flow-control ;-) (I will provide a clean patch later)
> For now, at least it shows a bug in the new irq-threading mechanisms...
>
> I also have a few related questions, besides investigating the
> root-cause of this bug:
> What is the rationale behind the per-driver irq-thread? What is the
> gain here for RT? My first impression is that this would increase the
> latencies in case of sharing interrupts with NODELAY interrupts. All
> handlers need to run, so the master interrupt cannot be enabled again
> until all IRQ-threads have run, so the NODELAY handler must wait until
> all IRQ-threads have run. So, giving different prios to the
> IRQ-threads that share the same source would increase the latencies
> even more.
> If different drivers share the same interrupt line, even additional
> schedule overhead can be added to the latencies...
> On first impression the former implementation seems more efficient. I
> guess it is changed for a good reason, so, I must be missing something
> here... I hope someone can explain...
>
> Kind regards,
>
> Remy

2009-09-30 12:58:26

by Thomas Gleixner

Subject: Re: 2.6.31-rt11 freeze on userland start on ARM

On Mon, 21 Sep 2009, Remy Bohmer wrote:
> So, as workaround/test I made this change:
>
> Index: linux-2.6.31/drivers/serial/atmel_serial.c
> ===================================================================
> --- linux-2.6.31.orig/drivers/serial/atmel_serial.c	2009-09-21 19:44:48.000000000 +0200
> +++ linux-2.6.31/drivers/serial/atmel_serial.c	2009-09-21 19:45:15.000000000 +0200
> @@ -808,7 +808,8 @@ static int atmel_startup(struct uart_por
>  	/*
>  	 * Allocate the IRQ
>  	 */
> -	retval = request_irq(port->irq, atmel_interrupt, IRQF_SHARED,
> +	retval = request_irq(port->irq, atmel_interrupt,
> +			IRQF_SHARED | IRQF_NODELAY,
>  			tty ? tty->name : "atmel_serial", port);
>  	if (retval) {
>  		printk("atmel_serial: atmel_startup - Can't get irq\n");

The serial irq cannot run in hard irq context.

There are two solutions to this problem:

1) Use the other timer which is available on AT91.

2) Make the serial driver explicitly threaded and check in the
hardirq handler whether the irq originated from the serial driver. If
yes, disable it in the serial device and return IRQ_WAKE_THREAD;
otherwise return IRQ_NONE.
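
Option 2 can be sketched as a small userspace model. This is not the
atmel_serial driver and not kernel code: struct fake_uart, FAKE_RXRDY and
both functions are hypothetical names made up for illustration, and the
enum only mirrors the kernel's irqreturn_t values. In a real driver the
two halves would presumably be registered with request_threaded_irq().

```c
#include <assert.h>
#include <stdint.h>

/* Modelled after the kernel's irqreturn_t values; userspace stand-ins only. */
enum irq_ret { IRQ_NONE = 0, IRQ_HANDLED = 1, IRQ_WAKE_THREAD = 2 };

#define FAKE_RXRDY 0x1u /* hypothetical "receiver ready" event bit */

/* Hypothetical device model: pending events plus a device-level irq mask. */
struct fake_uart {
	uint32_t status;   /* events currently pending in the device */
	uint32_t irq_mask; /* events allowed to assert the shared line */
	unsigned handled;  /* number of events serviced by the thread */
};

/* Hard-irq half: only decide whether the device raised the interrupt. */
static enum irq_ret fake_uart_hardirq(struct fake_uart *u)
{
	uint32_t pending;

	assert(u);
	pending = u->status & u->irq_mask;
	if (!pending)
		return IRQ_NONE; /* not ours, e.g. the timer sharing the line */

	/* Silence the source in the device, not the whole shared line. */
	u->irq_mask &= ~pending;
	return IRQ_WAKE_THREAD;
}

/* Threaded half: do the real work, then re-enable the device interrupt. */
static void fake_uart_thread(struct fake_uart *u)
{
	assert(u);
	if (u->status & FAKE_RXRDY) {
		u->status &= ~FAKE_RXRDY; /* "read" the character */
		u->handled++;
	}
	u->irq_mask |= FAKE_RXRDY;
}
```

Because the hard-irq half masks the event in the device itself, the
shared line stays open for the timer while the thread is waiting to run.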

Thanks,

tglx

2009-09-30 16:10:59

by Remy Bohmer

Subject: Re: 2.6.31-rt11 freeze on userland start on ARM

Hi Thomas,

Thanks for your answer. But it is not entirely clear to me...

2009/9/30 Thomas Gleixner <[email protected]>:
> On Mon, 21 Sep 2009, Remy Bohmer wrote:
>> So, as workaround/test I made this change:
>>
>> Index: linux-2.6.31/drivers/serial/atmel_serial.c
>> ===================================================================
>> --- linux-2.6.31.orig/drivers/serial/atmel_serial.c	2009-09-21 19:44:48.000000000 +0200
>> +++ linux-2.6.31/drivers/serial/atmel_serial.c	2009-09-21 19:45:15.000000000 +0200
>> @@ -808,7 +808,8 @@ static int atmel_startup(struct uart_por
>>  	/*
>>  	 * Allocate the IRQ
>>  	 */
>> -	retval = request_irq(port->irq, atmel_interrupt, IRQF_SHARED,
>> +	retval = request_irq(port->irq, atmel_interrupt,
>> +			IRQF_SHARED | IRQF_NODELAY,
>>  			tty ? tty->name : "atmel_serial", port);
>>  	if (retval) {
>>  		printk("atmel_serial: atmel_startup - Can't get irq\n");
>
> The serial irq cannot run in hard irq context.

Indeed, most drivers cannot, but can you please explain to me why
this particular handler cannot?
Maybe I am missing some new mechanism that prevents it and that was not
there on older RT kernels (tested up to the latest 2.6.24-rt and
2.6.26-rt)...
The atmel_serial IRQ handler was adapted (I think it was mainlined in
2.6.25 already) to make it suitable to run in hard-irq context. (I
know, because I did that myself.)
FWIW... it seems to run stable in hard-irq context on 2.6.31-rt with
the patch above as well... (coincidence?)

> There are two solutions to this problem:
>
> 1) Use the other timer which is available on AT91.

You mean the TC timer? That is what I am actually using as well, but
it does not solve the problem... Hmm, I do not use the clockevent
part of that TC lib, because I needed that 3rd block for something
else (and 32kHz is not a very pleasant timer to use either...).
So, I guess this is not an option for me, and I need to move to the next option...

> 2) Make the serial driver explicitly threaded and check in the
> hardirq handler whether the irq originated from the serial driver. If
> yes, disable it in the serial device and return IRQ_WAKE_THREAD
> otherwise return IRQ_NONE.

Interesting, this sounds new, and I have to dig into it to find out
how this is supposed to work... Do you happen to have any good
pointers for examples or doc?
TOL: could the generic interrupt code not check for this? It can see
the timer interrupt handler not returning 'IRQ_HANDLED', and still see
the interrupt being active, and know that there is an IRQ thread on the
same line, so it can decide by itself to mask the source in the
interrupt controller and wake the thread... Or am I wrong?

Kind regards,

Remy

2009-10-04 12:00:06

by Thomas Gleixner

Subject: Re: 2.6.31-rt11 freeze on userland start on ARM

On Wed, 30 Sep 2009, Remy Bohmer wrote:
> > The serial irq cannot run in hard irq context.
>
> Indeed most drivers cannot, but for this particular handler can you
> please explain me why?
> Maybe I am missing some new mechanism that prevents it that was not
> there on older RT-kernels (tested up-to latest 2.6.24-rt +
> 2.6.26-rt)...

Which had the same problem ....

> The atmel_serial IRQ was adapted such (I think it was mainlined in
> 2.6.25 already) to make it suitable to run in hard-irq context. (I
> know, because I did that myself)

That's fine for mainline but not for -rt.

> FWIW... it seems to run stable in hard-irq context on 2.6.31-rt with
> the patch above as well... (coincidence?)

Yes. The serial code takes locks which are converted to sleeping locks
on -RT. That's a nono.

> > 2) Make the serial driver explicitly threaded and check in the
> > hardirq handler whether the irq originated from the serial driver. If
> > yes, disable it in the serial device and return IRQ_WAKE_THREAD
> > otherwise return IRQ_NONE.
>
> Interesting, this sounds new, and I have to dig into it to find out
> how this is supposed to work... Do you happen to have any good
> pointers for examples or doc?

drivers/net/wireless/b43/main.c in mainline

> TOL: could the generic interrupt code not check for this? It can see
> the timer interrupt handler not returning 'IRQ_HANDLED', and still see
> the interrupt being active, and know that there is a IRQ thread on the
> same line, so it can conclude itself to mask the source in the
> interrupt controller and wake the thread... Or am I wrong?

What happens if both are active at the same time ? Also masking the
interrupt line will block your timer interrupts until the threaded
handler has completed.
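
The second point can be illustrated with a toy model, assuming a serial
interrupt arrives at tick 0 and its thread finishes after thread_latency
ticks. The names mask_policy and timer_ticks_on_time are hypothetical,
and the model ignores everything else an interrupt controller does.

```c
#include <assert.h>
#include <stdbool.h>

/* Where the pending serial interrupt is silenced while its thread waits. */
enum mask_policy {
	MASK_LINE,   /* mask the shared line in the interrupt controller */
	MASK_DEVICE  /* mask only the source inside the serial device */
};

/*
 * Count the timer ticks that get through on time out of total_ticks.
 * The timer asserts the shared line once per tick; the serial thread
 * completes (and unmasks) at tick thread_latency.
 */
static unsigned timer_ticks_on_time(enum mask_policy policy,
				    unsigned thread_latency,
				    unsigned total_ticks)
{
	bool line_masked = (policy == MASK_LINE);
	unsigned on_time = 0;

	assert(thread_latency <= total_ticks);
	for (unsigned t = 0; t < total_ticks; t++) {
		if (t == thread_latency)
			line_masked = false; /* thread done, line unmasked */
		if (!line_masked)
			on_time++; /* timer handler runs in this tick */
	}
	return on_time;
}
```

With MASK_LINE every tick before thread_latency is held back; with
MASK_DEVICE the timer is never affected by the serial thread's latency.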

Thanks,

tglx

2009-10-04 20:59:52

by Remy Bohmer

Subject: Re: 2.6.31-rt11 freeze on userland start on ARM

Hi,

2009/10/4 Thomas Gleixner <[email protected]>:
> On Wed, 30 Sep 2009, Remy Bohmer wrote:
>> > The serial irq cannot run in hard irq context.
>>
>> Indeed most drivers cannot, but for this particular handler can you
>> please explain me why?
>> Maybe I am missing some new mechanism that prevents it that was not
>> there on older RT-kernels (tested up-to latest 2.6.24-rt +
>> 2.6.26-rt)...
>
> Which had the same problem ....

what problem...?

>> The atmel_serial IRQ was adapted such (I think it was mainlined in
>> 2.6.25 already) to make it suitable to run in hard-irq context. (I
>> know, because I did that myself)
>
> That's fine for mainline but not for -rt.

The goal was making its interrupt handler suitable for -rt as well as
mainline (on older kernels...).

> Yes. The serial code takes locks which are converted to sleeping locks
> on -RT. That's a nono.

Yes, I know that spinlocks are converted to mutexes and, as such, they
can sleep, which is not allowed in hard interrupt context.

But maybe I am really overlooking something. The interrupt handler
of the atmel serial driver only reads the characters from
the device and puts them in a local ringbuffer with atomic
instructions; then a tasklet is scheduled to handle the data
outside interrupt context. This tasklet/softirq is scheduled by
wake_up_process(), which is afaik allowed to be called from hard-irq
context.

The interface to the generic serial driver (that uses spinlocks) is
handled by the tasklet (softirq), not by the interrupt handler.
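
The producer/consumer split described above can be sketched as a
lock-free single-producer, single-consumer ring. This is a generic
userspace illustration with C11 atomics, not the actual atmel_serial
code; struct rx_ring, RING_SIZE and both functions are hypothetical
names. The interrupt handler would play the producer role and the
tasklet the consumer role.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define RING_SIZE 16u /* hypothetical size; must be a power of two */

struct rx_ring {
	unsigned char buf[RING_SIZE];
	atomic_uint head; /* free-running index, written only by the producer */
	atomic_uint tail; /* free-running index, written only by the consumer */
};

/* Producer side (the interrupt handler in the description): never blocks. */
static bool rx_ring_put(struct rx_ring *r, unsigned char c)
{
	unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
	unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);

	assert(r);
	if (head - tail == RING_SIZE)
		return false; /* ring full: the character is dropped */

	r->buf[head % RING_SIZE] = c;
	/* Publish the slot only after the data has been written. */
	atomic_store_explicit(&r->head, head + 1, memory_order_release);
	return true;
}

/* Consumer side (the tasklet in the description): never blocks either. */
static bool rx_ring_get(struct rx_ring *r, unsigned char *out)
{
	unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
	unsigned head = atomic_load_explicit(&r->head, memory_order_acquire);

	assert(r && out);
	if (head == tail)
		return false; /* ring empty */

	*out = r->buf[tail % RING_SIZE];
	/* Release the slot only after the data has been read. */
	atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
	return true;
}
```

Neither side takes a lock, which is the property the argument above
relies on; it says nothing about how long the receive loop itself runs
in hard-irq context.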

So, technically, as far as I can see, the atmel-serial driver itself
should be safe to run as a nodelay handler as it is now... (it has been
doing that here for years already).
I do not see any path that conflicts with this rule. So, I still do
not see what is technically wrong with this particular driver, except
that there is a new mechanism available now to solve this differently.

>> > 2) Make the serial driver explicitly threaded and check in the
>> > hardirq handler whether the irq originated from the serial driver. If
>> > yes, disable it in the serial device and return IRQ_WAKE_THREAD
>> > otherwise return IRQ_NONE.
>>
>> Interesting, this sounds new, and I have to dig into it to find out
>> how this is supposed to work... Do you happen to have any good
>> pointers for examples or doc?
>
> drivers/net/wireless/b43/main.c in mainline

Thanks, this is indeed a good, clear example.
I will adapt the atmel-serial driver to this new irq-threading model
and provide a patch for it; it seems cleaner and makes the tasklet
stuff obsolete for this driver.

>> TOL: could the generic interrupt code not check for this? It can see
>> the timer interrupt handler not returning 'IRQ_HANDLED', and still see
>> the interrupt being active, and know that there is a IRQ thread on the
>> same line, so it can conclude itself to mask the source in the
>> interrupt controller and wake the thread... Or am I wrong?
>
> What happens if both are active at the same time ? Also masking the
> interrupt line will block your timer interrupts until the threaded
> handler has completed.

That is indeed true, and proves once again that shared interrupt
handlers are really annoying, especially on -rt...

The old approach in 2.6.24-rt and 2.6.26-rt was to adapt the handler
so that it could run in hard-irq context for -rt as well as mainline.
The new way handles this differently... Since an -rt-friendly generic
solution does not seem possible, I guess before -rt ever becomes mainline
_all_ interrupt handlers that are shared with a nodelay interrupt in
some configuration must be adapted to the new threaded_irq model... A
huge job...

Kind regards,

Remy

2009-10-05 10:34:38

by Thomas Gleixner

Subject: Re: 2.6.31-rt11 freeze on userland start on ARM

On Sun, 4 Oct 2009, Remy Bohmer wrote:

> Hi,
>
> 2009/10/4 Thomas Gleixner <[email protected]>:
> > On Wed, 30 Sep 2009, Remy Bohmer wrote:
> >> > The serial irq cannot run in hard irq context.
> >>
> >> Indeed most drivers cannot, but for this particular handler can you
> >> please explain me why?
> >> Maybe I am missing some new mechanism that prevents it that was not
> >> there on older RT-kernels (tested up-to latest 2.6.24-rt +
> >> 2.6.26-rt)...
> >
> > Which had the same problem ....
>
> what problem...?

The tasklet conversion happened after .24 and the hard interrupt
context receive loop has introduced quite nasty latencies in
.26. There was some other warning problem in .26 which I forgot.

> I will adapt the atmel-serial driver to this new irq-threading model
> and provide a patch for it, it seems cleaner and makes the tasklet
> stuff obsolete for this driver.

Great, that code really looks ugly.

> >> TOL: could the generic interrupt code not check for this? It can see
> >> the timer interrupt handler not returning 'IRQ_HANDLED', and still see
> >> the interrupt being active, and know that there is a IRQ thread on the
> >> same line, so it can conclude itself to mask the source in the
> >> interrupt controller and wake the thread... Or am I wrong?
> >
> > What happens if both are active at the same time ? Also masking the
> > interrupt line will block your timer interrupts until the threaded
> > handler has completed.
>
> That is indeed true, and proves once again that shared interrupt
> handlers are really annoying, especially on -rt...
>
> The old way we did in 2.6.24-rt + 2.6.26-rt was to adapt the handler
> to allow it to run in hard-irq context for -rt as well as mainline.
> The new way handles this differently... Since a -rt-friendly generic
> solution seems not possible, I guess before -rt ever becomes mainline
> _all_ interrupt handlers that are shared with a nodelay interrupt in
> some configuration must be adapted to the new threaded_irq model... A
> huge job...

Don't think so. ATx seems to be one of the weirder pieces of hardware :)

Thanks,

tglx