2007-06-01 18:16:35

by Matt Fredrickson

[permalink] [raw]
Subject: Device Driver Etiquette

Greetings,

I maintain a device driver that has been bitten by the transition to 4K
stacks. It is a T1/E1 line interface PCI card driver that is
maintained outside of the kernel, although is used by a significant
number of people. The card has a part for doing echo cancellation, but
it is accessed through a vendor API which (when we received it) was
quite stack heavy. It used the stack for a number of large data
structures, although I have moved a great deal of them (particularly
the larger ones) onto the heap instead of the stack. Although this has
reduced stack usage to the point where it is usable within 4K stacks,
on some code paths, it can still use quite a bit of stack space (though
under 4K) for local variables and a handful of small data structures.
The problem is that in order to initialize and use the echo canceler,
there is a firmware load portion which takes a noticeable period of
time (~5-10 seconds). That is done through this stack heavy portion of
the code.

The problem that I am seeing is that while this card is loading its
firmware, there are other device interrupts that occur, and if enough
levels of interrupt happen, it overflows the stack. Oh, the firmware
load occurs in the context of an ioctl, FWIW.

My solution was to disable interrupts while the firmware was being
loaded using local_irq_save/local_irq_restore, although that seems to
be a naughty thing to do, especially given the long period of time it
takes to load the firmware on the card.

My question is this: is there a way to either work around the problem I
am seeing with the stack without recompiling the kernel with 8K stack
size or without disabling irqs for such a long period of time (which I
think is not a nice thing to do either) OR is it acceptable (although
not nice) to simply fix it this way, by disabling irqs while it loads
the firmware?

Thanks,
Matthew Fredrickson
Kernel Developer
Digium, Inc.


2007-06-01 18:38:39

by Lee Revell

[permalink] [raw]
Subject: Re: Device Driver Etiquette

On 6/1/07, Matthew Fredrickson <[email protected]> wrote:
> is it acceptable (although
> not nice) to simply fix it this way, by disabling irqs while it loads
> the firmware?
>

I would say to just disable IRQs while loading firmware. Almost every
server I maintain has some vendor driver which generates a "many lost
ticks!" message on load. As long as it's only done at module load
time it should be fine.

Of course the best solution is to just get the driver into mainline.

Lee

2007-06-01 22:28:40

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Device Driver Etiquette

Lee Revell wrote:
> On 6/1/07, Matthew Fredrickson <[email protected]> wrote:
>> is it acceptable (although
>> not nice) to simply fix it this way, by disabling irqs while it loads
>> the firmware?
>>
>
> I would say to just disable IRQs while loading firmware. Almost every
> server I maintain has some vendor driver which generates a "many lost
> ticks!" message on load. As long as it's only done at module load
> time it should be fine.
>
> Of course the best solution is to just get the driver into mainline.
>

Disable interrupts for 5-10 seconds (see OP)? You're nuts.

That might be only a minor disaster on an SMP system, assuming IRQ
balancing sends them elsewhere, but on a uniprocessor system you're
effectively shooting the machine in the head.

It should be possible to streamline the code to not hog the stack that way.

-hpa

2007-06-01 23:16:56

by Daniel J Blueman

[permalink] [raw]
Subject: Re: Device Driver Etiquette

On 1 Jun, 19:40, "Lee Revell" <[email protected]> wrote:
> On 6/1/07, Matthew Fredrickson <[email protected]> wrote:
>
> > is it acceptable (although
> > not nice) to simply fix it this way, by disabling irqs while it loads
> > the firmware?
>
> I would say to just disable IRQs while loading firmware. Almost every
> server I maintain has some vendor driver which generates a "many lost
> ticks!" message on load. As long as it's only done at module load
> time it should be fine.

For anything ~10s or more, you'll probably also need to call the timer
update function to prevent soft lockup warning being generated.

> Of course the best solution is to just get the driver into mainline.
>
> Lee
--
Daniel J Blueman

2007-06-02 15:05:21

by Satyam Sharma

[permalink] [raw]
Subject: Re: Device Driver Etiquette

Hi Matthew,

On 6/1/07, Matthew Fredrickson <[email protected]> wrote:
> Greetings,
>
> I maintain a device driver that has been bitten by the transition to 4K
> stacks. It is a T1/E1 line interface PCI card driver that is
> maintained outside of the kernel, although is used by a significant
> number of people. The card has a part for doing echo cancellation, but
> it is accessed through a vendor API which (when we received it) was
> quite stack heavy. It used the stack for a number of large data
> structures, although I have moved a great deal of them (particularly
> the larger ones) onto the heap instead of the stack. Although this has
> reduced stack usage to the point where it is usable within 4K stacks,
> on some code paths, it can still use quite a bit of stack space (though
> under 4K) for local variables and a handful of small data structures.
> The problem is that in order to initialize and use the echo canceler,
> there is a firmware load portion which takes a noticeable period of
> time (~5-10 seconds). That is done through this stack heavy portion of
> the code.

As Peter suggested earlier, the best strategy would be to
cleanup the code and make it stack-light in the first place ...

> My question is this: is there a way to either work around the problem I
> am seeing with the stack without recompiling the kernel with 8K stack
> size or without disabling irqs for such a long period of time (which I
> think is not a nice thing to do either)

There are standard mechanisms to push off long duration or
sleep-capable work from interrupts-disabled contexts to user
contexts. But those contexts can also, obviously, be interrupted,
so if you're stack-heavy and suffer a concurrent interrupt (which
you most definitely will in any codepath this long), there's always
the possibility of stack overflowing. I don't really see a better
solution to this than using bigger stacks or reducing your usage
of the same, but of course, better suggestions would be difficult
to give without looking at the code in question.

> OR is it acceptable (although
> not nice) to simply fix it this way, by disabling irqs while it loads
> the firmware?

IMHO, *not*. You could try this, in any case, and learn it the
hard way :-)

Satyam

2007-06-02 15:12:58

by Matt Fredrickson

[permalink] [raw]
Subject: Re: Device Driver Etiquette


----- "Daniel J Blueman" <[email protected]> wrote:
> On 1 Jun, 19:40, "Lee Revell" <[email protected]> wrote:
> > On 6/1/07, Matthew Fredrickson <[email protected]> wrote:
> >
> > > is it acceptable (although
> > > not nice) to simply fix it this way, by disabling irqs while it
> loads
> > > the firmware?
> >
> > I would say to just disable IRQs while loading firmware. Almost
> every
> > server I maintain has some vendor driver which generates a "many
> lost
> > ticks!" message on load. As long as it's only done at module load
> > time it should be fine.
>
> For anything ~10s or more, you'll probably also need to call the
> timer
> update function to prevent soft lockup warning being generated.

Ahhh... so there is a way to get rid of that cursed message. I forgot to mention this in my original message, the only place that I had seen this problem is on a certain machine (Dell 2950) with a certain distribution (FC6) kernel. I had trimmed the code down to fit in 4K stacks already quite a bit.

I believe what actually made it crash and overflow the stack (the straw that breaks the camels back, so to speak) was the intermittent triggering of the softlockup detector. I think if I can disable that while the firmware is loading that will fix the stack overflow issues and correct my problem.

Matthew Fredrickson

2007-06-02 23:22:48

by Arjan van de Ven

[permalink] [raw]
Subject: Re: Device Driver Etiquette

On Fri, 2007-06-01 at 12:47 -0500, Matthew Fredrickson wrote:

> My question is this: is there a way to either work around the problem I
> am seeing with the stack without recompiling the kernel with 8K stack
> size or without disabling irqs for such a long period of time (which I
> think is not a nice thing to do either) OR is it acceptable (although
> not nice) to simply fix it this way, by disabling irqs while it loads
> the firmware?

I wonder if you're chasing ghosts; 4K stack kernels have a seperate
stack for interrupts so... those should be safe.

Btw, you forgot to post a pointer to the source code of your driver, so
it's a lot harder for us (read: impossible) to give you good advice..

Greetings,
Arjan van de Ven
--
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org

2007-06-03 01:45:48

by Satyam Sharma

[permalink] [raw]
Subject: Re: Device Driver Etiquette

On 6/3/07, Arjan van de Ven <[email protected]> wrote:
> On Fri, 2007-06-01 at 12:47 -0500, Matthew Fredrickson wrote:
>
> > My question is this: is there a way to either work around the problem I
> > am seeing with the stack without recompiling the kernel with 8K stack
> > size or without disabling irqs for such a long period of time (which I
> > think is not a nice thing to do either) OR is it acceptable (although
> > not nice) to simply fix it this way, by disabling irqs while it loads
> > the firmware?
>
> I wonder if you're chasing ghosts; 4K stack kernels have a seperate
> stack for interrupts so... those should be safe.

Argh, yes, thanks for reminding/correcting all of us :-)

> Btw, you forgot to post a pointer to the source code of your driver, so
> it's a lot harder for us (read: impossible) to give you good advice..

Right, and dmesg / debug output of when things go wrong is also
needed at the very least to confirm whether or not there is really a
stack overflow issue here in the first place.

Satyam

2007-06-03 06:21:01

by Matt Fredrickson

[permalink] [raw]
Subject: Re: Device Driver Etiquette


----- "Arjan van de Ven" <[email protected]> wrote:
> On Fri, 2007-06-01 at 12:47 -0500, Matthew Fredrickson wrote:
>
> > My question is this: is there a way to either work around the
> problem I
> > am seeing with the stack without recompiling the kernel with 8K
> stack
> > size or without disabling irqs for such a long period of time (which
> I
> > think is not a nice thing to do either) OR is it acceptable
> (although
> > not nice) to simply fix it this way, by disabling irqs while it
> loads
> > the firmware?
>
> I wonder if you're chasing ghosts; 4K stack kernels have a seperate
> stack for interrupts so... those should be safe.
>
> Btw, you forgot to post a pointer to the source code of your driver,
> so
> it's a lot harder for us (read: impossible) to give you good advice..

As someone mentioned in another post, I believe what is causing this problem is a combination of factors. The triggering of the softlockup detector seems to be what pushes it over the edge. I think I will try as suggested to change the timer for it so that it does not trigger while the card is initializing.

If you'd like to see the source, here's a link:
http://svn.digium.com/view/zaptel/trunk/wct4xxp/

I can't give you the stack trace right now, since I'm not in the office, but it begins in the vpm450_init function and goes into the Octasic API functions, in the ChannelOpen routines as well as the chip initialization routines (the ones that load the firmware.

---
Matthew Fredrickson
Kernel Developer
Digium, Inc.

2007-06-03 11:22:54

by Daniel J Blueman

[permalink] [raw]
Subject: Re: Device Driver Etiquette

On 02/06/07, Matt Fredrickson <[email protected]> wrote:
> ----- "Daniel J Blueman" <[email protected]> wrote:
> > On 1 Jun, 19:40, "Lee Revell" <[email protected]> wrote:
> > > On 6/1/07, Matthew Fredrickson <[email protected]> wrote:
> > >
> > > > is it acceptable (although
> > > > not nice) to simply fix it this way, by disabling irqs while it
> > loads
> > > > the firmware?
> > >
> > > I would say to just disable IRQs while loading firmware. Almost
> > every
> > > server I maintain has some vendor driver which generates a "many
> > lost
> > > ticks!" message on load. As long as it's only done at module load
> > > time it should be fine.
> >
> > For anything ~10s or more, you'll probably also need to call the
> > timer
> > update function to prevent soft lockup warning being generated.
>
> Ahhh... so there is a way to get rid of that cursed message. I forgot to mention this in my original message, the only place that I had seen this problem is on a certain machine (Dell 2950) with a certain distribution (FC6) kernel. I had trimmed the code down to fit in 4K stacks already quite a bit.
>
> I believe what actually made it crash and overflow the stack (the straw that breaks the
> camels back, so to speak) was the intermittent triggering of the softlockup detector.
> I think if I can disable that while the firmware is loading that will fix the stack overflow
> issues and correct my problem.

It may not help, since the NMIs used to detect soft lockups will still
be received; the soft lockup update call just updates the per-cpu
timer, but there may be less chance of overrunning the end of the
stack from not getting the stack trace.

There is no substitute for implementing your own firmware uploading
mechanism or getting this one addressed though ;-) .

Dan

> Matthew Fredrickson
--
Daniel J Blueman