Hi folks,
I do some ECC crypto in a kthread. A fast 32bit implementation usually
uses around 2k - 3k bytes of stack. Since kernel threads get 8k, I
figured this would be okay. And for the most part, it is. However,
everything falls apart on architectures like MIPS, which do not use a
separate irq stack.
>From what I can tell, on MIPS, the irq handler uses whichever stack
was in current at the time of interruption. At the end of the irq
handler, softirqs trigger if preemption hasn't been disabled. When the
softirq handler runs, it will still use the same interrupted stack. So
let's take some pathological case of huge softirq stack usage: wifi
driver inside of l2tp inside of gre inside of ppp. Now, my ECC crypto
is humming along happily in its kthread, when all of the sudden, a
wifi packet arrives, triggering the interrupt. Or, perhaps instead,
TCP sends an ack packet on softirq, using my kthread's stack. The
interrupt is serviced, and at the end of it, softirq is serviced,
using my kthread's stack, which was already half full. When this
softirq is serviced, it goes through our 4 layers of network device
drivers. Since we started with a half full stack, we very quickly blow
it up, and everything explodes, and users write angry mailing list
posts.
It seems to me x86, ARM, SPARC, SH, ParisC, PPC, Metag, and UML all
concluded that letting the interrupt handler use current's stack was a
terrible idea, and instead have a per-cpu irq stack that gets used
when the handler is service. Whew!
But for the remaining platforms, such as MIPS, this is still a
problem. In an effort to work around this in my code, rather than
having to invoke kmalloc for what should be stack-based variables, I
was thinking I'd just disable preemption for those functions that use
a lot of stack, so that stack-hungry softirq handlers don't crush it.
This is generally unsatisfactory, so I don't want to do this
unconditionally. Instead, I'd like to do some cludge such as:
#ifndef CONFIG_HAVE_SEPARATE_IRQ_STACK
preempt_disable();
#endif
some_func_that_uses_lots_of_stack();
#ifndef CONFIG_HAVE_SEPARATE_IRQ_STACK
preempt_enable();
#endif
However, for this to work, I actual need that config variable. Would
you accept a patch that adds this config variable to the relavent
platforms? If not, do you have a better solution for me (which doesn't
involve using kmalloc or choosing a different crypto primitive)?
Thanks,
Jason
On Wed, 9 Nov 2016, Jason A. Donenfeld wrote:
> But for the remaining platforms, such as MIPS, this is still a
> problem. In an effort to work around this in my code, rather than
> having to invoke kmalloc for what should be stack-based variables, I
> was thinking I'd just disable preemption for those functions that use
> a lot of stack, so that stack-hungry softirq handlers don't crush it.
> This is generally unsatisfactory, so I don't want to do this
> unconditionally. Instead, I'd like to do some cludge such as:
>
> #ifndef CONFIG_HAVE_SEPARATE_IRQ_STACK
> preempt_disable();
That preempt_disable() prevents merily preemption as the name says, but it
wont prevent softirq handlers from running on return from interrupt. So
what's the point?
> However, for this to work, I actual need that config variable. Would
> you accept a patch that adds this config variable to the relavent
> platforms?
It might have been a good idea, to cc all relevant arch maintainers on
that ...
> If not, do you have a better solution for me (which doesn't
> involve using kmalloc or choosing a different crypto primitive)?
What's wrong with using kmalloc?
Thanks,
tglx
Hey Thomas,
On Wed, Nov 9, 2016 at 10:40 PM, Thomas Gleixner <[email protected]> wrote:
> That preempt_disable() prevents merily preemption as the name says, but it
> wont prevent softirq handlers from running on return from interrupt. So
> what's the point?
Oh, interesting. Okay, then in that case the proposed define wouldn't
be useful for my purposes. What clever tricks do I have at my
disposal, then?
>> If not, do you have a better solution for me (which doesn't
>> involve using kmalloc or choosing a different crypto primitive)?
>
> What's wrong with using kmalloc?
It's cumbersome and potentially slow. This is crypto code, where speed
matters a lot. Avoiding allocations is usually the lowest hanging
fruit among optimizations. To give you some idea, here's a somewhat
horrible solution using kmalloc I hacked together: [1]. I'm not to
happy with what it looks like, code-wise, and there's around a 16%
slowdown, which isn't nice either.
[1] https://git.zx2c4.com/WireGuard/commit/?h=jd/curve25519-kmalloc
On Thu, Nov 10, 2016 at 1:17 AM, David Daney <[email protected]> wrote:
> Easiest thing to do would be to select 16K page size in your .config, I
> think that will give you a similar sized stack.
I didn't realize that was possible...
I'm mostly concerned about the best way to deal with systems that have
a limited stack size on architectures without support for separate irq
stacks. Part of this I assume involves actually detecting with a
processor definition that the current architecture has a deceptively
small stack.
On 11/09/2016 01:27 PM, Jason A. Donenfeld wrote:
> Hi folks,
>
> I do some ECC crypto in a kthread. A fast 32bit implementation usually
> uses around 2k - 3k bytes of stack. Since kernel threads get 8k, I
> figured this would be okay. And for the most part, it is. However,
> everything falls apart on architectures like MIPS, which do not use a
> separate irq stack.
Easiest thing to do would be to select 16K page size in your .config, I
think that will give you a similar sized stack.
>
>>From what I can tell, on MIPS, the irq handler uses whichever stack
> was in current at the time of interruption. At the end of the irq
> handler, softirqs trigger if preemption hasn't been disabled. When the
> softirq handler runs, it will still use the same interrupted stack. So
> let's take some pathological case of huge softirq stack usage: wifi
> driver inside of l2tp inside of gre inside of ppp. Now, my ECC crypto
> is humming along happily in its kthread, when all of the sudden, a
> wifi packet arrives, triggering the interrupt. Or, perhaps instead,
> TCP sends an ack packet on softirq, using my kthread's stack. The
> interrupt is serviced, and at the end of it, softirq is serviced,
> using my kthread's stack, which was already half full. When this
> softirq is serviced, it goes through our 4 layers of network device
> drivers. Since we started with a half full stack, we very quickly blow
> it up, and everything explodes, and users write angry mailing list
> posts.
>
> It seems to me x86, ARM, SPARC, SH, ParisC, PPC, Metag, and UML all
> concluded that letting the interrupt handler use current's stack was a
> terrible idea, and instead have a per-cpu irq stack that gets used
> when the handler is service. Whew!
>
> But for the remaining platforms, such as MIPS, this is still a
> problem. In an effort to work around this in my code, rather than
> having to invoke kmalloc for what should be stack-based variables, I
> was thinking I'd just disable preemption for those functions that use
> a lot of stack, so that stack-hungry softirq handlers don't crush it.
> This is generally unsatisfactory, so I don't want to do this
> unconditionally. Instead, I'd like to do some cludge such as:
>
> #ifndef CONFIG_HAVE_SEPARATE_IRQ_STACK
> preempt_disable();
> #endif
>
> some_func_that_uses_lots_of_stack();
>
> #ifndef CONFIG_HAVE_SEPARATE_IRQ_STACK
> preempt_enable();
> #endif
>
> However, for this to work, I actual need that config variable. Would
> you accept a patch that adds this config variable to the relavent
> platforms? If not, do you have a better solution for me (which doesn't
> involve using kmalloc or choosing a different crypto primitive)?
>
> Thanks,
> Jason
>
>
On Thu, 10 Nov 2016, Jason A. Donenfeld wrote:
> Hey Thomas,
>
> On Wed, Nov 9, 2016 at 10:40 PM, Thomas Gleixner <[email protected]> wrote:
> > That preempt_disable() prevents merily preemption as the name says, but it
> > wont prevent softirq handlers from running on return from interrupt. So
> > what's the point?
>
> Oh, interesting. Okay, then in that case the proposed define wouldn't
> be useful for my purposes.
If you want to go with that config, then you need
local_bh_disable()/enable() to fend softirqs off, which disables also
preemption.
> What clever tricks do I have at my disposal, then?
Make MIPS use interrupt stacks.
> >> If not, do you have a better solution for me (which doesn't
> >> involve using kmalloc or choosing a different crypto primitive)?
> >
> > What's wrong with using kmalloc?
>
> It's cumbersome and potentially slow. This is crypto code, where speed
> matters a lot. Avoiding allocations is usually the lowest hanging
> fruit among optimizations. To give you some idea, here's a somewhat
> horrible solution using kmalloc I hacked together: [1]. I'm not to
> happy with what it looks like, code-wise, and there's around a 16%
> slowdown, which isn't nice either.
Does the slowdown come from the kmalloc overhead or mostly from the less
efficient code?
If it's mainly kmalloc, then you can preallocate the buffer once for the
kthread you're running in and be done with it. If it's the code, then bad
luck.
Thanks,
tglx
On Thu, Nov 10, 2016 at 10:03 AM, Thomas Gleixner <[email protected]> wrote:
> If you want to go with that config, then you need
> local_bh_disable()/enable() to fend softirqs off, which disables also
> preemption.
Thanks. Indeed this is what I want.
>
>> What clever tricks do I have at my disposal, then?
>
> Make MIPS use interrupt stacks.
Yea, maybe I'll just implement this. It clearly is the most correct solution.
@MIPS maintainers: would you merge something like this if done well?
Are there reasons other than man-power why it isn't currently that
way?
> Does the slowdown come from the kmalloc overhead or mostly from the less
> efficient code?
>
> If it's mainly kmalloc, then you can preallocate the buffer once for the
> kthread you're running in and be done with it. If it's the code, then bad
> luck.
I fear both. GCC can optimize stack variables in ways that it cannot
optimize various memory reads and writes.
Strangely, the solution that appeals to me most at the moment is to
kmalloc (or vmalloc?) a new stack, copy over thread_info, and fiddle
with the stack registers. I don't see any APIs, however, for a
platform independent way of doing this. And maybe this is a horrible
idea. But at least it'd allow me to keep my stack-based code the
same...
On Thu, 10 Nov 2016, Jason A. Donenfeld wrote:
> On Thu, Nov 10, 2016 at 10:03 AM, Thomas Gleixner <[email protected]> wrote:
> > Does the slowdown come from the kmalloc overhead or mostly from the less
> > efficient code?
> >
> > If it's mainly kmalloc, then you can preallocate the buffer once for the
> > kthread you're running in and be done with it. If it's the code, then bad
> > luck.
>
> I fear both. GCC can optimize stack variables in ways that it cannot
> optimize various memory reads and writes.
The question is how much of it is code and how much of it is the kmalloc.
> Strangely, the solution that appeals to me most at the moment is to
> kmalloc (or vmalloc?) a new stack, copy over thread_info, and fiddle
> with the stack registers. I don't see any APIs, however, for a
> platform independent way of doing this. And maybe this is a horrible
> idea. But at least it'd allow me to keep my stack-based code the
> same...
Do not even think about going there. That's going to be a major
mess.
As a short time workaround you can increase THREAD_SIZE_ORDER for now and
then fix it proper with switching to seperate irq stacks.
Thanks,
tglx
Hi Jason,
On 10/11/16 11:41, Jason A. Donenfeld wrote:
> On Thu, Nov 10, 2016 at 10:03 AM, Thomas Gleixner <[email protected]> wrote:
>> If you want to go with that config, then you need
>> local_bh_disable()/enable() to fend softirqs off, which disables also
>> preemption.
> Thanks. Indeed this is what I want.
>
>>> What clever tricks do I have at my disposal, then?
>> Make MIPS use interrupt stacks.
> Yea, maybe I'll just implement this. It clearly is the most correct solution.
> @MIPS maintainers: would you merge something like this if done well?
> Are there reasons other than man-power why it isn't currently that
> way?
I don't see a reason not to do this - I'm taking a look into it.
Thanks,
Matt
>> Does the slowdown come from the kmalloc overhead or mostly from the less
>> efficient code?
>>
>> If it's mainly kmalloc, then you can preallocate the buffer once for the
>> kthread you're running in and be done with it. If it's the code, then bad
>> luck.
> I fear both. GCC can optimize stack variables in ways that it cannot
> optimize various memory reads and writes.
>
> Strangely, the solution that appeals to me most at the moment is to
> kmalloc (or vmalloc?) a new stack, copy over thread_info, and fiddle
> with the stack registers. I don't see any APIs, however, for a
> platform independent way of doing this. And maybe this is a horrible
> idea. But at least it'd allow me to keep my stack-based code the
> same...
>
Hi Matt,
On Thu, Nov 10, 2016 at 5:36 PM, Matt Redfearn <[email protected]> wrote:
>
> I don't see a reason not to do this - I'm taking a look into it.
Great thanks! This is good to hear. If you go into the arch/ directory
and simply grep for "irq_stack", you can pretty easily base your
implementation on a variety of other architectures' implementations.
Jason
Hi Thomas,
On Thu, Nov 10, 2016 at 2:00 PM, Thomas Gleixner <[email protected]> wrote:
> Do not even think about going there. That's going to be a major
> mess.
Lol! Okay. Thank you for reigning in my clearly reckless
propensities... Sometimes playing in traffic is awfully tempting.
>
> As a short time workaround you can increase THREAD_SIZE_ORDER for now and
> then fix it proper with switching to seperate irq stacks.
Okay. I think in the end I'll kmalloc, accept the 16% slowdown [1],
and focus efforts on having a separate IRQ stack. Matt emailed in this
thread saying he was already looking into it, so I think by the time
that slowdown makes a difference, we'll have the right pieces in place
anyway.
Thanks for the guidance here.
Regards,
Jason
[1] https://git.zx2c4.com/WireGuard/commit/?id=cc3d7df096a88cdf96d016bdcb2f78fa03abb6f3