2015-11-16 19:52:37

by Jason A. Donenfeld

Subject: irq_fpu_usable() is false in ndo_start_xmit() for UDP packets

Hi David & Folks,

I have a virtual device driver that does some fancy processing of
packets in ndo_start_xmit before forwarding them onward out of a
tunnel elsewhere. In order to make that fancy processing fast, I have
AVX and AVX2 implementations. This means I need to use the FPU.

So, I do the usual pattern found throughout the kernel:

if (!irq_fpu_usable())
        generic_c(...);
else {
        kernel_fpu_begin();
        optimized_avx(...);
        kernel_fpu_end();
}

This works fine with, say, iperf3 in TCP mode. The AVX performance is
great. However, when using iperf3 in UDP mode, irq_fpu_usable() is
mostly false! I added a dump_stack() call to see why, except nothing
looks strange; the initial call in the stack trace is
entry_SYSCALL_64_fastpath. Why would irq_fpu_usable() return false
when we're in a syscall? Doesn't that mean this is in process context?

So, I find this a bit disturbing. If anybody has an explanation, and a
way to work around it, I'd be quite happy. Or, simply if there is a
debugging technique you'd recommend, I'd be happy to try something and
report back.

Thanks,
Jason


2015-11-16 20:32:19

by David Miller

Subject: Re: irq_fpu_usable() is false in ndo_start_xmit() for UDP packets

From: "Jason A. Donenfeld" <[email protected]>
Date: Mon, 16 Nov 2015 20:52:28 +0100

> This works fine with, say, iperf3 in TCP mode. The AVX performance
> is great. However, when using iperf3 in UDP mode, irq_fpu_usable()
> is mostly false! I added a dump_stack() call to see why, except
> nothing looks strange; the initial call in the stack trace is
> entry_SYSCALL_64_fastpath. Why would irq_fpu_usable() return false
> when we're in a syscall? Doesn't that mean this is in process
> context?

Network device driver transmit executes with software interrupts
disabled.

Therefore on x86, you cannot use the FPU.

2015-11-16 20:58:55

by Jason A. Donenfeld

Subject: Re: irq_fpu_usable() is false in ndo_start_xmit() for UDP packets

Hi David,

On Mon, Nov 16, 2015 at 9:32 PM, David Miller <[email protected]> wrote:
> Network device driver transmit executes with software interrupts
> disabled.
>
> Therefore on x86, you cannot use the FPU.

That is extremely problematic for me. Is there a way to make this not
so? A driver flag that would allow this?

Also, how come irq_fpu_usable() is true when using TCP but not
when using UDP?

Further, irq_fpu_usable() doesn't only check for interrupts. There are
two other conditions that allow the FPU's usage, from
arch/x86/kernel/fpu/core.c:

bool irq_fpu_usable(void)
{
        return !in_interrupt() ||
                interrupted_user_mode() ||
                interrupted_kernel_fpu_idle();
}

2015-11-16 21:17:51

by David Miller

Subject: Re: irq_fpu_usable() is false in ndo_start_xmit() for UDP packets

From: "Jason A. Donenfeld" <[email protected]>
Date: Mon, 16 Nov 2015 21:58:49 +0100

> That is extremely problematic for me. Is there a way to make this
> not so?

Not without a complete redesign of the x86 fpu save/restore mechanism.

The driver is the wrong place to do software cryptographic transforms
anyways.

Judging from your other emails, you're doing a lot of weird shit in your
driver.

Maybe you should just tell us exactly what kind of device it is for,
exactly what the features and offloads are, and then maybe we can tell
you what kind of facilities would match that situation best.

You're currently trying to do it the other way, you know everything
about your device and goals, and you're sending us small piecemeal
questions. We lack the high level full picture of your device, so
it's hard for us to give you good answers.


2015-11-16 21:28:59

by Jason A. Donenfeld

Subject: Re: irq_fpu_usable() is false in ndo_start_xmit() for UDP packets

On Mon, Nov 16, 2015 at 10:17 PM, David Miller <[email protected]> wrote:
>
> Not without a complete redesign of the x86 fpu save/restore mechanism.

Urg, okay. I still wonder why irq_fpu_usable() is true when using TCP but not
when using UDP... Any ideas on this?

> The driver is the wrong place to do software cryptographic transforms
> anyways.
> Judging from your other emails, you're doing a lot of weird shit in your
> driver.
> Maybe you should just tell us exactly what kind of device it is for,
> exactly what the features and offloads are, and then maybe we can tell
> you what kind of facilities would match that situation best.
> You're currently trying to do it the other way, you know everything
> about your device and goals, and you're sending us small piecemeal
> questions. We lack the high level full picture of your device, so
> it's hard for us to give you good answers.

Yes, this is fair to ask. Here it goes:

I'm making a simpler replacement for IPSec that operates as a device
driver on the interface level, rather than the IPSec xfrm method. The
methodology is going to be controversial, so I'm taking my time
perfecting each component, and then I'm planning on writing in with a
big email explaining why, with justifications, and numbers. It has
some real world benefits that are already quantifiable. If you're
curious, I did a talk on it at Kernel Recipes in Paris [1]. But
please, give me some time to finish things and prepare myself. I want
to present it to you in the best way possible. I'd hate for it to be
dismissed too early or too hastily, before it's had its chance.

[1] https://www.youtube.com/watch?v=9Rk4doELmwM

2015-11-16 21:33:32

by David Miller

Subject: Re: irq_fpu_usable() is false in ndo_start_xmit() for UDP packets

From: "Jason A. Donenfeld" <[email protected]>
Date: Mon, 16 Nov 2015 22:28:52 +0100

> I'm making a simpler replacement for IPSec that operates as a
> device driver on the interface level

Someone already did a Linux implementation of exactly that two decades
ago, we rejected that design and did what we did.

2015-11-16 21:37:53

by Jason A. Donenfeld

Subject: Re: irq_fpu_usable() is false in ndo_start_xmit() for UDP packets

On Mon, Nov 16, 2015 at 10:33 PM, David Miller <[email protected]> wrote:
> Someone already did a Linux implementation of exactly that two decades
> ago, we rejected that design and did what we did.

It's not exactly IPSec though, so don't worry. It's a totally new
thing that combines a lot of different recent ideas, in the virtual
networking arena as well as in crypto. *It's going to be cool.* And
I'm determined to please you with the design and implementation in the
end. So let me finish off my first implementation, and then we can do
some back and forth on the overall high level ideas and iron out some
of the potential implementation issues that you find. I'll incorporate
your feedback iteratively until it's in a state that you like. I've
been working on this day and night for months, and I'm *almost* there.
Bear with me a little longer, and expect a nice "[RFC]" email to be
sent your way soon.

2015-11-16 22:27:13

by Hannes Frederic Sowa

Subject: Re: irq_fpu_usable() is false in ndo_start_xmit() for UDP packets

Hi Jason,

On Mon, Nov 16, 2015, at 21:58, Jason A. Donenfeld wrote:
> Hi David,
>
> On Mon, Nov 16, 2015 at 9:32 PM, David Miller <[email protected]>
> wrote:
> > Network device driver transmit executes with software interrupts
> > disabled.
> >
> > Therefore on x86, you cannot use the FPU.
>
> That is extremely problematic for me. Is there a way to make this not
> so? A driver flag that would allow this?
>
> Also, how come irq_fpu_usable() is true when using TCP but not
> when using UDP?
>
> Further, irq_fpu_usable() doesn't only check for interrupts. There are
> two other conditions that allow the FPU's usage, from
> arch/x86/kernel/fpu/core.c:
>
> bool irq_fpu_usable(void)
> {
>         return !in_interrupt() ||
>                 interrupted_user_mode() ||
>                 interrupted_kernel_fpu_idle();
> }

Use the irqsoff tracer to find the problematic functions which disable
interrupts and try to work around it in case of UDP. This could benefit
the whole stack.
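For reference, the irqsoff tracer Hannes mentions is driven through
tracefs. A rough sketch of a session (assumes CONFIG_IRQSOFF_TRACER=y,
root, and the standard debugfs mount point; adjust paths to taste):

```shell
cd /sys/kernel/debug/tracing

echo 0 > tracing_on
echo irqsoff > current_tracer
echo 0 > tracing_max_latency    # reset the recorded worst case
echo 1 > tracing_on

# ... reproduce the workload, e.g. iperf3 in UDP mode ...

echo 0 > tracing_on
cat trace                       # shows the longest irqs-off section
```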

Bye,
Hannes

2015-11-16 23:57:52

by Jason A. Donenfeld

Subject: Re: irq_fpu_usable() is false in ndo_start_xmit() for UDP packets

Hi Hannes,

Thanks for your response.

On Mon, Nov 16, 2015 at 11:27 PM, Hannes Frederic Sowa
<[email protected]> wrote:
> Use the irqsoff tracer to find the problematic functions which disable
> interrupts and try to work around it in case of UDP. This could benefit
> the whole stack.

I didn't know about the irqsoff tracer. This looks very useful.
Unfortunately, it turns out that David was right: in_interrupt() is
always true, anyway, when ndo_start_xmit is called. This means, based
on this function:

bool irq_fpu_usable(void)
{
        return !in_interrupt() ||
                interrupted_user_mode() ||
                interrupted_kernel_fpu_idle();
}


1. irq_fpu_usable() is true for TCP. Since in_interrupt() is always
true in ndo_start_xmit, this means that in this case, we're lucky and
either interrupted_user_mode() is true, or
interrupted_kernel_fpu_idle() is true.

2. irq_fpu_usable() is FALSE for UDP! Since in_interrupt() is always
true in ndo_start_xmit, this means that in this case, both
interrupted_user_mode() and interrupted_kernel_fpu_idle() are false!

I now need to determine why precisely these are false in that case. Is
there other UDP code that's somehow making use of the FPU? Some
strange accelerated CRC32 perhaps? Or is there a weird situation
happening in which user mode isn't being interrupted? I suspect not,
since the trace always shows a syscall as the entry point.

Investigating further, will report back.

Jason

2015-11-17 00:04:16

by Jason A. Donenfeld

Subject: Re: irq_fpu_usable() is false in ndo_start_xmit() for UDP packets

On Tue, Nov 17, 2015 at 12:57 AM, Jason A. Donenfeld <[email protected]> wrote:
> 2. irq_fpu_usable() is FALSE for UDP! Since in_interrupt() is always
> true in ndo_start_xmit, this means that in this case, both
> interrupted_user_mode() and interrupted_kernel_fpu_idle() are false!
> Investigating further, will report back.

GASP, the plot thickens.

It turns out that when it is working, for TCP, interrupted_user_mode()
is false. This means that the only reason it's succeeding for TCP is
because interrupted_kernel_fpu_idle() is true. Therefore, for some
reason for UDP, interrupted_kernel_fpu_idle() is false!

That function is defined as:

static bool interrupted_kernel_fpu_idle(void)
{
        if (kernel_fpu_disabled())
                return false;

        if (use_eager_fpu())
                return true;

        return !current->thread.fpu.fpregs_active && (read_cr0() & X86_CR0_TS);
}

So now the big question is: what in the UDP pipeline is using the FPU?
And why is that usage not being released by the time it gets to
ndo_start_xmit? Or, alternatively, why would X86_CR0_TS be unset in
the UDP path? Is it possible this is related to UFO?
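One way to answer "what in the UDP pipeline is using the FPU" is to put
the function tracer on kernel_fpu_begin and record a stack trace for
every hit. An illustrative tracefs recipe (not from the thread; assumes
root and CONFIG_FUNCTION_TRACER=y):

```shell
cd /sys/kernel/debug/tracing

echo kernel_fpu_begin > set_ftrace_filter
echo 1 > options/func_stack_trace
echo function > current_tracer

# ... run iperf3 in UDP mode ...

cat trace                       # each hit shows who called kernel_fpu_begin

# clean up
echo nop > current_tracer
echo 0 > options/func_stack_trace
echo > set_ftrace_filter
```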

2015-11-17 12:39:44

by David Laight

Subject: RE: irq_fpu_usable() is false in ndo_start_xmit() for UDP packets

From: David Miller
> Sent: 16 November 2015 20:32
> From: "Jason A. Donenfeld" <[email protected]>
> Date: Mon, 16 Nov 2015 20:52:28 +0100
>
> > This works fine with, say, iperf3 in TCP mode. The AVX performance
> > is great. However, when using iperf3 in UDP mode, irq_fpu_usable()
> > is mostly false! I added a dump_stack() call to see why, except
> > nothing looks strange; the initial call in the stack trace is
> > entry_SYSCALL_64_fastpath. Why would irq_fpu_usable() return false
> > when we're in a syscall? Doesn't that mean this is in process
> > context?
>
> Network device driver transmit executes with software interrupts
> disabled.
>
> Therefore on x86, you cannot use the FPU.

I had some thoughts about driver access to AVX instructions when
I was adding AVX support to NetBSD.

The fpu state is almost certainly 'lazy switched'; this means that
the fpu registers can contain data for a process that is currently
running on a different cpu.
At any time the other cpu might request (by IPI) that they be flushed
to the process's data area so that it can reload them.
Kernel code could request them be flushed, but that can only happen once.
If a nested function requires them it would have to supply a local
save area. But the full save area is too big to go on stack.
Not only that, the save and restore instructions are slow.

It is also worth noting that all the AVX registers are caller-saved
(call-clobbered). This means that the syscall entry need not preserve
them; instead it can mark that they will be 'restored as zero'.
However this isn't true of any other kernel entry points.

Back to my thoughts...

Kernel code is likely to want to use special SSE/AVX instructions (eg
the crc and crypto ones) rather than generic FP calculations.
As such, just saving two or three AVX registers would suffice.
This could be done using a small on-stack save structure that
can be referenced from the process's save area so that any IPI
can copy over the correct values after saving the full state.

This would allow kernel code (including interrupts) to execute
some AVX instructions without having to save the entire cpu
extended state anywhere.

I suspect there is a big hole in the above though...

David