2015-11-05 10:52:12

by Jason A. Donenfeld

Subject: GSO with udp_tunnel_xmit_skb

Hi folks,

When sending arbitrary SKBs with udp_tunnel_xmit_skb, the networking
stack does not appear to be utilizing UFO on the outgoing UDP packets,
which significantly caps the transmission speed. I see about 50% CPU
usage in this send path, triggered for every single outgoing packet.
Is there a particular skb option I need to set to enable this? I read
Tom's patch [1] from last year, but this seems to be about setting the
inner packet type. In my case, the inner type is opaque encrypted
data, so there's not a relevant setting.

Thanks,
Jason

[1] http://thread.gmane.org/gmane.linux.network/332194


2015-11-06 07:19:22

by Tom Herbert

Subject: Re: GSO with udp_tunnel_xmit_skb

On Thu, Nov 5, 2015 at 7:52 PM, Jason A. Donenfeld <[email protected]> wrote:
> Hi folks,
>
> When sending arbitrary SKBs with udp_tunnel_xmit_skb, the networking
> stack does not appear to be utilizing UFO on the outgoing UDP packets,
> which significantly caps the transmission speed. I see about 50% CPU
> usage in this send path, triggered for every single outgoing packet.
> Is there a particular skb option I need to set to enable this? I read
> Tom's patch [1] from last year, but this seems to be about setting the
> inner packet type. In my case, the inner type is opaque encrypted
> data, so there's not a relevant setting.
>
Jason,

Is this about UFO or GSO (in email subject)? UFO should operate
independently of encapsulation or inner packet setting.

Tom

> Thanks,
> Jason
>
> [1] http://thread.gmane.org/gmane.linux.network/332194

2015-11-06 11:48:21

by Jason A. Donenfeld

Subject: Re: GSO with udp_tunnel_xmit_skb

Hi Tom,

On Fri, Nov 6, 2015 at 8:19 AM, Tom Herbert <[email protected]> wrote:
> Is this about UFO or GSO (in email subject)? UFO should operate
> independently of encapsulation or inner packet setting.

I suppose this is about UFO.

Specifically -- let's say I have a list of 500 skbs, which have their
data in place but don't yet have an IP or UDP header etc. I want to
send these out using udp_tunnel_xmit_skb. Right now, if I just
send them all out, one after another, they don't seem to be getting
assembled into a super packet suitable for UFO. Instead, they're just
sent one at a time, and I get the vast majority of `perf top` CPU
usage in my ethernet card's driver and along the path to it -- the
problem that UFO is supposed to solve.

So my question is -- how can I make UFO happen with udp_tunnel_xmit_skb?

Thanks,
Jason

2015-11-07 17:19:54

by Maciej Żenczykowski

Subject: Re: GSO with udp_tunnel_xmit_skb

> I suppose this is about UFO.
>
> Specifically -- let's say I have a list of 500 skbs, which have their
> data in place but don't yet have an IP or UDP header etc. I want to
> send these out using udp_tunnel_xmit_skb. Right now, if I just
> send them all out, one after another, they don't seem to be getting
> assembled into a super packet suitable for UFO. Instead, they're just
> sent one at a time, and I get the vast majority of `perf top` CPU
> usage in my ethernet card's driver and along the path to it -- the
> problem that UFO is supposed to solve.
>
> So my question is -- how can I make UFO happen with udp_tunnel_xmit_skb?

UFO will never collapse multiple (UDP) packets.

It would be incorrect to do so, since UDP has to maintain packet
framing boundaries, and the only way to mark that on the wire is via
individual appropriately sized packets.

UFO prevents the need to do IP fragmentation on overly large
*singular* UDP packets.

The case where UFO (should) help is if you are taking a TCP TSO
segment of 10k and adding UDP headers and sending it out as a
20+8+10k UDP packet.
Without UFO this would now need to be (potentially checksummed and)
IP fragmented in software into ceil((8+10k)/(1500-20)) packets
(assuming 1500 MTU). With UFO hw offload the NIC deals with that (it
does the checksumming and it does the IP fragmentation).

Although note: in the case of UDP+TCP TSO this has reliability issues,
since a loss of a single frame will now lose the entire fragmented IP
UDP datagram and thus lose the entire TCP TSO segment,
meaning that you probably do not want to use this unless your network
is lossless (ie. loopback, veth and other virtual networks come to
mind).

I guess UDP encap of a larger than mtu UDP is probably a valid use
case for UFO, since we'd have ip fragmented anyway, and it's cheaper
to ip fragment on the outer IP header than on the inner.
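
The fragment count in the 10k example above can be checked with a few
lines of user-space C. This is only a sketch of the arithmetic, not
kernel code: the helper names are made up for illustration, "10k" is
assumed to mean 10000 bytes, and the IP header is assumed to be the
option-less 20 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Payload bytes each non-final IPv4 fragment can carry: what is left of
 * the MTU after the IP header, rounded down to a multiple of 8 because
 * fragment offsets are expressed in 8-byte units. */
static size_t frag_payload(size_t mtu, size_t ip_hdr_len)
{
    assert(mtu >= ip_hdr_len + 8);
    return (mtu - ip_hdr_len) & ~(size_t)7;
}

/* Number of IP fragments needed to carry an L4 datagram of l4_len bytes
 * (UDP header included), rounding up for the final short fragment. */
static size_t ip_fragment_count(size_t l4_len, size_t mtu, size_t ip_hdr_len)
{
    size_t per_frag = frag_payload(mtu, ip_hdr_len);

    return (l4_len + per_frag - 1) / per_frag;
}
```

With a 1500 byte MTU each fragment carries 1480 bytes, so
ip_fragment_count(8 + 10000, 1500, 20) comes out at 7 fragments, and
only the first of them contains the UDP header.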

2015-11-07 22:40:11

by Jason A. Donenfeld

Subject: Re: GSO with udp_tunnel_xmit_skb

Hi Maciej,

Thanks for your reply. Some interesting things to consider here... See
inline below.

On Sat, Nov 7, 2015 at 6:19 PM, Maciej Żenczykowski
<[email protected]> wrote:
>
> UFO will never collapse multiple (UDP) packets.
>
> It would be incorrect to do so, since UDP has to maintain packet
> framing boundaries, and the only way to mark that on the wire is via
> individual appropriately sized packets.

What I was thinking about is this: My driver receives a super-packet.
By calling skb_gso_segment(), I'm given a list of equal sized packets
(of gso_size each), except for the last one which is either the same
size or smaller than the rest. Let's say calling skb_gso_segment()
gives me a list of 1300 byte packets. Next, I do a particular
transformation to the packet. Let's say I encrypt it somehow, and I
add on some additional information. Now all those 1300 byte packets
yield new 1400 byte packets. It is time to send those 1400 byte
packets to a particular destination. Since they're all children of the
same skb_gso_segment()ified packet, they're all destined for the same
destination. So, one solution is to do this:

for each skb in list:
    udp_tunnel_xmit_skb(dst, skb);

But this does not perform how I'd like it to perform. The reason is
that now each and every one of these packets has to traverse the whole
networking stack, including various netfilter postrouting hooks and
such, but most importantly, it means the ethernet driver that's
sending the physical packet has to process each and every one.

My hope was that instead of doing the `for each` above, I could
instead do something like:

superpacket->gso_size = 1400
for each skb in list:
    add_to_superpacket_as_ufo(skb, superpacket);
udp_tunnel_xmit_skb(dst, superpacket);

And that way, the superpacket would only have to traverse the
networking stack once, leaving it either to the final ethernet driver
to send in a big chunk to the ethernet card, or to the
skb_gso_segment() call in core.c's validate_xmit_skb().

Is this conceptually okay? What you wrote would seem to indicate it
doesn't make sense conceptually, but I'm not sure.
I started to write some code to do that, which isn't really working,
and I outlined it here [1].
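
To make the sizes in this message concrete, here is a small user-space
sketch of the arithmetic only. It does not call skb_gso_segment(); the
function names are hypothetical, and the 100 byte per-packet overhead
is simply the difference between the 1300 and 1400 byte figures above.

```c
#include <assert.h>
#include <stddef.h>

/* Number of segments produced when payload_len bytes are cut into
 * pieces of at most gso_size bytes. */
static size_t seg_count(size_t payload_len, size_t gso_size)
{
    assert(gso_size > 0);
    return (payload_len + gso_size - 1) / gso_size;
}

/* Length of segment idx (0-based): every segment is gso_size bytes,
 * except the last, which holds whatever remains. */
static size_t seg_len(size_t payload_len, size_t gso_size, size_t idx)
{
    size_t n = seg_count(payload_len, gso_size);

    assert(idx < n);
    if (idx + 1 < n)
        return gso_size;
    return payload_len - (n - 1) * gso_size;
}

/* Length of segment idx after a transformation that adds a fixed
 * number of overhead bytes to each segment. */
static size_t out_len(size_t payload_len, size_t gso_size, size_t idx,
                      size_t overhead)
{
    return seg_len(payload_len, gso_size, idx) + overhead;
}
```

For a 64000 byte super-packet and a gso_size of 1300 this gives 50
segments: 49 of 1300 bytes and a final one of 300 bytes, which become
1400 and 400 bytes once 100 bytes of overhead are added to each.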


> UFO prevents the need to do IP fragmentation on overly large
> *singular* UDP packets.
>
> The case where UFO (should) help is if you are taking a TCP TSO
> segment of 10k and adding UDP headers and sending it out as a
> 20+8+10k UDP packet.
> Without UFO this would now need to be (potentially checksummed and)
> IP fragmented in software into ceil((8+10k)/(1500-20)) packets
> (assuming 1500 MTU). With UFO hw offload the NIC deals with that (it
> does the checksumming and it does the IP fragmentation).

So you mean to say UFO is mostly useful for just IP fragmentation?
Don't some NICs also generate individual UDP packets when you pass them
a big buffer of multiple pieces of data all at once?

Thanks,
Jason

[1] http://www.spinics.net/lists/netdev/msg351400.html

2015-11-07 23:40:18

by Maciej Żenczykowski

Subject: Re: GSO with udp_tunnel_xmit_skb

> What I was thinking about is this: My driver receives a super-packet.
> By calling skb_gso_segment(), I'm given a list of equal sized packets
> (of gso_size each), except for the last one which is either the same
> size or smaller than the rest. Let's say calling skb_gso_segment()
> gives me a list of 1300 byte packets.

This isn't particularly efficient. This is basically equivalent to doing
GSO before the superpacket reaches your driver (you might get some
savings by not bothering to look at the packet headers of the second
and on packets, but that's most likely minimal savings).

In particular you're allocating a new skb and clearing it for each of those
1300 byte packets (and deallocating the superpacket skb). And then you
are presumably deallocating all those freshly allocated skbs - since
I'm guessing you are creating new skbs for transmit.

What you really want to do (although of course it's much harder)
is not call skb_gso_segment() at all for packet formats you know how
to handle (ideally you can handle anything you claim to be able to
handle via the features bits) and instead reach directly into the skb
and grab the right portions of it and handle them directly. This way
you only ever have the one incoming skb, but yes it requires
considerable effort.

This should get you a fair bit of savings.

> Next, I do a particular
> transformation to the packet. Let's say I encrypt it somehow, and I
> add on some additional information. Now all those 1300 byte packets
> yield new 1400 byte packets. It is time to send those 1400 byte
> packets to a particular destination.

Are you in control of the receiver? Can you modify packet format?

> Since they're all children of the
> same skb_gso_segment()ified packet, they're all destined for the same
> destination. So, one solution is to do this:
>
> for each skb in list:
>     udp_tunnel_xmit_skb(dst, skb);
>
> But this does not perform how I'd like it to perform. The reason is
> that now each and every one of these packets has to traverse the whole
> networking stack, including various netfilter postrouting hooks and
> such, but most importantly, it means the ethernet driver that's
> sending the physical packet has to process each and every one.

Theoretically you could manually add the proper headers to each
of the new packets, and create a chain and send that, although
honestly I'm not sure if the stack is at all capable of dealing with
that atm.

Alternatively instead of sending through the stack, put on full ethernet
headers and send straight to the nic via the nic's xmit function.

> My hope was that instead of doing the `for each` above, I could
> instead do something like:
>
> superpacket->gso_size = 1400
> for each skb in list:
>     add_to_superpacket_as_ufo(skb, superpacket);
> udp_tunnel_xmit_skb(dst, superpacket);

UFO = UDP Fragmentation Offload = really meaning 'UDP transmit
checksum offload + IP fragmentation offload'

so when you send that out you get ip fragments of 1 udp packet, not
many individual udp packets.

> And that way, the superpacket would only have to traverse the
> networking stack once, leaving it either to the final ethernet driver
> to send in a big chunk to the ethernet card, or to the
> skb_gso_segment() call in core.c's validate_xmit_skb().

> Is this conceptually okay? What you wrote would seem to indicate it
> doesn't make sense conceptually, but I'm not sure.

This definitely doesn't make sense with UFO.

---

It is possible some hardware (possibly some intel nics, maybe bnx2x)
could be tricked into doing udp segmentation with their tcp segmentation
engine. Theoretically (based on having glanced at the datasheets) the
intel nic segmentation is pretty generic, and it would appear at first
glance that with the right driver hacks (populating the transmit descriptor
correctly) it could be made to work. I mention bnx2x because
they managed to make tcp segmentation work with tunnels,
so it's possible that the support is generic enough for it to be possible (with
driver changes). Who knows.

It may or may not require putting on a fake 20 byte TCP header.
There's some tunnel spec that basically does that (should be able to find
an RFC online [perhaps I'm thinking of STT - Stateless Transport Tunneling]).

I don't think there is currently any way to set up a Linux skb with the
right metadata for it to just happen though.

It does seem like something that could be potentially worth adding though.

> So you mean to say UFO is mostly useful for just IP fragmentation?
> Don't some NICs also generate individual UDP packets when you pass them
> a big buffer of multiple pieces of data all at once?

I'm not actually aware of any nics doing that. It's possible that if you
take an IP/TCP TSO superpacket and stuff an extra IP/UDP header on it,
the existing tunnel offload stuff in the kernel might make that happen
with some nics. Unsure though (as in unsure whether IP/UDP tunneling is
currently supported, I know IP/GRE is).

2015-11-08 10:37:06

by Jason A. Donenfeld

Subject: Re: GSO with udp_tunnel_xmit_skb

Hi Maciej,

On Sun, Nov 8, 2015 at 12:40 AM, Maciej Żenczykowski
<[email protected]> wrote:
> This isn't particularly efficient. This is basically equivalent to doing
> GSO before the superpacket reaches your driver (you might get some
> savings by not bothering to look at the packet headers of the second
> and on packets, but that's most likely minimal savings).

Actually, in my benchmarking, this results in enormous speedups in two places:

- In fact, I do have to examine the header of each incoming packet in
ndo_start_xmit(), and make a potentially expensive calculation on it
(due to the nature of my particular virtual driver). Having to only do
this once gets me about 100 additional megabits of bandwidth.
- Before sending the packet with udp_tunnel_xmit_skb, I can do only
one ip_route_output_flow() call, and reuse the rtable/dst_entry struct
for each send, instead of having to recompute it each time. This winds
up getting me around 400 more megabits.

> In particular you're allocating a new skb and clearing it for each of those
> 1300 byte packets (and deallocating the superpacket skb). And then you
> are presumably deallocating all those freshly allocated skbs - since
> I'm guessing you are creating new skbs for transmit.
>
> What you really want to do (although of course it's much harder)
> is not call skb_gso_segment() at all for packet formats you know how
> to handle (ideally you can handle anything you claim to be able to
> handle via the features bits) and instead reach directly into the skb
> and grab the right portions of it and handle them directly. This way
> you only ever have the one incoming skb, but yes it requires
> considerable effort.
>
> This should get you a fair bit of savings.

Yes, I agree wholeheartedly; it would be much nicer to not have to
call skb_gso_segment at all, and just be able to operate on the
superpacket directly. Unfortunately, I'm not able to do this, because
I'm not simply adding or changing a header on the packet. I'm actually
making a calculation on the full bytes of the packet, which includes
the UDP and IP headers that are only added by skb_gso_segment, and
then I'm playing with ("scrambling" in some way) all of the bytes of
the entire packet. So, I really do need to decompose it into
individual packets, unfortunately.

>
> Are you in control of the receiver? Can you modify packet format?

Yes, I am in control of the receiver. I suppose I could augment the
protocol to do this kind of reassembly. But that might conflict with
some other design goals, so I don't think that's going to happen.

>
> Theoretically you could manually add the proper headers to each
> of the new packets, and create a chain and send that, although
> honestly I'm not sure if the stack is at all capable of dealing with
> that atm.
>
> Alternatively instead of sending through the stack, put on full ethernet
> headers and send straight to the nic via the nic's xmit function.

My initial prototype did that, actually, simply because I knew how to
build an ethernet frame but I didn't know (yet) how to use the
kernel's various APIs. This wasn't viable in the end, though, because
I do need to run the packets through netfilter and the full stack.

> UFO = UDP Fragmentation Offload = really meaning 'UDP transmit
> checksum offload + IP fragmentation offload'
>
> so when you send that out you get ip fragments of 1 udp packet, not
> many individual udp packets.

Shucks, really? So UFO really only works for single UDP packets?
That's a shame. I had hoped that since all the packets are the same
size, I could set gso_size to that, and then the splitting would take
place on those boundaries precisely. But I guess since the
fragmentation here actually does IP fragmentation, this would run
counter to my goals, since new UDP headers wouldn't be added in the
end. Total bummer.

Wouldn't there be some significant savings from bundling together
several UDP packets meant for the same destination, and sending those
all as one super-packet, so they don't each have to traverse the whole
networking and netfilter stack? By asking that question, it doesn't
feel as though I've come up with a new idea; is there a reason why
that isn't implemented or why (if) it was rejected?

>
> It is possible some hardware (possibly some intel nics, maybe bnx2x)
> could be tricked into doing udp segmentation with their tcp segmentation
> engine. Theoretically (based on having glanced at the datasheets) the
> intel nic segmentation is pretty generic, and it would appear at first
> glance that with the right driver hacks (populating the transmit descriptor
> correctly) it could be made to work. I mention bnx2x because
> they managed to make tcp segmentation work with tunnels,
> so it's possible that the support is generic enough for it to be possible (with
> driver changes). Who knows.
>
> It may or may not require putting on a fake 20 byte TCP header.
> There's some tunnel spec that basically does that (should be able to find
> an RFC online [perhaps I'm thinking of STT - Stateless Transport Tunneling]).
>
> I don't think there is currently any way to set up a Linux skb with the
> right metadata for it to just happen though.
>
> It does seem like something that could be potentially worth adding though.

These are glorious dirty tricks. Awesome. It appears, NIC-wise, that
only the neterion driver currently supports UFO natively. I wonder if
the Intel folks will add it to their drivers, since the segmentation
is generic as you said.

Still, though, regardless of NIC support, using superpackets to reduce
the number of skbs that have to traverse the networking stack appears
to be worthwhile. It'd be nice to make this happen for clusters of UDP
packets.


I'm adding to the CC Herbert Xu, who mentioned in another thread:

> I don't see anything fundamentally wrong with your idea. After
> all what you're describing is the basis of GSO, i.e., letting
> data stay in the form of super-packets for as long as we can.
>
> Of course there's going to be a lot of niggly bits that you'll
> have to sort out to get it to work.

So I wonder if he has any ideas about this too.


Anyway, thanks so much for your insight about this. I really
appreciate the pointers.

Regards,
Jason

2015-11-08 10:57:48

by Herbert Xu

Subject: Re: GSO with udp_tunnel_xmit_skb

On Sun, Nov 08, 2015 at 11:36:53AM +0100, Jason A. Donenfeld wrote:
>
> Wouldn't there be some significant savings from bundling together
> several UDP packets meant for the same destination, and sending those
> all as one super-packet, so they don't each have to traverse the whole
> networking and netfilter stack? By asking that question, it doesn't
> feel as though I've come up with a new idea; is there a reason why
> that isn't implemented or why (if) it was rejected?

UDP carries no ordering information so this doesn't work.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2015-11-08 14:57:36

by Jason A. Donenfeld

Subject: Re: GSO with udp_tunnel_xmit_skb

On Sun, Nov 8, 2015 at 11:57 AM, Herbert Xu <[email protected]> wrote:
> UDP carries no ordering information so this doesn't work.

But if there's no ordering information, what's the problem? Isn't it
good enough to send the packets in the order they were sendto()d? Or
in any order at all?

2015-11-09 01:41:01

by Herbert Xu

Subject: Re: GSO with udp_tunnel_xmit_skb

On Sun, Nov 08, 2015 at 03:57:24PM +0100, Jason A. Donenfeld wrote:
> On Sun, Nov 8, 2015 at 11:57 AM, Herbert Xu <[email protected]> wrote:
> > UDP carries no ordering information so this doesn't work.
>
> But if there's no ordering information, what's the problem? Isn't it
> good enough to send the packets in the order they were sendto()d? Or
> in any order at all?

You're right. I don't think the ordering matters.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2015-11-09 02:23:30

by Jason A. Donenfeld

Subject: Re: GSO with udp_tunnel_xmit_skb

On Mon, Nov 9, 2015 at 2:40 AM, Herbert Xu <[email protected]> wrote:
> You're right. I don't think the ordering matters.

Cool, so we're on the same page then.

In that case, any ideas about constructing UDP super-packets for GSO?
As Maciej pointed out, UFO is actually just IP fragmentation and UDP
checksums, but doesn't actually add on new UDP headers as we'd wish.
But I was digging a bit deeper and I found this gem:
skb_udp_tunnel_segment. This is the segmentation function called for
SKB_GSO_UDP_TUNNEL. For this, it does something sort of neat.

If the skb's inner_protocol_type is ENCAP_TYPE_IPPROTO, then it looks
up a "gso_inner_segment" function based on the skb->inner_ipproto
field. The lookup happens on a list of functions (called
inet6_offloads and inet_offloads) that has some nice setter/getter
methods for various modules to call.

Once it figures out which gso_inner_segment to use, it calls
__skb_udp_tunnel_segment with it, which then does some curious header
calculations on various lengths (that I need to read carefully), and
then proceeds to split the segments using our gso_inner_segment
function of choice, and then adds the length and checksum header
fields. Unfortunately, it doesn't add the UDP source and destination
port header fields. That means I might as well be building the UDP
headers ahead of time myself, which is a bit of a bummer.

Anyway, the idea would be to [ab]use SKB_GSO_UDP_TUNNEL with a
scintillating gso_inner_segment function for a custom inner_ipproto
field, in order to make a superpacket.

How's this looking as a strategy (and an outline of the "niggly bits"
as you put it)?

Jason
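
The lookup-then-dispatch pattern described in this message can be
modelled in plain user-space C. To be clear about what is assumed:
this is a toy model, not the kernel's data structures or API. The
table, the callback type and every function name below are invented
for illustration, and the callback only counts segments instead of
splitting an skb. Protocol number 253 is used in the example because
it is one of the values IANA reserves for experimentation.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PROTO 256

/* Hypothetical stand-in for a per-protocol segmentation callback: given
 * a total length and a segment size, return the number of segments. */
typedef size_t (*inner_segment_fn)(size_t total_len, size_t seg_size);

/* One slot per IP protocol number; NULL means nothing is registered. */
static inner_segment_fn offload_table[MAX_PROTO];

/* Register a callback for a protocol number; -1 if the slot is taken. */
static int add_offload(unsigned char proto, inner_segment_fn fn)
{
    if (offload_table[proto] != NULL)
        return -1;
    offload_table[proto] = fn;
    return 0;
}

/* Example callback: split into equal pieces, rounding up for the tail. */
static size_t equal_split(size_t total_len, size_t seg_size)
{
    assert(seg_size > 0);
    return (total_len + seg_size - 1) / seg_size;
}

/* Model of the tunnel path: choose the callback by inner protocol and
 * let it do the splitting. Returns 0 when no callback is registered. */
static size_t tunnel_segment(unsigned char inner_proto, size_t total_len,
                             size_t seg_size)
{
    inner_segment_fn fn = offload_table[inner_proto];

    if (fn == NULL || seg_size == 0)
        return 0;
    return fn(total_len, seg_size);
}
```

After add_offload(253, equal_split), a call such as
tunnel_segment(253, 14000, 1400) dispatches to the registered callback
and reports 10 segments; before registration it reports 0.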

2015-11-09 03:18:14

by Maciej Żenczykowski

Subject: Re: GSO with udp_tunnel_xmit_skb

> Once it figures out which gso_inner_segment to use, it calls
> __skb_udp_tunnel_segment with it, which then does some curious header
> calculations on various lengths (that I need to read carefully), and
> then proceeds to split the segments using our gso_inner_segment
> function of choice, and then adds the length and checksum header
> fields. Unfortunately, it doesn't add the UDP source and destination
> port header fields. That means I might as well be building the UDP
> headers ahead of time myself, which is a bit of a bummer.

I'm guessing the udp src dst port (and ??possibly?? optional gue
headers) are meant to be part of the external headers that are already
pre-populated.

> Anyway, the idea would be to [ab]use SKB_GSO_UDP_TUNNEL with a
> scintillating gso_inner_segment function for a custom inner_ipproto
> field, in order to make a superpacket.

That's probably basically what that was designed for. So doesn't seem
like an abuse.

Tunnel GSO offloads are still very very fresh and actively being
worked on (by Tom and Eric among others).
I'm afraid my knowledge of them at HEAD is very limited.
I've only recently started experimenting in this area myself.

> How's this looking as a strategy (and an outline of the "niggly bits"
> as you put it)?

Looks fine. Devil is in the details. You may discover that the stack
is still missing some things you'll need to add in.

(for example, personally I'm trying to understand if CHECKSUM_PARTIAL
shouldn't carry an extra bit of information specifying whether
we need a TCP or UDP style checksum, since they differ in how a
checksum of 0 is transmitted. It appears this causes nic drivers
to need to redigest the packet to figure it out before they can pass
it on to the hardware.)

- Maciej
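
The TCP/UDP difference mentioned in the last paragraph can be shown in
a few lines of user-space C. This is a sketch with made-up helper
names, not driver or kernel code, and it checksums a bare byte buffer
without the pseudo-header that real TCP and UDP checksums also cover.
The rule itself is from the UDP specification: over IPv4 a checksum
field of 0 means "no checksum", so a computed value of 0 is sent as
0xFFFF, whereas TCP sends a computed 0 unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 16-bit ones' complement Internet checksum over a byte buffer.
 * The 32-bit accumulator is ample for buffers up to 64 KiB. */
static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    assert(data != NULL || len == 0);
    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    if (len & 1)                      /* odd trailing byte, zero padded */
        sum += (uint32_t)data[len - 1] << 8;
    while (sum >> 16)                 /* fold the carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* UDP: a computed checksum of 0 goes on the wire as all ones. */
static uint16_t udp_wire_checksum(uint16_t csum)
{
    return csum == 0 ? 0xffff : csum;
}

/* TCP: the computed checksum is transmitted as is, including 0. */
static uint16_t tcp_wire_checksum(uint16_t csum)
{
    return csum;
}
```

A buffer of two 0xFF bytes sums to 0xFFFF, so inet_checksum() returns
0 for it; udp_wire_checksum() turns that into 0xFFFF while
tcp_wire_checksum() leaves it at 0, which is the one case where the
two styles diverge.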