2004-11-23 20:46:14

by Andi Kleen

Subject: Re: Linux 2.6.9 pktgen module causes INIT process respawning and sickness

"Jeff V. Merkey" <[email protected]> writes:
> I can sustain full line rate gigabit on two adapters at the same time
> with a 12 CLK interpacket gap time and 0 dropped packets at 64
> byte sizes from a Smartbits to Linux provided the adapter ring buffer
> is loaded with static addresses. This demonstrates that it is
> possible to sustain 64 byte packet rates at full line rate with
> current DMA architectures on 400 MHz buses with Linux.
> (which means it will handle any network loading scenario). The
> bottleneck from my measurements appears to be the
> overhead of serializing writes to the adapter ring buffer IO
> memory. The current drivers also perform interrupt
> coalescing very well with Linux. What's needed is a method for
> submission of ring buffer entries that can be sent in large
> scatter gather listings rather than one at a time. Ring buffers

Batching would also decrease locking overhead on the Linux side (fewer
spinlocks taken).

We do it already for TCP using TSO for up to 64K packets when
the hardware supports it. There were some ideas some time back
to do it also for routing and other protocols - basically passing
lists of skbs to hard_start_xmit instead of always single ones -
but nobody has implemented it so far.

It was one entry in the "ideas to speed up the network stack"
list I posted some time back.

With TSO working fine it doesn't seem to be that pressing.
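
Roughly, the idea was something like the sketch below. This is only an
illustration - hard_start_xmit_list does not exist, and ring_full(),
post_to_tx_ring() and kick_tx_doorbell() are made-up driver helpers:

    #include <linux/netdevice.h>
    #include <linux/skbuff.h>
    #include <linux/spinlock.h>

    /* Hypothetical driver helpers - illustration only. */
    extern int  ring_full(struct net_device *dev);
    extern void post_to_tx_ring(struct net_device *dev, struct sk_buff *skb);
    extern void kick_tx_doorbell(struct net_device *dev);

    /*
     * Sketch of a batched hard_start_xmit: the stack hands the driver a
     * whole list of skbs, so the xmit lock is taken once per batch and
     * the doorbell register is written once per batch, not per packet.
     */
    static int eth_xmit_list(struct sk_buff_head *list, struct net_device *dev)
    {
            struct sk_buff *skb;
            int sent = 0;

            spin_lock(&dev->xmit_lock);
            while ((skb = __skb_dequeue(list)) != NULL) {
                    if (ring_full(dev)) {
                            __skb_queue_head(list, skb);  /* give it back */
                            break;
                    }
                    post_to_tx_ring(dev, skb);    /* fill one descriptor */
                    sent++;
            }
            kick_tx_doorbell(dev);                /* one MMIO write */
            spin_unlock(&dev->xmit_lock);
            return sent;
    }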

One problem with the TSO implementation is that TSO only works for a
single connection. If you have hundreds of connections that chatter in
small packets, batching won't help. The problem is that batching up data
from separate sockets would need more global lists and would add possible
SMP scalability problems from more locks and more shared state. This
is a real concern on Linux now - 512-CPU machines are really unforgiving.

However, in practice it doesn't seem to be that big a problem, because
it's extremely unlikely that you'll sustain even a gigabit Ethernet
with such a multi-process load. It has far more non-network CPU
overhead than a simple packet generator or pktgen.

So overall I agree with Lincoln that the small packet case is not
that interesting, except perhaps for DoS testing.

-Andi


2004-11-23 21:45:17

by Jeff V. Merkey

Subject: Re: Linux 2.6.9 pktgen module causes INIT process respawning and sickness


Andi,

For network forensics and analysis, it is almost a requirement if you
are using Linux. The bus speeds on these systems also support 450 MB/s
throughput for disk and network I/O. I agree it's not that interesting
if you are deploying file servers that are remote-attached over PPPoE
and PPP as a network server or workstation, given that NFS and
user-space servers like Samba are predominant in Linux for file service.
High-performance real-time network analysis is a different story.
High-performance I/O file service and storage are also interesting, and
I can see folks wanting it.

I guess I have a hard time understanding the following statement:

" ... perhaps [supporting 10 GbE and 1 GbE for high performance beyond
remote internet access] is not that interesting ... "

Hope it's not too wet in Germany this time of year. I am heading back
to Stolberg and Heinsberg at the end of January (I hope) to show off our
new baby boy, born Oct 11, 2004, to his O-ma and O-O-ma (I guess this is
how you spell this). I might even make it to Nurnberg while I'm at it. :-)

Implementation of this with skbs would not be trivial. M$, in their
network drivers, did this sort of circular-list-of-pages structure per
adapter for receives, used it "pinned" in some of their proprietary
drivers in W2K, and would use their version of an skb as a "pointer" of
sorts that could dynamically assign a filled page from this list as a
"receive", then perform the user-space copy from the page and release it
back to the adapter. This allowed them to fill the ring buffers with
static addresses and copy into user space as fast as they could allocate
control blocks.

For Linux, I would guess the easiest way to do this same sort of thing
would be to allocate a page per ring buffer entry, pin the entries, and
use allocated skb buffers to point into the buffer long enough to copy
out the data. This would **HELP** currently but not fix the problem
completely; the approach would, however, allow Linux to easily move to a
table-driven method, since it would switch from a ring of pinned pages
to tables of pinned pages that could be swapped in and out.

We would need to logically detach the memory from the skb and make the
skb a pointer block into the skb->data area of the list. M$ does
something similar to what I described. It does make the whole skb_clone
thing a lot more complicated, but for those apps that need to "hold"
skbs, which is infrequent for most cases, someone could just call
skb_clone() when they needed a private copy of an skb and its data.
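
Very roughly, the receive side of that idea might look like the sketch
below. The rx_slot layout and the copy_to_listener()/give_slot_back_to_nic()
helpers are invented for illustration, and pointing skb->data at the pinned
page is exactly the "detach the memory from the skb" change described above,
not something current skbs allow:

    #include <linux/skbuff.h>
    #include <linux/mm.h>
    #include <linux/dma-mapping.h>

    #define RX_RING_SIZE 256

    /* One pinned page per ring entry; the DMA address handed to the
     * adapter never changes, so descriptors never need to be rewritten
     * with a fresh buffer address after each packet. */
    struct rx_slot {
            struct page *page;      /* pinned and mapped once at init */
            dma_addr_t   dma;       /* static address the NIC always uses */
    };

    static struct rx_slot rx_ring[RX_RING_SIZE];

    /* Hypothetical consumers - illustration only. */
    extern void copy_to_listener(void *data, unsigned int len);
    extern void give_slot_back_to_nic(int idx);

    static void rx_one(int idx, unsigned int len)
    {
            struct rx_slot *slot = &rx_ring[idx];
            struct sk_buff *skb = dev_alloc_skb(0);   /* "pointer block" */

            if (!skb)
                    return;
            /* Point the skb at the pinned page instead of its own buffer.
             * This is the part that needs the skb/skb->data detach. */
            skb->data = page_address(slot->page);
            skb->len  = len;

            copy_to_listener(skb->data, skb->len);  /* copy out the data */
            kfree_skb(skb);
            give_slot_back_to_nic(idx);     /* page is reusable again */
    }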

Jeff

Andi Kleen wrote:

>"Jeff V. Merkey" <[email protected]> writes:
>
>
>>I can sustain full line rate gigabit on two adapters at the same time
>>with a 12 CLK interpacket gap time and 0 dropped packets at 64
>>byte sizes from a Smartbits to Linux provided the adapter ring buffer
>>is loaded with static addresses. This demonstrates that it is
>>possible to sustain 64 byte packet rates at full line rate with
>>current DMA architectures on 400 MHz buses with Linux.
>>(which means it will handle any network loading scenario). The
>>bottleneck from my measurements appears to be the
>>overhead of serializing writes to the adapter ring buffer IO
>>memory. The current drivers also perform interrupt
>>coalescing very well with Linux. What's needed is a method for
>>submission of ring buffer entries that can be sent in large
>>scatter gather listings rather than one at a time. Ring buffers
>>
>>
>
>Batching would also decrease locking overhead on the Linux side (fewer
>spinlocks taken).
>
>We do it already for TCP using TSO for up to 64K packets when
>the hardware supports it. There were some ideas some time back
>to do it also for routing and other protocols - basically passing
>lists of skbs to hard_start_xmit instead of always single ones -
>but nobody has implemented it so far.
>
>It was one entry in the "ideas to speed up the network stack"
>list I posted some time back.
>
>With TSO working fine it doesn't seem to be that pressing.
>
>One problem with the TSO implementation is that TSO only works for a
>single connection. If you have hundreds of connections that chatter in
>small packets, batching won't help. The problem is that batching up data
>from separate sockets would need more global lists and would add possible
>SMP scalability problems from more locks and more shared state. This
>is a real concern on Linux now - 512-CPU machines are really unforgiving.
>
>However, in practice it doesn't seem to be that big a problem, because
>it's extremely unlikely that you'll sustain even a gigabit Ethernet
>with such a multi-process load. It has far more non-network CPU
>overhead than a simple packet generator or pktgen.
>
>So overall I agree with Lincoln that the small packet case is not
>that interesting, except perhaps for DoS testing.
>
>-Andi
>
>
>

2004-11-23 22:30:50

by Andi Kleen

Subject: Re: Linux 2.6.9 pktgen module causes INIT process respawning and sickness

On Tue, Nov 23, 2004 at 02:57:16PM -0700, Jeff V. Merkey wrote:
> Implementation of this with skbs would not be trivial. M$, in their
> network drivers, did this sort of circular-list-of-pages structure per
> adapter for receives, used it "pinned" in some of their proprietary
> drivers in W2K, and would use their version of an skb as a "pointer" of
> sorts that could dynamically assign a filled page from this list as a
> "receive", then perform the user-space copy from the page and release it
> back to the adapter. This allowed them to fill the ring buffers with
> static addresses and copy into user space as fast as they could allocate
> control blocks.

The point is to eliminate the writes for the address and buffer
fields in the ring descriptor, right? I don't really see the point,
because you have to toggle at least the owner bit, so you
always have a cacheline-sized transaction on the bus.
And that would likely include the ring descriptor anyway, just
implicitly in the read-modify-write cycle.

If you're worried about the latencies of the separate writes,
you could always use write combining to combine the writes.

If you write the full cache line you could possibly even
avoid the read in this case.

On x86-64 it can be enabled for writel/writeq with CONFIG_UNORDERED_IO.
You just have to be careful to add all the required memory
barriers, but the driver should have those already if it works
on IA64/sparc64/alpha/ppc64.

It's an experimental option, not enabled by default on x86-64, because
the performance implications haven't really been investigated well.
You could probably do it on i386 too by setting the right MSR
or adding an ioremap_wc().
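
As a small illustration only - the register offset is made up, and
ioremap_wc() here is the to-be-added mapping helper mentioned above,
not something i386 has today:

    #include <asm/io.h>
    #include <linux/types.h>
    #include <linux/errno.h>

    #define TX_TAIL_REG 0x3818  /* made-up doorbell/tail register offset */

    static void __iomem *ring_regs;

    static int map_ring_regs(unsigned long phys, unsigned long size)
    {
            /* Map the descriptor area write-combined so the individual
             * 32-bit stores can be merged into one bus burst. */
            ring_regs = ioremap_wc(phys, size);
            return ring_regs ? 0 : -ENOMEM;
    }

    static void post_tx_tail(u32 new_tail)
    {
            /* ...descriptor fields written through ring_regs here... */

            /* With write combining (or CONFIG_UNORDERED_IO) the implicit
             * ordering of writel() is gone, so fence before the doorbell
             * write that tells the hardware the descriptors are valid. */
            wmb();
            writel(new_tail, ring_regs + TX_TAIL_REG);
    }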

-Andi

2004-11-23 22:46:39

by Jeff V. Merkey

Subject: Re: Linux 2.6.9 pktgen module causes INIT process respawning and sickness

Andi Kleen wrote:

>On Tue, Nov 23, 2004 at 02:57:16PM -0700, Jeff V. Merkey wrote:
>
>
>>Implementation of this with skbs would not be trivial. M$, in their
>>network drivers, did this sort of circular-list-of-pages structure per
>>adapter for receives, used it "pinned" in some of their proprietary
>>drivers in W2K, and would use their version of an skb as a "pointer" of
>>sorts that could dynamically assign a filled page from this list as a
>>"receive", then perform the user-space copy from the page and release it
>>back to the adapter. This allowed them to fill the ring buffers with
>>static addresses and copy into user space as fast as they could allocate
>>control blocks.
>>
>>
>
>The point is to eliminate the writes for the address and buffer
>fields in the ring descriptor, right? I don't really see the point,
>because you have to toggle at least the owner bit, so you
>always have a cacheline-sized transaction on the bus.
>And that would likely include the ring descriptor anyway, just
>implicitly in the read-modify-write cycle.
>
>

True. Without the proposed hardware change to the 1 GbE and 10 GbE
adapters, I doubt this could be eliminated. There would still be the need
to free the descriptor from the ring buffer, and this does require
touching this memory. Scrap that idea. The long-term solution is for the
card vendors to enable a batch mode for submission of ring buffer entries
that does not require clearing any fields, but simply takes an entire
slate of newly allocated s/g entries and swaps them between tables.

For sparse conditions, an interrupt when packet(s) are pending is already
instrumented in these adapters, so adding this capability would not be
difficult. I've probed around with some of these vendors in these
discussions, and for the Intel adapters it would require a change to the
chipset, but not a major one. It's doable.
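
Purely as a strawman for what the driver side of such a batch interface
might look like - no adapter implements anything like this today, and
every name below is invented:

    #include <asm/io.h>
    #include <linux/types.h>

    #define RX_TABLE_BASE_REG  0x5000   /* invented register offsets */
    #define RX_TABLE_COUNT_REG 0x5008

    /* A whole slate of pre-allocated s/g entries, prepared off to the
     * side while the hardware is still consuming the previous table. */
    struct sg_table_slate {
            dma_addr_t table;   /* physical address of the entry table */
            u32        count;   /* how many ready buffers it contains */
    };

    static void swap_in_fresh_slate(void __iomem *regs,
                                    struct sg_table_slate *fresh)
    {
            /* Two writes publish an entire table of buffers; no
             * per-entry descriptor fields are cleared or rewritten. */
            writeq(fresh->table, regs + RX_TABLE_BASE_REG);
            writel(fresh->count, regs + RX_TABLE_COUNT_REG);
    }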

>If you're worried about the latencies of the separate writes,
>you could always use write combining to combine the writes.
>
>If you write the full cache line you could possibly even
>avoid the read in this case.
>
>On x86-64 it can be enabled for writel/writeq with CONFIG_UNORDERED_IO.
>You just have to be careful to add all the required memory
>barriers, but the driver should have those already if it works
>on IA64/sparc64/alpha/ppc64.
>
>It's an experimental option, not enabled by default on x86-64, because
>the performance implications haven't really been investigated well.
>You could probably do it on i386 too by setting the right MSR
>or adding an ioremap_wc().
>
>

I will look at this feature and see how much it helps. Long term, folks
should ask the board vendors if they would be willing to instrument
something like this. Then the OSes could actually use 10GbE. The buses
support the bandwidth today, and I have measured it.

Jeff

>-Andi
>
>
>

2004-11-25 02:20:52

by Lincoln Dale

Subject: Re: Linux 2.6.9 pktgen module causes INIT process respawning and sickness

At 09:54 AM 24/11/2004, Jeff V. Merkey wrote:
[..]
>True. Without the proposed hardware change to the 1 GbE and 10 GbE
>adapters, I doubt this could be eliminated. There would still be the need
>to free the descriptor from the ring buffer, and this does require
>touching this memory. Scrap that idea. The long-term solution is for the
>card vendors to enable a batch mode for submission
[..]

Jeff,

so the fact still remains: what is so bad about the current approach?
sure -- it can't do wire-rate 1GbE with minimum-sized frames -- but even if
it could -- would it be able to do bidirectional 1GbE with minimum-sized
frames?

even if you could, can you name a real-world application that would
actually need that?


you make the point of "these things are necessary for 10GbE".
sure, but -- again -- 10GbE NICs are typically an entirely different beast,
with far more offload, RAM, DMA & on-board firmware capabilities.

take a look at any of the 10GbE adapters, either already released,
announced, or in development. they all go well beyond 1GbE NICs for
embedded smarts; they have to.

the ability to do wire-rate minimum-packet-size 10GbE is still not going
to be something that any real-world app (that i can think of) requires.
10GbE wire-rate is on the order of ~14.88 million packets/second. that
works out to approximately 1 packet every 67 nanoseconds.
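
(for reference, that number falls straight out of the framing overhead:
a minimum frame is 64 bytes, plus 8 bytes of preamble and a 12-byte
inter-frame gap on the wire, so

    10^10 bit/s / ((64 + 8 + 12) bytes * 8 bits/byte)
        = 10^10 / 672  ~= 14.88 million packets/second
    1 / 14.88e6 s      ~= 67.2 ns per packet.)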



cheers,

lincoln.