2003-11-26 00:15:16

by Mr. BOFH

[permalink] [raw]
Subject: Fire Engine??


Sun has announced that they have redone their TCP/IP stack and are showing
in some instances a 30% improvement over Linux....

http://www.theregister.co.uk/content/61/33440.html



2003-11-26 02:11:20

by Larry McVoy

[permalink] [raw]
Subject: Re: [OT] Re: Fire Engine??

On Wed, Nov 26, 2003 at 12:48:19PM +1100, Nick Piggin wrote:
>
>
> Mr. BOFH wrote:
>
> >Sun has announced that they have redone their TCP/IP stack and are showing
> >in some instances a 30% improvement over Linux....
> >
> >http://www.theregister.co.uk/content/61/33440.html
> >
> >
>
> That's odd. Since when did Linux's TCP/IP stack become the benchmark? :)
>
> PS. This isn't really appropriate for this list. I'm sure an open and
> verifiable comparison would be welcomed though.

And not to dis my Alma Mater but I tend to think the whole TOE idea is a lose.
I used to think otherwise, while I was a Sun employee, and Sun employee #1
pointed out to me that CPUs and memory were getting faster more quickly than
the TOE type answers could come to market. He was right then and he seems
to still be right.

Maybe throwing processors at the problem will make him (and me now) wrong
but I have to think I could do better things with a CPU than offload some
TCP packets.

Linux has it right. Make the normal case fast and lightweight and ignore
the other cases. There are no other cases if the normal path is fast.

Another way to say "fast path" is "our normal path sucks".
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-11-26 02:05:38

by Nick Piggin

[permalink] [raw]
Subject: [OT] Re: Fire Engine??



Mr. BOFH wrote:

>Sun has announced that they have redone their TCP/IP stack and are showing
>in some instances a 30% improvement over Linux....
>
>http://www.theregister.co.uk/content/61/33440.html
>
>

That's odd. Since when did Linux's TCP/IP stack become the benchmark? :)

PS. This isn't really appropriate for this list. I'm sure an open and
verifiable comparison would be welcomed though.


2003-11-26 02:30:39

by David Miller

[permalink] [raw]
Subject: Re: Fire Engine??

On Tue, 25 Nov 2003 16:15:12 -0800
"Mr. BOFH" <[email protected]> wrote:

> http://www.theregister.co.uk/content/61/33440.html

This was amusing to read, let's read the claim carefully,
shall we?

"We worked hard on efficiency, and we now measure,
at a given network workload on identical x86 hardware,
we use 30 percent less CPU than Linux."

So his claim is that, in their measurements, "CPU utilization"
was lower in their stack. Was he using 2.6.x and TSO capable
cards on the Linux side? If not, it's not apples to apples
against our current and upcoming technology.

And while his CPU utilization claim is interesting (I bet that gain
would go to zero if they'd used Linux TSO in 2.6.x), was the
networking bandwidth and latency any better as a result? I think it's
not by accident that the claim was phrased the way it was.

In fact, I bet their connection setup/teardown latency will go in the
toilet with this stuff and Solaris was already horrible in this area.
It is a well established fact that TOE technologies have this problem
because of how the socket setup/teardown operation with TOE cards
requires the OS to go over the bus a few times.

I'm not worried at all about Sun's fire engine. It's preliminary
technology, and they are going to discover all of the problems TOE
stuff has that I've discussed several times on this list.

They even mention that they don't support any current generation
shipping TOE cards yet; at least I offer a CPU utilization reduction
optimization (TSO in 2.6.x) with multiple implementations on current
generation hardware (e1000, tg3, etc.).

I fully welcome them to put Linux up against their incredible fire
engine crap in a sanctioned specweb run on identical hardware. :)

2003-11-26 02:48:32

by David Miller

[permalink] [raw]
Subject: Re: [OT] Re: Fire Engine??

On Tue, 25 Nov 2003 18:11:11 -0800
Larry McVoy <[email protected]> wrote:

> I used to think otherwise, while I was a Sun employee, and Sun employee #1
> pointed out to me that CPUs and memory were getting faster more quickly than
> the TOE type answers could come to market. He was right then and he seems
> to still be right.

Maybe this was at least partially the impetus behind his recent
departure from the company. And if not the impetus, a possible straw
that broke the camel's back.

How fast will cpus be when Sun actually deploys this stuff?

A commodity x86 U1 box at that time will probably have 6+ GHZ
cpus in it, and super-duper-DDR or whatever the current memory
technology will be. Why do I need Sun's TOE crap in this box?
Where's all that precious CPU I need to be saving?

This stuff isn't really useful for huge database servers either.

What do they plan to do, put Solaris 10 on iSCSI drives? ROFL! :)

These days Sun is already several laps behind before the green flag
even comes out to start the race.

2003-11-26 03:31:54

by Rik van Riel

[permalink] [raw]
Subject: Re: [OT] Re: Fire Engine??

On Tue, 25 Nov 2003, Larry McVoy wrote:

> And not to dis my Alma Mater but I tend to think the whole TOE idea is a
> lose. I used to think otherwise, while I was a Sun employee, and Sun
> employee #1 pointed out to me that CPUs and memory were getting faster
> more quickly than the TOE type answers could come to market. He was
> right then and he seems to still be right.

I guess TCP offloading is a good way to stub your TOE ;)

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2003-11-26 05:41:31

by Valdis Klētnieks

[permalink] [raw]
Subject: Re: Fire Engine??

On Tue, 25 Nov 2003 16:15:12 PST, "Mr. BOFH" <[email protected]> said:
>
> Sun has announced that they have redone their TCP/IP stack and are showing
> in some instances a 30% improvement over Linux....
>
> http://www.theregister.co.uk/content/61/33440.html

Hmm.. IBM tried this same idea with their 8232 Ethernet controller
(basically, an 'industrial' PC with a 3Com card and a bus&tag card)
and offload of some TCP/IP functionality back in 1988 or so.



2003-11-26 10:47:50

by Andi Kleen

[permalink] [raw]
Subject: Re: Fire Engine??

"David S. Miller" <[email protected]> writes:
>
> So his claim is that, in their measurements, "CPU utilization"
> was lower in their stack. Was he using 2.6.x and TSO capable
> cards on the Linux side? If not, it's not apples to apples
> against our current and upcoming technology.

Maybe they just have a better copy_to_user(). That eats most time anyways.

I think there are definitely areas of improvement left in current TCP.
It has gotten quite fat over the last few years.

Some issues just from the top of my head. I have not done detailed profiling
recently and don't know if any of this would help significantly. It is
just what I remember right now.

- Window computation for incoming packets is quite dumbly coded right now
and could be optimized
- I suspect the copy/process-in-user-context setup needs to be rethought/
rebenchmarked in Gigabit setups. There was at least one test case
where tcp_low_latency=1 helped. It just adds latency that might hurt
and is not very useful when you have hardware checksums anyways
- If they tested TCP-over-NFS then I'm pretty sure Linux lost badly because
the current paths for that are just awfully inefficient.
- Overall IP/TCP could probably have some more instructions and hopefully
cache misses shaved off with some careful going over the fast paths.
- There are too many locks. That hurts when you have slow atomic operations
(like on P4), and it compounds with the next issue.
- We do most things one packet at a time. This means locking and multiple
layer overhead multiplies. Most network operations come in packet bursts
and it would be much more efficient to batch operations: always process
lists of packets instead of single packets. This could probably lower
locking overhead a lot.
- On TX we are inefficient for the same reason. TCP builds one packet
at a time and then goes down through all layers taking all locks (queue,
device driver etc.) and submits the single packet. Then repeats that for
lots of packets because many TCP writes are > MTU. Batching that would
likely help a lot, like it was done in the 2.6 VFS. I think it could
also make hard_start_xmit in many drivers significantly faster.
- The hash tables are too big. This causes unnecessary cache misses all the
time.
- Doing gettimeofday on each incoming packet is just dumb, especially
when you have gettimeofday backed with a slow southbridge timer.
This shows quite badly on many profile logs.
I still think the right solution for that would be to only take time stamps
when there is any user for it (= no timestamps in 99% of all systems)
- user copy and checksum could probably also be done faster if they were
batched for multiple packets. It is hard to optimize properly for
<= 1.5K copies.
This is especially true for 4/4 split kernels which will eat a
page table lookup + lock for each individual copy, but also for others.
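
As a toy illustration of the batching point in the list above (all names
invented here, not actual kernel code): the saving comes from taking the
queue lock once per burst instead of once per packet.

```c
#include <stddef.h>

struct pkt { struct pkt *next; };

struct txq {
    struct pkt *head, **tail;
    unsigned long lock_acquisitions;    /* stand-in for spinlock cost */
};

static void q_lock(struct txq *q)   { q->lock_acquisitions++; }
static void q_unlock(struct txq *q) { (void)q; }

/* One packet at a time: one lock round-trip per packet. */
static void xmit_single(struct txq *q, struct pkt *p)
{
    q_lock(q);
    p->next = NULL;
    *q->tail = p;
    q->tail = &p->next;
    q_unlock(q);
}

/* Batched: a whole burst goes in under a single lock round-trip. */
static void xmit_batch(struct txq *q, struct pkt **burst, size_t n)
{
    q_lock(q);
    for (size_t i = 0; i < n; i++) {
        burst[i]->next = NULL;
        *q->tail = burst[i];
        q->tail = &burst[i]->next;
    }
    q_unlock(q);
}
```

With bursts of, say, 8 packets the lock traffic on this path drops 8x; the
hard part is threading such lists through the queueing, classification and
netfilter layers without changing their semantics.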

-Andi

2003-11-26 11:30:46

by John Bradford

[permalink] [raw]
Subject: Re: Fire Engine??

Quote from Andi Kleen <[email protected]>:
> "David S. Miller" <[email protected]> writes:
> >
> > So his claim is that, in their measurements, "CPU utilization"
> > was lower in their stack. Was he using 2.6.x and TSO capable
> > cards on the Linux side? If not, it's not apples to apples
> > against our current and upcoming technology.
>
> Maybe they just have a better copy_to_user(). That eats most time anyways.
>
> I think there are definitely areas of improvement left in current TCP.
> It has gotten quite fat over the last few years.

On the subject of general networking performance in Linux, I thought
this set of benchmarks was quite interesting:

http://bulk.fefe.de/scalability/

particularly the 2.4 -> 2.6 comparisons.

John.

2003-11-26 15:01:17

by Trond Myklebust

[permalink] [raw]
Subject: Re: Fire Engine??

>>>>> " " == Andi Kleen <[email protected]> writes:

> - If they tested TCP-over-NFS then I'm pretty sure Linux lost
^^^^^^^^^^^^ That would be inefficient 8-)
> badly because the current paths for that are just awfully
> inefficient.

...mind elaborating?

Cheers,
Trond

2003-11-26 18:50:35

by Mike Fedyk

[permalink] [raw]
Subject: Re: Fire Engine??

On Wed, Nov 26, 2003 at 11:35:03AM +0000, John Bradford wrote:
> Quote from Andi Kleen <[email protected]>:
> > "David S. Miller" <[email protected]> writes:
> > >
> > > So his claim is that, in their measurements, "CPU utilization"
> > > was lower in their stack. Was he using 2.6.x and TSO capable
> > > cards on the Linux side? If not, it's not apples to apples
> > > against our current and upcoming technology.
> >
> > Maybe they just have a better copy_to_user(). That eats most time anyways.
> >
> > I think there are definitely areas of improvement left in current TCP.
> > It has gotten quite fat over the last few years.
>
> On the subject of general networking performance in Linux, I thought
> this set of benchmarks was quite interesting:
>
> http://bulk.fefe.de/scalability/

No such file or directory.

2003-11-26 19:18:53

by d.c

[permalink] [raw]
Subject: Re: Fire Engine??

El Wed, 26 Nov 2003 10:50:28 -0800 Mike Fedyk <[email protected]> escribió:

> > http://bulk.fefe.de/scalability/
>
> No such file or directory.

It works here. I don't know if those numbers represent anything for networking.
Some of the benchmarks look more like "vm benchmarking". And are the ones
measuring latency valid, considering that the BSDs lack "preempt"?
(shooting in the dark)

Diego Calleja.

2003-11-26 19:31:21

by David Miller

[permalink] [raw]
Subject: Re: Fire Engine??

On 26 Nov 2003 10:53:21 +0100
Andi Kleen <[email protected]> wrote:

> Some issues just from the top of my head. I have not done detailed profiling
> recently and don't know if any of this would help significantly. It is
> just what I remember right now.

Thanks for the list Andi, I'll keep it around. I'd like
to comment on one entry though.

> - On TX we are inefficient for the same reason. TCP builds one packet
> at a time and then goes down through all layers taking all locks (queue,
> device driver etc.) and submits the single packet. Then repeats that for
> lots of packets because many TCP writes are > MTU. Batching that would
> likely help a lot, like it was done in the 2.6 VFS. I think it could
> also make hard_start_xmit in many drivers significantly faster.

This is tricky, because of getting all of the queueing stuff right.
All of the packet scheduler APIs would need to change, as would
the classification stuff, not to mention netfilter et al.

You're talking about basically redoing the whole TX path if you
want to really support this.

I'm not saying "don't do this", just that we should be sure we know
what we're getting if we invest the time into this.

> - The hash tables are too big. This causes unnecessary cache misses all the
> time.

I agree. See my comments on this topic in another recent linux-kernel
thread wrt. huge hash tables on numa systems.

> - Doing gettimeofday on each incoming packet is just dumb, especially
> when you have gettimeofday backed with a slow southbridge timer.
> This shows quite badly on many profile logs.
> I still think the right solution for that would be to only take time stamps
> when there is any user for it (= no timestamps in 99% of all systems)

Andi, I know this is a problem, but for the millionth time your idea
does not work because we don't know if the user asked for the timestamp
until we are deep within the recvmsg() processing, which is long after
the packet has arrived.

> - user copy and checksum could probably also be done faster if they were
> batched for multiple packets. It is hard to optimize properly for
> <= 1.5K copies.
> This is especially true for 4/4 split kernels which will eat a
> page table lookup + lock for each individual copy, but also for others.

I disagree partially, especially in the presence of a chip that provides
proper implementations of software initiated prefetching.

2003-11-26 19:59:16

by Paul Menage

[permalink] [raw]
Subject: Re: Fire Engine??

David S. Miller wrote:
>
> Andi, I know this is a problem, but for the millionth time your idea
> does not work because we don't know if the user asked for the timestamp
> until we are deep within the recvmsg() processing, which is long after
> the packet has arrived.

How about tracking the number of current sockets that have had timestamp
requests for them? If this number is zero, don't bother with the
timestamps. The first time you get a SIOCGSTAMP ioctl on a given socket,
bump the count and set a flag; decrement the count when the socket is
destroyed if the flag is set.

The drawback is that the first SIOCGSTAMP on any particular socket will
have to return a bogus value (maybe just the current time?). Ways to
mitigate that are:

- have a /proc option to let the sysadmin enforce timestamps on all
packets (just bump the counter)

- bump the counter whenever an interface is in promiscuous mode (I
imagine that tcpdump et al are the main users of the timestamps?)
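
A rough sketch of this counting scheme (all names invented; a real patch
would hook the actual SIOCGSTAMP path and socket destruction, with whatever
locking the kernel requires):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Global count of live sockets that have ever asked for timestamps. */
static atomic_int timestamping_sockets;

struct sock_ts {
    bool wants_timestamps;    /* per-socket flag, set on first request */
};

/* Called on the first timestamp request (e.g. SIOCGSTAMP) for a socket. */
static void sock_enable_timestamps(struct sock_ts *sk)
{
    if (!sk->wants_timestamps) {
        sk->wants_timestamps = true;
        atomic_fetch_add(&timestamping_sockets, 1);
    }
}

/* Called when the socket is destroyed. */
static void sock_release_timestamps(struct sock_ts *sk)
{
    if (sk->wants_timestamps)
        atomic_fetch_sub(&timestamping_sockets, 1);
}

/* RX fast path: only read the clock if somebody actually cares. */
static bool need_rx_timestamp(void)
{
    return atomic_load(&timestamping_sockets) != 0;
}
```

The receive hot path then only pays for the gettimeofday() when
need_rx_timestamp() is true.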

Paul

2003-11-26 19:59:12

by Mike Fedyk

[permalink] [raw]
Subject: Re: Fire Engine??

On Wed, Nov 26, 2003 at 08:19:03PM +0100, Diego Calleja García wrote:
> El Wed, 26 Nov 2003 10:50:28 -0800 Mike Fedyk <[email protected]> escribió:
>
> > > http://bulk.fefe.de/scalability/
> >
> > No such file or directory.
>
> It works here. I don't know if those numbers represent anything for networking.
> Some of the benchmarks look more like "vm benchmarking". And the ones which
> are measuring latency are valid, considering that BSDs are lacking "preempt"?
> (shooting in the dark)

Grr, that trailing "/" made the difference. :-/

2003-11-26 20:05:40

by David Miller

[permalink] [raw]
Subject: Re: Fire Engine??

On Wed, 26 Nov 2003 20:01:53 +0000
Jamie Lokier <[email protected]> wrote:

> Do the timestamps need to be precise and accurately reflect the
> arrival time in the irq handler?

It would be a regression to make the timestamps less accurate
than those provided now.

> Or, for TCP timestamps,

The timestamps we are talking about are not used for TCP.

> Apart from TCP, precise timestamps are only used for packet capture,
> and it's easy to keep track globally of whether anyone has packet
> sockets open.

We have no knowledge of what an application's requirements are,
that is why we provide as accurate a timestamp as possible.

If we were writing this stuff for the first time now, sure we could
specify things however conveniently we like, but how this stuff behaves
is already well defined.

2003-11-26 20:02:05

by Jamie Lokier

[permalink] [raw]
Subject: Re: Fire Engine??

David S. Miller wrote:
> > - Doing gettimeofday on each incoming packet is just dumb, especially
> > when you have gettimeofday backed with a slow southbridge timer.
> > This shows quite badly on many profile logs.
> > I still think the right solution for that would be to only take time stamps
> > when there is any user for it (= no timestamps in 99% of all systems)
>
> Andi, I know this is a problem, but for the millionth time your idea
> does not work because we don't know if the user asked for the timestamp
> until we are deep within the recvmsg() processing, which is long after
> the packet has arrived.

Do the timestamps need to be precise and accurately reflect the
arrival time in the irq handler? Or, for TCP timestamps, would it be
good enough to use the time when the protocol handlers are run, and
only read the hardware clock once for a bunch of received packets? Or
even use jiffies?

Apart from TCP, precise timestamps are only used for packet capture,
and it's easy to keep track globally of whether anyone has packet
sockets open.

-- Jamie

2003-11-26 20:03:58

by David Miller

[permalink] [raw]
Subject: Re: Fire Engine??

On Wed, 26 Nov 2003 11:58:44 -0800
Paul Menage <[email protected]> wrote:

> How about tracking the number of current sockets that have had timestamp
> requests for them? If this number is zero, don't bother with the
> timestamps. The first time you get a SIOCGSTAMP ioctl on a given socket,
> bump the count and set a flag; decrement the count when the socket is
> destroyed if the flag is set.

Reread what I said please, the user can ask for timestamps using CMSG
objects via the recvmsg() system call, there are no ioctls or socket
controls done on the socket. It is completely dynamic and
unpredictable.

2003-11-26 20:22:36

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Fire Engine??

On Wed, Nov 26, 2003 at 11:30:40AM -0800, David S. Miller wrote:
> > - Doing gettimeofday on each incoming packet is just dumb, especially
> > when you have gettimeofday backed with a slow southbridge timer.
> > This shows quite badly on many profile logs.
> > I still think the right solution for that would be to only take time stamps
> > when there is any user for it (= no timestamps in 99% of all systems)
>
> Andi, I know this is a problem, but for the millionth time your idea
> does not work because we don't know if the user asked for the timestamp
> until we are deep within the recvmsg() processing, which is long after
> the packet has arrived.

I believe what Andi was suggesting was that if there are **no** processes
currently requesting timestamps, then we can dispense with
taking the timestamp. If a single user asks for the timestamp, then
we would still end up taking timestamps on all packets. Is this worth
the overhead to keep track of that factor? It's arguable, but on some
platforms, probably yes.

- Ted

2003-11-26 21:03:31

by David Miller

[permalink] [raw]
Subject: Re: Fire Engine??

On Wed, 26 Nov 2003 15:22:16 -0500
"Theodore Ts'o" <[email protected]> wrote:

> I believe what Andi was suggesting was that if there are **no** processes
> currently requesting timestamps, then we can dispense with
> taking the timestamp.

You can predict what the arguments will be for the user's
recvmsg() system call at the time of packet reception? Wow,
show me how :)

2003-11-26 21:24:18

by Jamie Lokier

[permalink] [raw]
Subject: Re: Fire Engine??

David S. Miller wrote:
> > that are currently requesting timestamps, then we can dispense with
> > taking the timestamp.
>
> You can predict what the arguments will be for the user's
> recvmsg() system call at the time of packet reception? Wow,
> show me how :)

recvmsg() doesn't return timestamps until they are requested
using setsockopt(...SO_TIMESTAMP...).

See sock_recv_timestamp() in include/net/sock.h.

-- Jamie

2003-11-26 21:39:28

by David Miller

[permalink] [raw]
Subject: Re: Fire Engine??

On Wed, 26 Nov 2003 21:24:06 +0000
Jamie Lokier <[email protected]> wrote:

> recvmsg() doesn't return timestamps until they are requested
> using setsockopt(...SO_TIMESTAMP...).
>
> See sock_recv_timestamp() in include/net/sock.h.

See MSG_ERRQUEUE and net/ipv4/ip_sockglue.c

2003-11-26 21:34:54

by Arjan van de Ven

[permalink] [raw]
Subject: Re: Fire Engine??

On Wed, 2003-11-26 at 20:30, David S. Miller wrote:

> > - Doing gettimeofday on each incoming packet is just dumb, especially
> > when you have gettimeofday backed with a slow southbridge timer.
> > This shows quite badly on many profile logs.
> > I still think the right solution for that would be to only take time stamps
> > when there is any user for it (= no timestamps in 99% of all systems)
>
> Andi, I know this is a problem, but for the millionth time your idea
> does not work because we don't know if the user asked for the timestamp
> until we are deep within the recvmsg() processing, which is long after
> the packet has arrived.

question: do we need a timestamp for every packet or can we do one
timestamp per irq-context entry? (e.g. one timestamp at irq entry time, which
we do anyway, kept for all packets processed in the softirq)



2003-11-26 21:55:04

by Pekka Pietikäinen

[permalink] [raw]
Subject: Re: Fire Engine??

On Wed, Nov 26, 2003 at 08:01:53PM +0000, Jamie Lokier wrote:
> > Andi, I know this is a problem, but for the millionth time your idea
> > does not work because we don't know if the user asked for the timestamp
> > until we are deep within the recvmsg() processing, which is long after
> > the packet has arrived.
>
> Do the timestamps need to be precise and accurately reflect the
> arrival time in the irq handler? Or, for TCP timestamps, would it be
> good enough to use the time when the protocol handlers are run, and
> only read the hardware clock once for a bunch of received packets? Or
> even use jiffies?

> Apart from TCP, precise timestamps are only used for packet capture,
> and it's easy to keep track globally of whether anyone has packet
> sockets open.
It should probably be noted that really hardcore timestamp users
have their NICs do it for them, since interrupt coalescing
makes timestamps done in the kernel too inaccurate for them even
if rdtsc is used (http://www-didc.lbl.gov/papers/SCNM-PAM03.pdf).
Not that it's anywhere near a universal solution, since more or less only
one brand of NICs supports them.

It would probably be a useful experiment to see whether the performance is
improved in a noticeable way if, say, jiffies were used. If so, it might be a
reasonable choice for a configurable option; if not, then not.
Isn't stuff like this the reason why the experimental network patches tree
that was announced a while back is out there? ;-)

2003-11-26 22:29:14

by Andi Kleen

[permalink] [raw]
Subject: Re: Fire Engine??

On Wed, 26 Nov 2003 12:03:16 -0800
"David S. Miller" <[email protected]> wrote:

> On Wed, 26 Nov 2003 11:58:44 -0800
> Paul Menage <[email protected]> wrote:
>
> > How about tracking the number of current sockets that have had timestamp
> > requests for them? If this number is zero, don't bother with the
> > timestamps. The first time you get a SIOCGSTAMP ioctl on a given socket,
> > bump the count and set a flag; decrement the count when the socket is
> > destroyed if the flag is set.
>
> Reread what I said please, the user can ask for timestamps using CMSG
> objects via the recvmsg() system call, there are no ioctls or socket
> controls done on the socket. It is completely dynamic and
> unpredictable.

The user sets the SO_TIMESTAMP setsockopt to 1 and then you get the cmsg.
That's per-socket state. The other way is to use the SIOCGSTAMP ioctl.
That is a bit more ugly because it has no state, but you can do
a heuristic and assume that a process that does SIOCGSTAMP once
will do it in the future too, and set a flag in this case.

The first SIOCGSTAMP would be inaccurate, but the following ones (after
all untimestamped packets have been flushed) would be ok.

Doing it for IP would be relatively easy; the only major users of the
timestamp seem to be DECnet and the bridge, but I suppose those could be
converted to use jiffies too.

-Andi

2003-11-26 22:39:37

by Andi Kleen

[permalink] [raw]
Subject: Re: Fire Engine??

On Wed, 26 Nov 2003 11:30:40 -0800
"David S. Miller" <[email protected]> wrote:

>
> > - On TX we are inefficient for the same reason. TCP builds one packet
> > at a time and then goes down through all layers taking all locks (queue,
> > device driver etc.) and submits the single packet. Then repeats that for
> > lots of packets because many TCP writes are > MTU. Batching that would
> > likely help a lot, like it was done in the 2.6 VFS. I think it could
> > also make hard_start_xmit in many drivers significantly faster.
>
> This is tricky, because of getting all of the queueing stuff right.
> All of the packet scheduler APIs would need to change, as would
> the classification stuff, not to mention netfilter et al.

You only need to do a fast path for the default scheduler at the beginning.
Every complicated "slow" API like advanced queuing or netfilter can still fall back
to one packet at a time until cleaned up (a similar strategy to what was done with
the non-linear skbs).

> You're talking about basically redoing the whole TX path if you
> want to really support this.
>
> I'm not saying "don't do this", just that we should be sure we know
> what we're getting if we invest the time into this.

In some profiling I did some time ago queue locks and device driver
locks were the biggest offenders on TX after copy.

The only tricky part is to get the state machine in tcp_do_sendmsg()
right that decides when to flush.

> > - user copy and checksum could probably also be done faster if they were
> > batched for multiple packets. It is hard to optimize properly for
> > <= 1.5K copies.
> > This is especially true for 4/4 split kernels which will eat a
> > page table lookup + lock for each individual copy, but also for others.
>
> I disagree partially, especially in the presence of a chip that provides
> proper implementations of software initiated prefetching.

Especially for prefetching, having a list of packets helps because you
can prefetch the next while you're working on the current one. The CPU
hardware prefetcher cannot do that for you.
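
A minimal sketch of that pattern (an invented example, not driver code):
issue the prefetch for packet i+1 before doing the work on packet i, so the
cache miss overlaps with useful cycles. __builtin_prefetch is the GCC/Clang
builtin; a kernel would use its prefetch() wrapper.

```c
#include <stddef.h>

struct pktbuf { unsigned char data[1500]; };  /* one MTU-sized packet */

/* Stand-in for per-packet work that touches the whole payload. */
static unsigned long checksum(const struct pktbuf *p)
{
    unsigned long sum = 0;
    for (size_t i = 0; i < sizeof(p->data); i++)
        sum += p->data[i];
    return sum;
}

static unsigned long process_burst(struct pktbuf **list, size_t n)
{
    unsigned long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 1 < n)
            __builtin_prefetch(list[i + 1]);  /* hide the next miss
                                                 behind current work */
        total += checksum(list[i]);
    }
    return total;
}
```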

I did look seriously at faster csum-copy/copy-to-user for K8, but the conclusion
was that all the tricks are only worth it when you can work with bigger amounts of data.
1.5K at a time is just too small.

Ah yes:

- Investigate more performance through explicit prefetching
(e.g. in the device drivers to optimize eth_type_trans() when you can classify the packet
just by looking at the RX ring state. Instead do a prefetch on the packet data
and hope the data is already in cache when the IP stack gets around to looking at it)

could also be added to the list.

-Andi (who shuts up now because I don't have any time to code on any of this :-( )

2003-11-26 22:36:57

by David Miller

[permalink] [raw]
Subject: Re: Fire Engine??

On Wed, 26 Nov 2003 23:29:09 +0100
Andi Kleen <[email protected]> wrote:

> The first SIOCGSTAMP would be inaccurate, but the following ones (after
> all untimestamped packets have been flushed) would be ok.

I don't think this is acceptable. It's important that all
of the timestamps are as accurate as they were before.

2003-11-26 22:47:27

by David Miller

[permalink] [raw]
Subject: Re: Fire Engine??

On Wed, 26 Nov 2003 23:39:18 +0100
Andi Kleen <[email protected]> wrote:

> You only need to do a fast path for the default scheduler at the beginning.

In the end we're going to have a design and we're going to do it
right, if we decide to do this.

Sun needs fast paths, not us.

> Especially for prefetching having a list of packets helps because you
> can prefetch the next while you're working on the current one. The CPU
> hardware prefetcher cannot do that for you.

The initial prefetches are consumed by the copy implementation
setup instructions. By the time the real loads execute, the
data is there or not very far away.

This I have measured on UltraSPARC, I suspect other cpus can
match that if not do better.

> I did look seriously at faster csum-copy/copy-to-user for K8, but the conclusion
> was that all the tricks are only worth it when you can work with bigger amounts of data.
> 1.5K at a time is just too small.

Not true; once you have ~300 or so bytes you have enough inertia
to get a good stream going in the main loop. Really, look at the
ultrasparc-III stuff I wrote for heuristics.

You really should write the k8 code before coming to conclusions
about what it would or would not be capable of doing :)

2003-11-26 22:58:48

by Andi Kleen

[permalink] [raw]
Subject: Re: Fire Engine??

On Wed, 26 Nov 2003 14:36:20 -0800
"David S. Miller" <[email protected]> wrote:

> On Wed, 26 Nov 2003 23:29:09 +0100
> Andi Kleen <[email protected]> wrote:
>
> > The first SIOCGSTAMP would be inaccurate, but the following ones (after
> > all untimestamped packets have been flushed) would be ok.
>
> I don't think this is acceptable. It's important that all
> of the timestamps are as accurate as they were before.

I disagree on that. The window is small, and slowing down the 99.99999% of
users who never care about this, for this extremely obscure misdesigned API,
does not make much sense to me.

Also if you worry about these you could add an optional sysctl
to always take it, so if anybody really has an application that relies
on the first time stamp being accurate and they cannot use SO_TIMESTAMP
they could set the sysctl.

-Andi

2003-11-26 23:01:50

by Andi Kleen

[permalink] [raw]
Subject: Re: Fire Engine??

On 26 Nov 2003 10:00:09 -0500
Trond Myklebust <[email protected]> wrote:

> >>>>> " " == Andi Kleen <[email protected]> writes:
>
> > - If they tested TCP-over-NFS then I'm pretty sure Linux lost
> ^^^^^^^^^^^^ That would be inefficient 8-)

grin.

> > badly because the current paths for that are just awfully
> > inefficient.
>
> ...mind elaborating?

Current sunrpc does two recvmsgs for each record to first get the record length
and then the payload.

This means you take all the locks and other overhead twice per packet.

Having a special function that peeks directly at the TCP receive
queue would be much faster (and falls back to normal recvmsg when
there is no data waiting)

But that's the really obvious case. I think if you got out a profiler
and optimized carefully you could likely make this path much more
efficient. Same for sunrpc TX probably, although that seems to be
in a better shape already.
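
The two-recvmsg pattern looks roughly like this (an in-memory stand-in for
the socket, with invented names; real sunrpc obviously works on kernel
socket structures). RPC over TCP frames each record with a 4-byte marker
(length plus a last-fragment bit), so the straightforward reader makes two
receive calls per record:

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>

struct stream {                 /* stand-in for a connected TCP socket */
    const unsigned char *buf;
    size_t len, off;
    unsigned calls;             /* how many "recvmsg"s we issued */
};

static ssize_t stream_recv(struct stream *s, void *dst, size_t n)
{
    s->calls++;                 /* each call pays the lock/unlock cost */
    if (s->len - s->off < n)
        return -1;
    memcpy(dst, s->buf + s->off, n);
    s->off += n;
    return (ssize_t)n;
}

/* Two receive calls per record: one for the marker, one for the payload. */
static ssize_t rpc_read_record(struct stream *s, void *payload, size_t maxlen)
{
    uint32_t marker;

    if (stream_recv(s, &marker, sizeof(marker)) != sizeof(marker))
        return -1;
    uint32_t plen = ntohl(marker) & 0x7fffffffu;  /* strip last-frag bit */
    if (plen > maxlen)
        return -1;
    if (stream_recv(s, payload, plen) != (ssize_t)plen)
        return -1;
    return (ssize_t)plen;
}
```

Counting the receive calls makes the cost visible: two lock/unlock
round-trips per RPC record, which a function that peeks directly at the
receive queue could collapse into one.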

-Andi

2003-11-26 22:58:57

by Andi Kleen

[permalink] [raw]
Subject: Re: Fire Engine??

On Wed, 26 Nov 2003 22:34:10 +0100
Arjan van de Ven <[email protected]> wrote:

> On Wed, 2003-11-26 at 20:30, David S. Miller wrote:
>
> > > - Doing gettimeofday on each incoming packet is just dumb, especially
> > > when you have gettimeofday backed with a slow southbridge timer.
> > > This shows quite badly on many profile logs.
> > > I still think the right solution for that would be to only take time stamps
> > > when there is any user for it (= no timestamps on 99% of all systems)
> >
> > Andi, I know this is a problem, but for the millionth time your idea
> > does not work because we don't know if the user asked for the timestamp
> > until we are deep within the recvmsg() processing, which is long after
> > the packet has arrived.
>
> question: do we need a timestamp for every packet or can we do one
> timestamp per irq-context entry ? (eg one timestamp at irq entry time we
> do anyway and keep that for all packets processed in the softirq)

If people want the timestamp they usually want it to be accurate
(e.g. for tcpdump etc.). Of course there is already a lot of jitter
in this information because it is taken relatively late in the device
driver (long after the NIC has received the packet).

Just most people never care about this at all....

-Andi

2003-11-26 23:14:30

by David Miller

[permalink] [raw]
Subject: Re: Fire Engine??

On Wed, 26 Nov 2003 23:56:41 +0100
Andi Kleen <[email protected]> wrote:

> On Wed, 26 Nov 2003 14:36:20 -0800
> "David S. Miller" <[email protected]> wrote:
>
> > I don't think this is acceptable. It's important that all
> > of the timestamps are as accurate as they were before.
>
> I disagree on that. The window is small and slowing down 99.99999% of all
> users who never care about this for this extremely obscure
> misdesigned API does not make much sense to me.

We can't change behavior like this. Every time we've tried to
do it, we've been burnt. Remember nonlocal-bind?

2003-11-26 23:23:26

by Trond Myklebust

[permalink] [raw]
Subject: Re: Fire Engine??

>>>>> " " == Andi Kleen <[email protected]> writes:

> Current sunrpc does two recvmsgs for each record to first get
> the record length and then the payload.

> This means you take all the locks and other overhead twice per
> packet.

> Having a special function that peeks directly at the TCP
> receive queue would be much faster (and falls back to normal
> recvmsg when there is no data waiting)

Oh, right... That would be the server code you are thinking of, then.

The client already does something like this. I've added a function
tcp_read_sock() that is called directly from tcp_data_ready() and
hence fills the page cache directly from within the softirq.

There are still a few inefficiencies with this approach, though. Most
notable is the fact that you need to call kmap_atomic() several times
per page since the socket lower layers will usually be feeding you 1
skb at a time. I thought you might be referring to those (and that you
might have a good solution to propose ;-))

Cheers,
Trond

2003-11-26 23:29:52

by Andi Kleen

[permalink] [raw]
Subject: Re: Fire Engine??

On Wed, 26 Nov 2003 15:13:52 -0800
"David S. Miller" <[email protected]> wrote:

> On Wed, 26 Nov 2003 23:56:41 +0100
> Andi Kleen <[email protected]> wrote:
>
> > On Wed, 26 Nov 2003 14:36:20 -0800
> > "David S. Miller" <[email protected]> wrote:
> >
> > > I don't think this is acceptable. It's important that all
> > > of the timestamps are as accurate as they were before.
> >
> > I disagree on that. The window is small and slowing down 99.99999% of all
> > users who never care about this for this extremely obscure
> > misdesigned API does not make much sense to me.
>
> We can't change behavior like this. Every time we've tried to
> do it, we've been burnt. Remember nonlocal-bind?

The behaviour is not really changed; only the precision of the timestamp
is temporarily worse (by a few tens of ms on a busy network).

And the jitter in this timestamp is already higher than this when
you consider queueing delays and interrupt mitigation in the driver.

-Andi

2003-11-26 23:38:35

by Andi Kleen

[permalink] [raw]
Subject: Re: Fire Engine??

> There are still a few inefficiencies with this approach, though. Most
> notable is the fact that you need to call kmap_atomic() several times
> per page since the socket lower layers will usually be feeding you 1
> skb at a time. I thought you might be referring to those (and that you
> might have a good solution to propose ;-))

For kmap_atomic? Run an x86-64 box ;-)

In general doing things with more than one packet at a time would
probably be a good idea, but I don't have any deep thoughts on how
to implement this for TCP RX.

-Andi

2003-11-26 23:42:11

by Ben Greear

[permalink] [raw]
Subject: Re: Fire Engine??

David S. Miller wrote:
> On Wed, 26 Nov 2003 23:56:41 +0100
> Andi Kleen <[email protected]> wrote:
>
>
>>On Wed, 26 Nov 2003 14:36:20 -0800
>>"David S. Miller" <[email protected]> wrote:
>>
>>
>>>I don't think this is acceptable. It's important that all
>>>of the timestamps are as accurate as they were before.
>>
>>I disagree on that. The window is small and slowing down 99.99999% of all
>>users who never care about this for this extremely obscure
>>misdesigned API does not make much sense to me.
>
>
> We can't change behavior like this. Every time we've tried to
> do it, we've been burnt. Remember nonlocal-bind?

I'll try to write up a patch that uses the TSC and lazy conversion
to timeval as soon as I get the rx-all and rx-fcs code happily
into the kernel....

Assuming TSC is very fast and the conversion is accurate enough, I think
this can give good results....

Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2003-11-26 23:44:34

by Jamie Lokier

[permalink] [raw]
Subject: Re: Fire Engine??

David S. Miller wrote:
> > recvmsg() doesn't return timestamps until they are requested
> > using setsockopt(...SO_TIMESTAMP...).
> >
> > See sock_recv_timestamp() in include/net/sock.h.
>
> See MSG_ERRQUEUE and net/ipv4/ip_sockglue.c

I don't see your point. The test for the SO_TIMESTAMP socket option
is _inside_ sock_recv_timestamp() (the flag is called sk_rcvtstamp).

The MSG_ERRQUEUE code simply calls sock_recv_timestamp(), which in
turn only reports the timestamp if the flag is set.

There are exactly two places where the timestamp is reported to
userspace, and both are at the request of userspace:

1. sock_recv_timestamp(), called from many places including
ip_sockglue.c. It _only_ reports it if SO_TIMESTAMP is
enabled for the socket.

2. inet_ioctl(SIOCGSTAMP)

Nowhere else is the timestamp reported to userspace.

-- Jamie

2003-11-27 00:01:40

by David Miller

[permalink] [raw]
Subject: Fast timestamps

On Wed, 26 Nov 2003 15:41:52 -0800
Ben Greear <[email protected]> wrote:

> I'll try to write up a patch that uses the TSC and lazy conversion
> to timeval as soon as I get the rx-all and rx-fcs code happily
> into the kernel....
>
> Assuming TSC is very fast and the conversion is accurate enough, I think
> this can give good results....

I'm amazed that you will be able to write a fast_timestamp
implementation without even seeing the API I had specified
to the various arch maintainers :-)

====================

But at the base I say we need three things:

1) Some kind of fast_timestamp_t; the property is that this stores
enough information at time "T" such that at time "T + something"
the fast_timestamp_t can be converted back to what the timeval was
at time "T".

For networking, make skb->stamp into this type.

2) store_fast_timestamp(fast_timestamp_t *)

For networking, change do_gettimeofday(&skb->stamp) into
store_fast_timestamp(&skb->stamp)

3) fast_timestamp_to_timeval(fast_timestamp_t *, struct timeval *)

For networking, change things that read the skb->stamp value
into calls to fast_timestamp_to_timeval().

It is defined that the timeval given by fast_timestamp_to_timeval()
needs to be the same thing that do_gettimeofday() would have recorded
at the time store_fast_timestamp() was called.

Here is the default generic implementation that would go into
asm-generic/faststamp.h:

1) fast_timestamp_t is struct timeval
2) store_fast_timestamp() is gettimeofday()
3) fast_timestamp_to_timeval() merely copies the fast_timestamp_t
into the passed in timeval.

And here is how an example implementation could work on sparc64:

1) fast_timestamp_t is a u64

2) store_fast_timestamp() reads the cpu cycle counter

3) fast_timestamp_to_timeval() records the difference between the
current cpu cycle counter and the one recorded, it takes a sample
of the current xtime value and adjusts it accordingly to account
for the cpu cycle counter difference.

This only works because sparc64's cpu cycle counters are synchronized
across all cpus, they increase monotonically, and are guaranteed not
to overflow for at least 10 years.

Alpha, for example, cannot do it this way because its cpu cycle counter
register overflows too quickly to be useful.

Platforms with inter-cpu TSC synchronization issues will have some
troubles doing the same trick too, because one must handle properly
the case where the fast timestamp is converted to a timeval on a different
cpu from the one on which it was recorded.

Regardless, we could put the infrastructure in there now and arch folks
can work on implementations. The generic implementation code, which is
what everyone will end up with at first, will cancel out to what we have
currently.

This is a pretty powerful idea that could be applied to other places,
not just the networking.

2003-11-27 00:25:36

by Mitchell Blank Jr

[permalink] [raw]
Subject: Re: Fast timestamps

David S. Miller wrote:
> Ben Greear <[email protected]> wrote:
> > I'll try to write up a patch that uses the TSC and lazy conversion
> > to timeval as soon as I get the rx-all and rx-fcs code happily
> > into the kernel....
> >
> > Assuming TSC is very fast and the conversion is accurate enough, I think
> > this can give good results....
>
> I'm amazed that you will be able to write a fast_timestamp
> implementation without even seeing the API I had specified
> to the various arch maintainers :-)

Also, anyone interested in doing this should probably re-read the thread
on netdev from a couple months back about this, since we hashed out some
implementation details wrt SMP efficiency:
http://oss.sgi.com/projects/netdev/archive/2003-10/msg00032.html

Although reading this thread I'm feeling that Andi is probably right -
are there really any apps that couldn't cope with a small inaccuracy in the
first ioctl-fetched timestamp? I really doubt it. Basically there are
two common cases:
1. System under reasonable network load: in this case the tcpdump (or
whatever) probably will get the packet soon after it arrived, so
the timestamp we compute for the first packet won't be very far off.
2. System under heavy network load: the card's hardware rx queues are
probably pretty full, so our timestamps won't be very accurate
no matter what we do.

Given that the timestamp is already inexact it seems like a fine idea to
trade a tiny amount of accuracy for a potentially significant performance
improvement.

-Mitch

2003-11-27 01:57:53

by Ben Greear

[permalink] [raw]
Subject: Re: Fast timestamps

David S. Miller wrote:
> On Wed, 26 Nov 2003 15:41:52 -0800
> Ben Greear <[email protected]> wrote:
>
>
>>I'll try to write up a patch that uses the TSC and lazy conversion
>>to timeval as soon as I get the rx-all and rx-fcs code happily
>>into the kernel....
>>
>>Assuming TSC is very fast and the conversion is accurate enough, I think
>>this can give good results....
>
>
> I'm amazed that you will be able to write a fast_timestamp
> implementation without even seeing the API I had specified
> to the various arch maintainers :-)

Well, I would only aim at x86, with generic code for the
rest of the architectures. The truth is, I'm sure others would
be better/faster at it than me, but we keep discussing it, and it
never gets done, so unless someone beats me to it, I'll take a stab
at it... Might be after Christmas though, busy December coming up!

I agree with your approach below. One thing I was thinking about:
is it possible that two threads ask for the timestamp of a single skb
concurrently? If so, we may need a lock if we want to cache the conversion
to gettimeofday units.... Of course, the case where multiple readers want
the timestamp for a single skb may be too rare to warrant caching...

Ben

>
> ====================
>
> But at the base I say we need three things:
>
> 1) Some kind of fast_timestamp_t, the property is that this stores
> enough information at time "T" such that at time "T + something"
> the fast_timestamp_t can be converted back to what the timeval was at
> time "T".
>
> For networking, make skb->stamp into this type.
>
> 2) store_fast_timestamp(fast_timestamp_t *)
>
> For networking, change do_gettimeofday(&skb->stamp) into
> store_fast_timestamp(&skb->stamp)
>
> 3) fast_timestamp_to_timeval(fast_timestamp_t *, struct timeval *)
>
> For networking, change things that read the skb->stamp value
> into calls to fast_timestamp_to_timeval().
>
> It is defined that the timeval given by fast_timestamp_to_timeval()
> needs to be the same thing that do_gettimeofday() would have recorded
> at the time store_fast_timestamp() was called.
>
> Here is the default generic implementation that would go into
> asm-generic/faststamp.h:
>
> 1) fast_timestamp_t is struct timeval
> 2) store_fast_timestamp() is gettimeofday()
> 3) fast_timestamp_to_timeval() merely copies the fast_timestamp_t
> into the passed in timeval.
>
> And here is how an example implementation could work on sparc64:
>
> 1) fast_timestamp_t is a u64
>
> 2) store_fast_timestamp() reads the cpu cycle counter
>
> 3) fast_timestamp_to_timeval() records the difference between the
> current cpu cycle counter and the one recorded, it takes a sample
> of the current xtime value and adjusts it accordingly to account
> for the cpu cycle counter difference.
>
> This only works because sparc64's cpu cycle counters are synchronized
> across all cpus, they increase monotonically, and are guaranteed not
> to overflow for at least 10 years.
>
> Alpha, for example, cannot do it this way because its cpu cycle counter
> register overflows too quickly to be useful.
>
> Platforms with inter-cpu TSC synchronization issues will have some
> troubles doing the same trick too, because one must handle properly
> the case where the fast timestamp is converted to a timeval on a different
> cpu from the one on which it was recorded.
>
> Regardless, we could put the infrastructure in there now and arch folks
> can work on implementations. The generic implementation code, which is
> what everyone will end up with at first, will cancel out to what we have
> currently.
>
> This is a pretty powerful idea that could be applied to other places,
> not just the networking.


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2003-11-27 03:54:45

by Bill Huey

[permalink] [raw]
Subject: Re: Fire Engine??

On Wed, Nov 26, 2003 at 08:19:03PM +0100, Diego Calleja García wrote:
> It works here. I don't know if those numbers represent anything for networking.
> Some of the benchmarks look more like "vm benchmarking". And the ones which
> are measuring latency valid, considering that the BSDs are lacking "preempt"?
> (shooting in the dark)

FreeBSD-current is fully preemptive. The preempt patch, which adds
preemption points, is meaningless in that context.

bill

2003-11-27 12:18:45

by Ingo Oeser

[permalink] [raw]
Subject: Re: Fire Engine??


On Wednesday 26 November 2003 23:58, Andi Kleen wrote:
> On Wed, 26 Nov 2003 22:34:10 +0100
> Arjan van de Ven <[email protected]> wrote:
> > question: do we need a timestamp for every packet or can we do one
> > timestamp per irq-context entry ? (eg one timestamp at irq entry time we
> > do anyway and keep that for all packets processed in the softirq)
>
> If people want the timestamp they usually want it to be accurate
> (e.g. for tcpdump etc.). of course there is already a lot of jitter
> in this information because it is done relatively late in the device
> driver (long after the NIC has received the packet)
>
> Just most people never care about this at all....

Yes, these people who don't care just open a SOCK_STREAM or SOCK_DGRAM
socket; I don't see any field in msghdr which contains the time.

Other people have packet sockets (or other special stuff) open, which
are usually bound to a device or to a special RX/TX path. So we know
which device needs it and which does not.

If in doubt, there could be a sysctl option for exact time per device
or for all.

But I'm not really that familiar with the networking code, so please
ignore my ignorance on any issues here.


Regards

Ingo Oeser
