2001-02-24 23:25:38

by Jeff Garzik

Subject: New net features for added performance

Disclaimer: This is 2.5, repeat, 2.5 material.



I've talked about the following items with a couple people on this list
in private. I wanted to bring these up again, to see if anyone has
comments on the following suggested netdevice changes for the upcoming
2.5 development series of kernels.


1) Rx Skb recycling. It would be nice to have skbs returned to the
driver after the net core is done with them, rather than have netif_rx
free the skb. Many drivers pre-allocate a number of maximum-sized skbs
into which the net card DMA's data. If netif_rx returned the SKB
instead of freeing it, the driver could simply flip the DescriptorOwned
bit for that buffer, giving it immediately back to the net card.
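Something like the following is what I'm picturing in the driver's rx path
(sketch only -- netif_rx_done() is an invented name for a netif_rx() variant
that hands the skb back when the core is finished with it, and the descriptor
handling is tulip-style pseudo-code):

/* Sketch: recycle the rx skb instead of free + re-alloc. */
static void rx_recycle_example(struct net_device *dev, struct sk_buff *skb,
			       volatile u32 *desc_status, int pkt_len)
{
	skb_put(skb, pkt_len);
	skb->protocol = eth_type_trans(skb, dev);

	if (netif_rx_done(skb) == 0) {		/* hypothetical: skb comes back */
		skb->data = skb->tail = skb->head;	/* "clear" the head */
		skb->len = 0;
		skb_reserve(skb, 2);		/* keep the IP header aligned */
		*desc_status = cpu_to_le32(DescOwned);	/* flip ownership to NIC */
	}
	/* else the core kept it: refill with dev_alloc_skb() as today */
}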

Advantages: A de-allocation immediately followed by a reallocation is
eliminated, less L1 cache pollution during interrupt handling.
Potentially less DMA traffic between card and host.

Disadvantages?



2) Tx packet grouping. If the net core has knowledge that more packets
will be following the current one being sent to dev->hard_start_xmit(),
it should pass that knowledge on to dev->hard_start_xmit(), either as an
estimated number yet-to-be-sent, or just as a flag that "more is
coming."
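Concretely, even a single bit would do. Something like this (dev->tx_more is
a purely invented field, set by the queueing layer when it already knows
another packet is queued right behind this one; the helpers are made up too):

static int foo_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct foo_private *np = dev->priv;
	int entry = np->cur_tx % TX_RING_SIZE;

	foo_queue_tx_desc(np, entry, skb);	/* hypothetical helper */

	if (!dev->tx_more)		/* nothing else coming: kick the NIC  */
		foo_tx_doorbell(np);	/* now, and take the Tx irq when done */
	/* else: defer the doorbell / irq until the last packet of the burst */

	return 0;
}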

Advantages: This lets the net driver make smarter decisions about Tx
interrupt mitigation, Tx buffer queueing, etc.

Disadvantages? Can this sort of knowledge be obtained by a netdevice
right now, without any kernel modifications?



3) Slabbier packet allocation. Even though skb allocation is decently
fast, you are still looking at an skb buffer head grab and a kmalloc,
for each [dev_]alloc_skb call. I was wondering if it would be possible
to create a helper function for drivers which would improve the hot-path
considerably:

static struct sk_buff *ether_alloc_skb(int size)
{
	/* Fast path: the request fits in one of the preallocated, max-sized skbs. */
	if (size <= preallocated_skb_list->skb->size) {
		struct sk_buff *skb = dequeue_skb_from_list();

		if (preallocate_size < low_water_limit)
			schedule_tasklet(refill_skb_list);	/* top the list back up */
		return skb;
	}
	/* Slow path: oversized request (or empty list) -- normal allocator. */
	return dev_alloc_skb(size);
}

The skbs from this list would be allocated by a tasklet in the
background to the maximum size requested by the ethernet driver. If you
wanted to waste even more memory, you could allocate from per-CPU
lists..
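The refill side might look something like this (again just a sketch, all
names invented):

/* Tasklet body: top the list back up to its high-water mark with
 * max-sized, skb_reserve()'d skbs, outside of interrupt context. */
static void refill_skb_list(unsigned long data)
{
	struct sk_buff *skb;

	while (preallocate_size < high_water_limit) {
		skb = dev_alloc_skb(max_rx_size);	/* may fail under pressure */
		if (!skb)
			break;			/* try again on the next run */
		skb_reserve(skb, 2);		/* pre-align the IP header */
		enqueue_skb_to_list(skb);
	}
}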

Disadvantages? Doing this might increase cache pollution due to
increased code and data size, but I think the hot path is much improved
(dequeue a properly sized, initialized, skb-reserved'd skb off a list)
and would help mitigate the impact of sudden bursts of traffic.



--
Jeff Garzik | "You see, in this world there's two kinds of
Building 1024 | people, my friend: Those with loaded guns
MandrakeSoft | and those who dig. You dig." --Blondie


2001-02-24 23:49:18

by Andi Kleen

Subject: Re: New net features for added performance

Jeff Garzik <[email protected]> writes:

> Advantages: A de-allocation immediately followed by a reallocation is
> eliminated, less L1 cache pollution during interrupt handling.
> Potentially less DMA traffic between card and host.
>
> Disadvantages?

You need a new mechanism to cope with low memory situations because the
drivers can tie up quite a bit of memory (in fact you gave up unified
memory management).

> 3) Slabbier packet allocation. Even though skb allocation is decently
> fast, you are still looking at an skb buffer head grab and a kmalloc,
> for each [dev_]alloc_skb call. I was wondering if it would be possible
> to create a helper function for drivers which would improve the hot-path
> considerably:
[...]

If you need such a horror it just means there is something wrong with slab.
Better fix slab.


4) Better support for aligned RX by only copying the header, not the whole
packet, to end up with an aligned IP header. Unless the driver knows about
all protocol lengths this means the stack needs to support "parse header
in this buffer, then switch to other buffer with computed offset for data"

-Andi

2001-02-25 00:12:40

by Andi Kleen

Subject: Re: New net features for added performance

On Sat, Feb 24, 2001 at 07:03:38PM -0500, Jeff Garzik wrote:
> Andi Kleen wrote:
> >
> > Jeff Garzik <[email protected]> writes:
> >
> > > Advantages: A de-allocation immediately followed by a reallocation is
> > > eliminated, less L1 cache pollution during interrupt handling.
> > > Potentially less DMA traffic between card and host.
> > >
> > > Disadvantages?
> >
> > You need a new mechanism to cope with low memory situations because the
> > drivers can tie up quite a bit of memory (in fact you gave up unified
> > memory management).
>
> I think you misunderstand.. netif_rx frees the skb. In this example:
>
> netif_rx(skb); /* free skb of size PKT_BUF_SZ */
> skb = dev_alloc_skb(PKT_BUF_SZ)
>
> an alloc of a PKT_BUF_SZ'd skb immediately follows a free of a
> same-sized skb. 100% of the time.

Free/Alloc gives the mm the chance to throttle it by failing, and also to
recover from fragmentation by packing the slabs. If you don't do it you need
to add a hook somewhere that gets triggered on low memory situations and
frees the buffers.

> > 4) Better support for aligned RX by only copying the header, not the whole
> > packet, to end up with an aligned IP header. Unless the driver knows about
> > all protocol lengths this means the stack needs to support "parse header
> > in this buffer, then switch to other buffer with computed offset for data"
>
> This requires scatter-gather hardware support, right? If so, would this
> support only exist for checksumming hardware -- like the current
> zerocopy -- or would non-checksumming SG hardware like tulip be
> supported too?

It doesn't need any hardware support. In fact it is especially helpful for
the tulip. The idea is that instead of copying the whole packet to get an
aligned header (e.g. on the alpha or other boxes where unaligned accesses
are very expensive) you just copy the first 128 byte that probably contain
the header. For the data it doesn't matter much if it's unaligned; copy_to_user
and csum_copy_to_user can deal with that fine.
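In driver terms, roughly (sketch only; dma_buf/pkt_len stand for the card's
receive buffer and packet length, and 128 is just a "probably covers all the
headers" guess):

/* Copy only the first bytes into a small skb whose IP header is aligned;
 * the bulk of the data stays in the original (unaligned) DMA buffer. */
int hdr_len = pkt_len < 128 ? pkt_len : 128;
struct sk_buff *skb = dev_alloc_skb(hdr_len + 2);

skb_reserve(skb, 2);			/* IP header now 4-byte aligned */
memcpy(skb_put(skb, hdr_len), dma_buf, hdr_len);
/* the stack then needs the "headers here, data over there at a computed
 * offset" support described above to pick up the rest of the packet. */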


-Andi

2001-02-25 00:04:10

by Jeff Garzik

Subject: Re: New net features for added performance

Andi Kleen wrote:
>
> Jeff Garzik <[email protected]> writes:
>
> > Advantages: A de-allocation immediately followed by a reallocation is
> > eliminated, less L1 cache pollution during interrupt handling.
> > Potentially less DMA traffic between card and host.
> >
> > Disadvantages?
>
> You need a new mechanism to cope with low memory situations because the
> drivers can tie up quite a bit of memory (in fact you gave up unified
> memory management).

I think you misunderstand.. netif_rx frees the skb. In this example:

netif_rx(skb); /* free skb of size PKT_BUF_SZ */
skb = dev_alloc_skb(PKT_BUF_SZ)

an alloc of a PKT_BUF_SZ'd skb immediately follows a free of a
same-sized skb. 100% of the time.

It seems an obvious shortcut to me, to have __netif_rx or similar
-clear- the skb head not free it. No changes to memory management or
additional low memory situations created by this, AFAICS.


> 4) Better support for aligned RX by only copying the header, not the whole
> packet, to end up with an aligned IP header. Unless the driver knows about
> all protocol lengths this means the stack needs to support "parse header
> in this buffer, then switch to other buffer with computed offset for data"

This requires scatter-gather hardware support, right? If so, would this
support only exist for checksumming hardware -- like the current
zerocopy -- or would non-checksumming SG hardware like tulip be
supported too?

Jeff


--
Jeff Garzik | "You see, in this world there's two kinds of
Building 1024 | people, my friend: Those with loaded guns
MandrakeSoft | and those who dig. You dig." --Blondie

2001-02-25 00:13:52

by Jeff Garzik

Subject: Re: New net features for added performance

Jeff Garzik wrote:
>
> Andi Kleen wrote:
> >
> > Jeff Garzik <[email protected]> writes:
> >
> > > Advantages: A de-allocation immediately followed by a reallocation is
> > > eliminated, less L1 cache pollution during interrupt handling.
> > > Potentially less DMA traffic between card and host.
> > >
> > > Disadvantages?
> >
> > You need a new mechanism to cope with low memory situations because the
> > drivers can tie up quite a bit of memory (in fact you gave up unified
> > memory management).
>
> I think you misunderstand.. netif_rx frees the skb. In this example:
>
> netif_rx(skb); /* free skb of size PKT_BUF_SZ */
> skb = dev_alloc_skb(PKT_BUF_SZ)
>
> an alloc of a PKT_BUF_SZ'd skb immediately follows a free of a
> same-sized skb. 100% of the time.
>
> It seems an obvious shortcut to me, to have __netif_rx or similar
> -clear- the skb head not free it. No changes to memory management or
> additional low memory situations created by this, AFAICS.

Sorry... I should also point out that I was thinking of tulip
architecture and similar architectures, where you have a fixed number of
Skbs allocated at all times, and that number doesn't change for the
lifetime of the driver.

Clearly not all cases would benefit from skb recycling, but there are a
number of rx-ring-based systems where this would be useful, and (AFAICS)
reduce the work needed to be done by the system, and reduce the amount
of overall DMA traffic by a bit.

Jeff



--
Jeff Garzik | "You see, in this world there's two kinds of
Building 1024 | people, my friend: Those with loaded guns
MandrakeSoft | and those who dig. You dig." --Blondie

2001-02-25 00:17:02

by Andi Kleen

Subject: Re: New net features for added performance

On Sat, Feb 24, 2001 at 07:13:14PM -0500, Jeff Garzik wrote:
> Sorry... I should also point out that I was thinking of tulip
> architecture and similar architectures, where you have a fixed number of
> Skbs allocated at all times, and that number doesn't change for the
> lifetime of the driver.
>
> Clearly not all cases would benefit from skb recycling, but there are a
> number of rx-ring-based systems where this would be useful, and (AFAICS)
> reduce the work needed to be done by the system, and reduce the amount
> of overall DMA traffic by a bit.

A simple way to do it currently is just to compare the new skb with the old
one. If it is the same, do a shortcut. That should usually work out when the
system has enough memory.


-Andi

2001-02-25 01:58:20

by Michael Richardson

Subject: Re: New net features for added performance


>>>>> "Jeff" == Jeff Garzik <[email protected]> writes:
Jeff> 1) Rx Skb recycling. It would be nice to have skbs returned to the
Jeff> driver after the net core is done with them, rather than have netif_rx
Jeff> free the skb. Many drivers pre-allocate a number of maximum-sized skbs
Jeff> into which the net card DMA's data. If netif_rx returned the SKB
Jeff> instead of freeing it, the driver could simply flip the DescriptorOwned
Jeff> bit for that buffer, giving it immediately back to the net card.

Jeff> Disadvantages?

netif_rx() would have to copy the buffer.

Right now, it just puts it on the queue towards the BH. For it to return
the skb would require that all processing occur inside of netif_rx() (a la BSD),
or that it copy the buffer.

Jeff> 3) Slabbier packet allocation. Even though skb allocation is decently
Jeff> fast, you are still looking at an skb buffer head grab and a

I think that if you had this, and you also returned skb's to this list on
a per-device basis (change skb->free, I think) instead of to the general
pool, you could probably eliminate your request #1.

] Train travel features AC outlets with no take-off restrictions|gigabit is no[
] Michael Richardson, Solidum Systems Oh where, oh where has|problem with[
] [email protected] http://www.solidum.com the little fishy gone?|PAX.port 1100[
] panic("Just another NetBSD/notebook using, kernel hacking, security guy"); [

2001-02-25 02:37:44

by Jeremy Jackson

Subject: Re: New net features for added performance

Jeff Garzik wrote:

(about optimizing kernel network code for busmastering NIC's)

> Disclaimer: This is 2.5, repeat, 2.5 material.

Related question: are there any 100Mbit NICs with cpu's onboard?
Something mainstream/affordable? (i.e. not 1G ethernet)
Just recently someone posted asking some technical question about
ARMlinux for an Intel card with 2 1G ports, 8 100M ports,
an onboard ARM cpu and 4 other uControllers... seems to me
that ultimately the networking code should go in that direction:
imagine having the *NIC* do most of this... no cache pollution problems...

2001-02-25 02:38:34

by Noah Romer

Subject: Re: New net features for added performance

On Sat, 24 Feb 2001, Jeff Garzik wrote:

> Disclaimer: This is 2.5, repeat, 2.5 material.
[snip]
> 1) Rx Skb recycling. It would be nice to have skbs returned to the
> driver after the net core is done with them, rather than have netif_rx
> free the skb. Many drivers pre-allocate a number of maximum-sized skbs
> into which the net card DMA's data. If netif_rx returned the SKB
> instead of freeing it, the driver could simply flip the DescriptorOwned
> bit for that buffer, giving it immediately back to the net card.
>
> Advantages: A de-allocation immediately followed by a reallocation is
> eliminated, less L1 cache pollution during interrupt handling.
> Potentially less DMA traffic between card and host.

This could be quite useful for the network driver I maintain (it's made
it to the -ac patch set for 2.4, but not yet into the main kernel
tarball). At the moment, it allocates 127 "buckets" (skb's under linux)
at start of day and posts them to the card. After that, it maintains a
minimum of 80 data buffers available to the card at any one time. There's
a noticeable performance hit when the driver has to reallocate new skbs
to keep above the threshold. I try to recycle as much as possible w/in the
driver (i.e. really small incoming packets get a new skb allocated for
them and the original buffer is put back on the queue), but it would be
nice to be able to recycle even more of the skbs.

> Disadvantages?

As has been pointed out, there's a certain loss of control over allocation
of memory (could check for low memory conditions before sending the skb
back to the driver, but . . .). I do see a failure to allocate all 127
skbs, occasionally, when the driver is first loaded (only way to get
around this is to reboot the system).

> 2) Tx packet grouping. If the net core has knowledge that more packets
> will be following the current one being sent to dev->hard_start_xmit(),
> it should pass that knowledge on to dev->hard_start_xmit(), either as an
> estimated number yet-to-be-sent, or just as a flag that "more is
> coming."
>
> Advantages: This lets the net driver make smarter decisions about Tx
> interrupt mitigation, Tx buffer queueing, etc.
>
> Disadvantages? Can this sort of knowledge be obtained by a netdevice
> right now, without any kernel modifications?

In my experience, Tx interrupt mitigation is of little benefit. I actually
saw a performance increase of ~20% when I turned off Tx interrupt
mitigation in my driver (could have been poor implementation on my part).

--
Noah Romer |"Calm down, it's only ones and zeros." - this message
[email protected] |brought to you by The Network
PGP key available |"Time will have its say, it always does." - Celltrex
by finger or email |from Flying to Valhalla by Charles Pellegrino

2001-02-25 03:24:23

by Chris Wedgwood

Subject: Re: New net features for added performance

On Sat, Feb 24, 2001 at 09:32:59PM -0500, Jeremy Jackson wrote:

Related question: are there any 100Mbit NICs with cpu's onboard?

Yes, but the only ones I've seen to date are magic and do special
things (like VPN or hardware crypto). I'm not sure without 'magic'
requirements there is much point for 100M on modern hardware.

Not affordable, and moving some of the IP stack onto the card
(I think this is what you are alluding to) would be extremely non-trivial,
especially if you want all the components (host OS, multiple network
cards) to talk to each other asynchronously; you would also have to
deal with buggy hardware that doesn't like doing PCI-PCI transfers
and such like.

That said, it would be an extremely neat thing to do from a technical
perspective, but I don't know if you would ever get really good
performance from it.




--cw

2001-02-25 12:02:23

by Andrew Morton

Subject: Re: New net features for added performance

Jeff Garzik wrote:
>
>...
> 1) Rx Skb recycling.
>...
> 2) Tx packet grouping.
>...
> 3) Slabbier packet allocation.

Let's see what the profiler says. 10 seconds of TCP xmit
followed by 10 seconds of TCP receive. 100 mbits/sec.
Kernel 2.4.2+ZC.

c0119470 do_softirq 97 0.7132
c020e718 ip_output 99 0.3694
c020a2c8 ip_route_input 103 0.2893
c01fdc4c skb_release_data 113 1.0089
c021312c tcp_sendmsg 113 0.0252
c0129c64 kmalloc 117 0.3953
c0112efc __wake_up_sync 128 0.6667
c01fdd24 __kfree_skb 153 0.6071
c020e824 ip_queue_xmit 154 0.1149
c011be80 del_timer 163 2.2639
c0222fac tcp_v4_rcv 173 0.1022
c010a778 handle_IRQ_event 178 1.4833
c01127fc schedule 200 0.1259
c01d39f8 boomerang_rx 332 0.2823
c024284c csum_partial_copy_generic 564 2.2742
c01d2c84 boomerang_start_xmit 654 0.9033
c0242b3c __generic_copy_from_user 733 12.2167
c01d329c boomerang_interrupt 910 0.8818
c01071f4 poll_idle 41813 1306.6562
00000000 total 48901 0.0367

7088 non-idle ticks.
153+117+113 = 383 ticks in skb/memory type functions.

So, naively, the most which can be saved here by optimising
the skb and memory usage is 5% of networking load. (1% of
system load @100 mbps)

Total device driver cost is 27% of the networking load.

All the meat is in the interrupt load. The 3com driver
transfers about three packets per interrupt. Here's
the system load (dual CPU):

Doing 100mbps TCP send with netperf: 14.9%
Doing 100mbps TCP receive with netperf: 23.3%

When tx interrupt mitigation is disabled we get 1.5 packets
per interrupt doing transmit:

Doing 100mbps TCP send with netperf: 16.1%
Doing 100mbps TCP receive with netperf: 24.0%

So a 2x reduction in interrupt frequency on TCP transmit has
saved 1.2% of system load. That's 8% of networking load, and,
presumably, 30% of the driver load. That all seems to make sense.


The moral?

- Tuning skb allocation isn't likely to make much difference.
- At the device-driver level the most effective thing is
to reduce the number of interrupts.
- If we can reduce the driver cost to *zero*, we improve
TCP efficiency by 27%.
- At the system level the most important thing is to rewrite
applications to use sendfile(). (But Rx is more expensive
than Tx, so even this ain't the main game).

I agree that batching skbs into hard_start_xmit() may allow
some driver optimisations. Pass it a vector of skbs rather
than one, and let it return an indication of how many were
actually consumed. But we'd need to go through an exercise
like the above beforehand - it may not be worth the
protocol-level trauma.
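For concreteness, the shape I have in mind is something like this (names
invented):

/* Batched xmit hook: queue up to nr skbs, return how many were taken. */
int (*hard_start_xmit_batch)(struct sk_buff **skbs, int nr,
			     struct net_device *dev);

/* caller side (qdisc_restart()-ish pseudo-code): */
sent = dev->hard_start_xmit_batch(skbs, nr, dev);
for (i = sent; i < nr; i++)
	requeue_skb(skbs[i], dev);	/* driver ring was full */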

I suspect that a thorough analysis of the best way to
use Linux networking, and then a rewrite of important
applications so they use the result of that analysis
would pay dividends.

-

2001-02-25 12:23:39

by Werner Almesberger

Subject: Re: New net features for added performance

Jeff Garzik wrote:
> 1) Rx Skb recycling.

Sounds like a potentially useful idea. To solve the most immediate memory
pressure problems, maybe VM could provide some function that does a kfree
in cases of memory shortage, and that does nothing otherwise, so the
driver could offer to free the skb after netif_rx. You still need to go
over the list in idle periods, though.
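E.g. something along these lines (purely hypothetical names):

/* The VM frees the buffer only when memory is actually tight; otherwise
 * the driver keeps it on its recycle list. */
static inline struct sk_buff *skb_offer_to_vm(struct sk_buff *skb)
{
	if (vm_memory_pressure()) {	/* invented predicate */
		kfree_skb(skb);
		return NULL;		/* driver re-allocates later */
	}
	return skb;			/* keep it for recycling */
}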

> 2) Tx packet grouping.

Hmm, I think we need an estimate of how long a packet train you'd usually
get. A flag looks reasonably inexpensive. Estimated numbers sound like
over-engineering.

> Disadvantages? Can this sort of knowledge be obtained by a netdevice
> right now, without any kernel modifications?

Question is what the hardware really needs. If you can change the
interrupt point easily, it's probably cheapest to do all the work in
hard_start_xmit.

> 3) Slabbier packet allocation.

Hmm, this may actually be worse during bursts: if your burst exceeds
the preallocated size, you have to perform more expensive/slower
operations (e.g. running a tasklet) to refill your cache.

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, ICA, EPFL, CH [email protected] /
/_IN_N_032__Tel_+41_21_693_6621__Fax_+41_21_693_6610_____________________/

2001-02-25 12:42:44

by Werner Almesberger

Subject: Re: New net features for added performance

Chris Wedgwood wrote:
> That said, it would be an extremely neat thing to do from a technical
> perspective, but I don't know if you would ever get really good
> performance from it.

Well, you'd have to re-design the networking code to support NUMA
architectures, with a fairly fine granularity. I'm not sure you'd gain
anything except possibly for the forwarding fast path.

A cheaper, and probably more useful possibility is hardware assistance for
specific operations. E.g. hardware-accelerated packet classification looks
interesting. I'd also like to see hardware-assistance for shaping on other
media than ATM.

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, ICA, EPFL, CH [email protected] /
/_IN_N_032__Tel_+41_21_693_6621__Fax_+41_21_693_6610_____________________/

2001-02-25 13:20:41

by Jonathan Morton

Subject: Re: New net features for added performance

At 2:32 am +0000 25/2/2001, Jeremy Jackson wrote:
>Jeff Garzik wrote:
>
>(about optimizing kernel network code for busmastering NIC's)
>
>> Disclaimer: This is 2.5, repeat, 2.5 material.
>
>Related question: are there any 100Mbit NICs with cpu's onboard?
>Something mainstream/affordable? (i.e. not 1G ethernet)
>Just recently someone posted asking some technical question about
>ARMlinux for an Intel card with 2 1G ports, 8 100M ports,
>an onboard ARM cpu and 4 other uControllers... seems to me
>that ultimately the networking code should go in that direction:
>imagine having the *NIC* do most of this... no cache pollution problems...

Dunno, but the latest Motorola ColdFire microcontroller has Ethernet built
in. I think it's even 100baseTX, but I could be mistaken.

--------------------------------------------------------------
from: Jonathan "Chromatix" Morton
mail: [email protected] (not for attachments)
big-mail: [email protected]
uni-mail: [email protected]

The key to knowledge is not to rely on people to teach you it.

Get VNC Server for Macintosh from http://www.chromatix.uklinux.net/vnc/

-----BEGIN GEEK CODE BLOCK-----
Version 3.12
GCS$/E/S dpu(!) s:- a20 C+++ UL++ P L+++ E W+ N- o? K? w--- O-- M++$ V? PS
PE- Y+ PGP++ t- 5- X- R !tv b++ DI+++ D G e+ h+ r- y+
-----END GEEK CODE BLOCK-----


2001-02-25 13:58:00

by Chris Wedgwood

Subject: Re: New net features for added performance

On Sun, Feb 25, 2001 at 01:41:56PM +0100, Werner Almesberger wrote:

Well, you'd have to re-design the networking code to support NUMA
architectures, with a fairly fine granularity. I'm not sure you'd
gain anything except possibly for the forwarding fast path.

I'm not convinced that for a general purpose OS you would gain anything at
all; but as an intellectual exercise it's a fascinating idea.

It'd make a good PhD thesis.



--cw

2001-02-25 15:16:26

by Jeremy Jackson

Subject: Re: New net features for added performance

Andrew Morton wrote:

(kernel profile of TCP tx/rx)

> So, naively, the most which can be saved here by optimising
> the skb and memory usage is 5% of networking load. (1% of
> system load @100 mbps)

That's for local tx/rx, though. (Open question:) What happens with
a router box with netfilter and queueing? Perhaps
this type of optimisation will help more in that case?

Think about a box with 4 1G NICs being able to
route AND do QoS per conntrack connection
(a la RSVP and such).

Really what I'm looking for is something like SGI's
STP (Scheduled Transfer Protocol). mmap your
tcp receive buffer, and have a card smart enough
to figure out header alignment, (i.e. know header
size based on protocol number) transfer only that,
let the kernel process it, then tell the card to DMA
the data from the buffer right into process memory.
(or other NIC)

Make it possible to have the performance of a
Juniper network processor + the flexibility of Linux.

2001-02-25 23:47:14

by Rusty Russell

Subject: Re: New net features for added performance

In message <[email protected]> you write:
> Jeff Garzik <[email protected]> writes:
>
> > Advantages: A de-allocation immediately followed by a reallocation is
> > eliminated, less L1 cache pollution during interrupt handling.
> > Potentially less DMA traffic between card and host.
> >
> > Disadvantages?
>
> You need a new mechanism to cope with low memory situations because the
> drivers can tie up quite a bit of memory (in fact you gave up unified
> memory management).

Also, you still need to "clean" the skb (it can hold device and nfct
references).

Rusty.
--
Premature optmztion is rt of all evl. --DK

2001-02-26 23:52:20

by David Miller

Subject: Re: New net features for added performance


Andi Kleen writes:
> 4) Better support for aligned RX by only copying the header

Andi you can make this now:

1) Add new "post-header data pointer" field in SKB.
2) Change drivers to copy into aligned headroom as
you mention, and they set this new post-header
pointer as appropriate. For normal drivers without
alignment problem, generic code sets the pointer up
just like it does the rest of the SKB header pointers
now.
3) Enforce correct usage of it in all the networking :-)

I would definitely accept such a patch for the 2.5.x
series. It seems to be a nice idea and I currently see
no holes in it.
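Roughly (field name invented, just to show the idea):

/* One new pointer in struct sk_buff. Drivers that copy the headers into
 * an aligned area set it to where the payload really lives; for everyone
 * else, generic code sets it up (or NULL could mean "no split"). */
unsigned char *post_hdr;

/* protocol code, once the headers have been parsed: */
data = skb->post_hdr ? skb->post_hdr : skb->data + hdr_len;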

Later,
David S. Miller
[email protected]

2001-02-27 00:04:22

by Andi Kleen

Subject: Re: New net features for added performance

On Mon, Feb 26, 2001 at 03:48:31PM -0800, David S. Miller wrote:
>
> Andi Kleen writes:
> > 4) Better support for aligned RX by only copying the header
>
> Andi you can make this now:
>
> 1) Add new "post-header data pointer" field in SKB.

That would imply letting the drivers parse all headers to figure out the length.
I think it's better to have a "header base" and "data base" pointer.
The driver would just copy some standard size that likely contains all of
the headers.
When you're finished with the header, use
skb->database+(skb->hdrptr-skb->hdrbase) to get the start of data.

Or did I misunderstand you?



> 3) Enforce correct usage of it in all the networking :-)

,) -- the tricky part.


-Andi

2001-02-27 00:08:42

by Jeff Garzik

Subject: Re: New net features for added performance

"David S. Miller" wrote:
> Jeff Garzik writes:
> > 2) Tx packet grouping.
> ...
> > Disadvantages?
>
> See Torvalds vs. world discussion on this list about API entry points
> which pass multiple pages at a time versus simpler ones which pass
> only a single page at a time. :-)

I only want to know if more are coming, not actually pass multiples..

Jeff



--
Jeff Garzik | "You see, in this world there's two kinds of
Building 1024 | people, my friend: Those with loaded guns
MandrakeSoft | and those who dig. You dig." --Blondie

2001-02-26 23:50:10

by David Miller

Subject: Re: New net features for added performance


Jeff Garzik writes:
> 1) Rx Skb recycling.
...
> Advantages: A de-allocation immediately followed by a reallocation is
> eliminated, less L1 cache pollution during interrupt handling.
> Potentially less DMA traffic between card and host.
...
> Disadvantages?

It simply cannot work; as Alexey stated, in normal circumstances
netif_rx() queues until the user reads the data. This is the whole
basis of our receive packet processing model within softint/user
context.

Secondly, I can argue that skb recycling can give _worse_ cache
performance. If the next use and access by the card to the
skb data is deferred, this gives the cpu a chance to displace those
lines in its cache naturally via displacement instead of being forced
quickly to do so when the device touches that data.

If the device forces the cache displacement, those cache lines become
empty until filled with something later (smaller utilization of total
cache contents) whereas natural displacement puts useful data into
the cache at the time of the displacement (larger utilization of total
cache contents).

It is an NT/Windows driver-API rubbish idea, and it is full of crap.

> 2) Tx packet grouping.
...
> Disadvantages?

See Torvalds vs. world discussion on this list about API entry points
which pass multiple pages at a time versus simpler ones which pass
only a single page at a time. :-)

> 3) Slabbier packet allocation.
...
> Disadvantages? Doing this might increase cache pollution due to
> increased code and data size, but I think the hot path is much improved
> (dequeue a properly sized, initialized, skb-reserved'd skb off a list)
> and would help mitigate the impact of sudden bursts of traffic.

I don't know what I think about this one, but my hunch is that it will
lead to worse data packing via such an allocator.

Later,
David S. Miller
[email protected]

2001-02-27 00:12:43

by David Miller

Subject: Re: New net features for added performance


Andi Kleen writes:
> Or did I misunderstand you?

What is wrong with making methods, keyed off of the ethernet protocol
ID, that can do the "I know where/how-long headers are" stuff for that
protocol? Only cards with the problem call into this function vector
or however we arrange it, and then for those that don't have these
problems at all we can make NULL a special value for this
"post-header" pointer.
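i.e. something like (sketch, names invented):

/* Per-ethertype "where do the headers end" helper, used only by drivers
 * that need to split header and payload into different buffers. */
struct eth_hdrlen_ops {
	unsigned short	proto;		/* htons(ETH_P_IP), etc. */
	int		(*hdr_len)(const unsigned char *buf, unsigned int len);
};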

You can pick some arbitrary number, sure, that is another way to
do it. Such a size would need to be chosen very carefully though.

Later,
David S. Miller
[email protected]

2001-02-27 00:14:16

by David Miller

Subject: Re: New net features for added performance


Jeff Garzik writes:
> I only want to know if more are coming, not actually pass multiples..

Ok, then my only concern is that the path from "I know more is coming"
down to hard_start_xmit invocation is long. It would mean passing a
new piece of state a long distance inside the stack from SKB origin to
device.

Later,
David S. Miller
[email protected]

2001-02-27 02:58:59

by Jeremy Jackson

Subject: Re: New net features for added performance

"David S. Miller" wrote:

> Andi Kleen writes:
> > Or did I misunderstand you?
>
> What is wrong with making methods, keyed off of the ethernet protocol
> ID, that can do the "I know where/how-long headers are" stuff for that
> protocol? Only cards with the problem call into this function vector
> or however we arrange it, and then for those that don't have these
> problems at all we can make NULL a special value for this
> "post-header" pointer.
>

I had a dream about a NIC that would do exactly the above by itself.
The dumb cards would use the above code, and the smart ones' drivers
would overload the functions and allow the NIC to do it.

"Tell me of the waters of your homeworld, Usul" :)

Except the driver interacts differently than netif_rx... knowing the
protocol, it DMA's the header only (it knows the length then too)

(SMC's epic100's descriptors can do this, but the card can't
do the de-mux on proto id, forcing the network core to run
in the ISR so the card can finish DMA and not exhaust its
tiny memory.) The network code can
then do all the routing/netfilter/QoS stuff, and tell the card to DMA
the payload into the TX queue of another NIC (or queue the header
with a pointer to the payload in the PCI address space of the incoming
NIC heh heh) OR into the process' mmap'ed TCP receive buffer
ala SGI's STP.



2001-02-27 20:00:02

by Alexey Kuznetsov

Subject: Re: New net features for added performance

Hello!

> > 3) Enforce correct usage of it in all the networking :-)
>
> ,) -- the tricky part.

No tricks, IP[v6] is already enforced to be clever; all the rest are free
to do this, if they desire. And btw, the driver need not parse anything
but its internal stuff, and even aligning the eth II header can be done
in eth_type_trans().

Actually, it is possible now without changing anything but the driver.
Fortunately, I removed stupid tulip from alpha, so that I have
no impetus to try this myself. 8)

Alexey

2001-03-01 20:35:13

by Pavel Machek

Subject: Re: New net features for added performance

Hi!

> > an alloc of a PKT_BUF_SZ'd skb immediately follows a free of a
> > same-sized skb. 100% of the time.
>
> Free/Alloc gives the mm the chance to throttle it by failing, and also to
> recover from fragmentation by packing the slabs. If you don't do it you need
> to add a hook somewhere that gets triggered on low memory situations and
> frees the buffers.

And what? It makes allocation longer lived. Our MM should survive that just
fine.

--
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.

2001-03-01 21:11:13

by Jes Sorensen

Subject: Re: New net features for added performance

>>>>> "Jeff" == Jeff Garzik <[email protected]> writes:

Jeff> 1) Rx Skb recycling. It would be nice to have skbs returned to
Jeff> the driver after the net core is done with them, rather than
Jeff> have netif_rx free the skb. Many drivers pre-allocate a number
Jeff> of maximum-sized skbs into which the net card DMA's data. If
Jeff> netif_rx returned the SKB instead of freeing it, the driver
Jeff> could simply flip the DescriptorOwned bit for that buffer,
Jeff> giving it immediately back to the net card.

Jeff> Advantages: A de-allocation immediately followed by a
Jeff> reallocation is eliminated, less L1 cache pollution during
Jeff> interrupt handling. Potentially less DMA traffic between card
Jeff> and host.

Jeff> Disadvantages?

I already tried this with the AceNIC GigE driver some time ago, and
after Ingo came up with a per-CPU slab patch the gain was gone. I am
not sure the complexity is worth it.

Jes

2001-03-03 23:33:13

by Jes Sorensen

Subject: Re: New net features for added performance

>>>>> "Noah" == Noah Romer <[email protected]> writes:

Noah> In my experience, Tx interrupt mitigation is of little
Noah> benefit. I actually saw a performance increase of ~20% when I
Noah> turned off Tx interrupt mitigation in my driver (could have been
Noah> poor implementation on my part).

You need to define performance increase here. TX interrupt coalescing
can still be a win in the system load department.

Jes

2001-03-04 01:07:27

by Steven J. Hill

Subject: LILO error with 2.4.3-pre1...

Hmm, needed 2.4.3-pre1 and went to install with LILO using
'lilo -v' and got this:

LILO version 21.4-4, Copyright (C) 1992-1998 Werner Almesberger
'lba32' extensions Copyright (C) 1999,2000 John Coffman

Reading boot sector from /dev/hda
Merging with /boot/boot.b
Boot image: /boot/vmlinuz-2.4.2
Added linux *
Boot image: /boot/vmlinuz-2.4.3-pre1
Fatal: geo_comp_addr: Cylinder number is too big (1274 > 1023)

Neato. I don't have time to dig through LILO source code right
now, so here are my system specs:

Linux Distribution: RedHat 6.2 with all latest updates
Hard Disk: Maxtor 52049H3 (20GB) IDE
CPU: Dual PII-266MHz
RAM: 256MB PC100
Result of 'fdisk /dev/hda -l':

Disk /dev/hda: 255 heads, 63 sectors, 2491 cylinders
Units = cylinders of 16065 * 512 bytes

Device Boot Start End Blocks Id System
/dev/hda1 * 1 1513 12153141 83 Linux
/dev/hda2 1514 1530 136552+ 82 Linux swap
/dev/hda3 1531 2491 7719232+ 83 Linux

I have no idea why the 1023 limit is coming up considering 2.4.2 and
LILO were working just fine together and I have a newer BIOS that has
no problems detecting the drive properly. Go ahead, call me idiot :).

-Steve

--
Steven J. Hill - Embedded SW Engineer
Public Key: 'http://www.cotw.com/pubkey.txt'
FPR1: E124 6E1C AF8E 7802 A815
FPR2: 7D72 829C 3386 4C4A E17D

2001-03-04 01:39:59

by Keith Owens

Subject: Re: LILO error with 2.4.3-pre1...

On Sat, 03 Mar 2001 19:19:28 -0600,
"Steven J. Hill" <[email protected]> wrote:
>I have no idea why the 1023 limit is coming up considering 2.4.2 and
>LILO were working just fine together and I have a newer BIOS that has
>no problems detecting the drive properly. Go ahead, call me idiot :).

OK, you're an idiot :). It only worked before because all the files
that lilo used just happened to be below cylinder 1024. Your partition
goes past cyl 1024 and your new kernel is using space above 1024. Find
a version of lilo that can cope with cyl >= 1024 (is there one?) or
move the kernel below cyl 1024. You might need to repartition your
disk to get / all below 1024.

2001-03-04 02:29:03

by Tom Sightler

Subject: Re: LILO error with 2.4.3-pre1...


----- Original Message -----
From: "Keith Owens" <[email protected]>
To: <[email protected]>
Cc: <[email protected]>
Sent: Saturday, March 03, 2001 8:39 PM
Subject: Re: LILO error with 2.4.3-pre1...


> On Sat, 03 Mar 2001 19:19:28 -0600,
> "Steven J. Hill" <[email protected]> wrote:
> >I have no idea why the 1023 limit is coming up considering 2.4.2 and
> >LILO were working just fine together and I have a newer BIOS that has
> >no problems detecting the drive properly. Go ahead, call me idiot :).
>
> OK, you're an idiot :). It only worked before because all the files
> that lilo used just happened to be below cylinder 1024. Your partition
> goes past cyl 1024 and your new kernel is using space above 1024.

I would agree with this explanation.

> Find a version of lilo that can cope with cyl >= 1024 (is there one?)

Uh, the version he has can cope with this, see the following:

> LILO version 21.4-4, Copyright (C) 1992-1998 Werner Almesberger
> 'lba32' extensions Copyright (C) 1999,2000 John Coffman

The lba32 extensions should take care of this, of course you have to add
'lba32' as a line in your lilo.conf before lilo actually uses them (and, I
assume, the BIOS must support the LBA extensions, but it seems most modern
ones do).
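i.e. near the top of /etc/lilo.conf:

	lba32		# use BIOS LBA32 calls, lifts the 1024-cylinder limit
	boot = /dev/hda

then re-run lilo.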

Give that a try. Works for me.

Later,
Tom


2001-03-04 02:39:46

by Andre Tomt

Subject: RE: LILO error with 2.4.3-pre1...

> 'lba32' extensions Copyright (C) 1999,2000 John Coffman
^^^^^^

Add lba32 as the top line in lilo.conf. Re-run lilo.

> Fatal: geo_comp_addr: Cylinder number is too big (1274 > 1023)

Before 2.4.3pre1, your kernel just happened to toss itself below cylinder
1024.

> Go ahead, call me idiot :).

Idiot. :-)

--
Regards,
Andre Tomt

2001-03-04 03:20:21

by Steven J. Hill

Subject: Re: LILO error with 2.4.3-pre1...

Andre Tomt wrote:
>
> > 'lba32' extensions Copyright (C) 1999,2000 John Coffman
> ^^^^^^
>
> Add lba32 as the top line in lilo.conf. Re-run lilo.
>
> > Fatal: geo_comp_addr: Cylinder number is too big (1274 > 1023)
>
> Before 2.4.3pre1, your kernel just happened to toss itself below cylinder
> 1024.
>
> > Go ahead, call me idiot :).
>
> Idiot. :-)
>
And since Andre was the last person to email me and call me an idiot,
I will reply to his response :). Yup, that was the case and I added
'lba32' to my '/etc/lilo.conf' and things work great. I knew it was
something simple, but I just don't pay attention to LILO much anymore.
Thanks everyone.

-Steve

--
Steven J. Hill - Embedded SW Engineer
Public Key: 'http://www.cotw.com/pubkey.txt'
FPR1: E124 6E1C AF8E 7802 A815
FPR2: 7D72 829C 3386 4C4A E17D

2001-03-04 13:33:13

by Alan

Subject: Re: LILO error with 2.4.3-pre1...

> LILO version 21.4-4, Copyright (C) 1992-1998 Werner Almesberger
> 'lba32' extensions Copyright (C) 1999,2000 John Coffman
>
> Boot image: /boot/vmlinuz-2.4.3-pre1
> Fatal: geo_comp_addr: Cylinder number is too big (1274 > 1023)
>
> I have no idea why the 1023 limit is coming up considering 2.4.2 and
> LILO were working just fine together and I have a newer BIOS that has
> no problems detecting the drive properly. Go ahead, call me idiot :).

You need to specify the lba32 option in your config

2001-03-04 21:36:37

by Mircea Damian

Subject: Re: LILO error with 2.4.3-pre1...

On Sun, Mar 04, 2001 at 12:39:32PM +1100, Keith Owens wrote:
> On Sat, 03 Mar 2001 19:19:28 -0600,
> "Steven J. Hill" <[email protected]> wrote:
> >I have no idea why the 1023 limit is coming up considering 2.4.2 and
> >LILO were working just fine together and I have a newer BIOS that has
> >no problems detecting the drive properly. Go ahead, call me idiot :).
>
> OK, you're an idiot :). It only worked before because all the files
> that lilo used just happened to be below cylinder 1024. Your partition
> goes past cyl 1024 and your new kernel is using space above 1024. Find
> a version of lilo that can cope with cyl >= 1024 (is there one?) or
> move the kernel below cyl 1024. You might need to repartition your
> disk to get / all below 1024.

Call me idiot too but please explain what is wrong here:

# cat /etc/lilo.conf
boot = /dev/hda
timeout = 150
vga = 4
ramdisk = 0
lba32
append = "hdc=scsi"
prompt


image = /boot/vmlinuz-2.4.2
root = /dev/hda2
read-only
label = Linux

other = /dev/hda3
label = win
table = /dev/hda

# fdisk -l /dev/hda

Disk /dev/hda: 255 heads, 63 sectors, 1650 cylinders
Units = cylinders of 16065 * 512 bytes

Device Boot Start End Blocks Id System
/dev/hda1 1 17 136521 82 Linux swap
/dev/hda2 18 1165 9221310 83 Linux
/dev/hda3 * 1166 1650 3895762+ c Win95 FAT32 (LBA)
root@taz:~# lilo -v
LILO version 21.7, Copyright (C) 1992-1998 Werner Almesberger
Linux Real Mode Interface library Copyright (C) 1998 Josh Vanderhoof
Development beyond version 21 Copyright (C) 1999-2001 John Coffman
Released 24-Feb-2001 and compiled at 18:31:02 on Mar 3 2001.

Reading boot sector from /dev/hda
Merging with /boot/boot.b
Boot image: /boot/vmlinuz-2.4.2
Added Linux *
Boot other: /dev/hda3, on /dev/hda, loader /boot/chain.b
Device 0x0300: Invalid partition table, 3rd entry
3D address: 63/254/141 (2281229)
Linear address: 1/0/1165 (18715725)


Mar 2 20:26:29 taz kernel: hda: IBM-DJNA-371350, ATA DISK drive
Mar 2 20:26:29 taz kernel: hda: 26520480 sectors (13578 MB) w/1966KiB Cache, CHS=1650/255/63


Is anybody able to explain the error?
That partition contains a valid VFAT partition with win98se installed on it (and it works fine,
ofc if I remove lilo from MBR).

--
Mircea Damian
E-mails: [email protected], [email protected]
WebPage: http://taz.mania.k.ro/~dmircea/

2001-03-04 23:06:10

by Guest section DW

Subject: Re: LILO error with 2.4.3-pre1...

On Sun, Mar 04, 2001 at 11:32:44PM +0200, Mircea Damian wrote:

> Call me idiot too but please explain what is wrong here:

What is wrong is that this is the kernel list, not the LILO list.

> root@taz:~# lilo -v
> LILO version 21.7, Copyright (C) 1992-1998 Werner Almesberger
> Device 0x0300: Invalid partition table, 3rd entry
> 3D address: 63/254/141 (2281229)
> Linear address: 1/0/1165 (18715725)

Read the README in the LILO distribution.

2001-03-12 15:10:07

by Jes Sorensen

Subject: Re: New net features for added performance

>>>>> "Werner" == Werner Almesberger <[email protected]> writes:

Werner> Jeff Garzik wrote:
>> 3) Slabbier packet allocation.

Werner> Hmm, this may actually be worse during bursts: if your burst
Werner> exceeds the preallocated size, you have to perform more
Werner> expensive/slower operations (e.g. running a tasklet) to refill
Werner> your cache.

You may want to look at how I did this in the acenic driver. If the
water mark goes below a certain level I schedule the tasklet; if it
gets below an urgent watermark I do the allocation in the interrupt
handler itself.
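In pseudo-code the decision looks roughly like this (thresholds and names
made up -- the real thing is in drivers/net/acenic.c):

if (rx_free_bufs < RX_REFILL_URGENT) {
	refill_rx_ring(dev);			/* can't wait: do it in the irq handler */
} else if (rx_free_bufs < RX_REFILL_LOW) {
	tasklet_schedule(&ap->refill_tasklet);	/* refill in the background */
}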

This is of course mainly useful for cards which give you deep
queues.

Jes