[Moving over to netdev]
> Hi Scott,
>
> Would you mind applying the attached patch for the e1000
> driver? The patch adds a "TSO" option which lets you disable
> TSO at boot-time/module-insertion time. This is useful for
> experimentation and, on fast machines, you can actually get
> better netperf numbers with TSO disabled. ;-)
This is interesting. I assume you're trying to get the best throughput
regardless of CPU utilization. Why are we getting lower throughput
rates? It's an open question for netdev. Do you have any data to
share?
> Note that I had to move the e1000_check_options() call to a
> slighly earlier place. You may want to double-check that
> it's really OK.
I'm not too keen on adding another module parameter. Maybe a
CONFIG_E1000_TSO option?
> The patch is relative to 2.5.75.
>
> Thanks,
>
> --david
>
> # This is a BitKeeper generated patch for the following
> project: # Project Name: Linux kernel tree # This patch
> format is intended for GNU patch command version 2.5 or
> higher. # This patch includes the following deltas:
> # ChangeSet 1.1379 -> 1.1380
> # drivers/net/e1000/e1000.h 1.33 -> 1.34
> # drivers/net/e1000/e1000_main.c 1.77 -> 1.78
> # drivers/net/e1000/e1000_param.c 1.21 -> 1.22
> #
> # The following is the BitKeeper ChangeSet Log
> # --------------------------------------------
> # 03/07/14 [email protected] 1.1380
> # Make it possible to disable TSO at module-load time (or
> boot-time). # This is controlled via the TSO parameter (e.g.,
> modprobe e1000 TSO=0,0,0,0 # will disable TSO on the first 4
> e1000 NICs). # --------------------------------------------
> #
> diff -Nru a/drivers/net/e1000/e1000.h b/drivers/net/e1000/e1000.h
> --- a/drivers/net/e1000/e1000.h Mon Jul 14 17:29:30 2003
> +++ b/drivers/net/e1000/e1000.h Mon Jul 14 17:29:30 2003
> @@ -194,6 +194,7 @@
> uint32_t tx_head_addr;
> uint32_t tx_fifo_size;
> atomic_t tx_fifo_stall;
> + boolean_t tso;
>
> /* RX */
> struct e1000_desc_ring rx_ring;
> diff -Nru a/drivers/net/e1000/e1000_main.c
> b/drivers/net/e1000/e1000_main.c
> --- a/drivers/net/e1000/e1000_main.c Mon Jul 14 17:29:30 2003
> +++ b/drivers/net/e1000/e1000_main.c Mon Jul 14 17:29:30 2003
> @@ -417,9 +417,12 @@
> netdev->features = NETIF_F_SG;
> }
>
> + e1000_check_options(adapter);
> +
> #ifdef NETIF_F_TSO
> if((adapter->hw.mac_type >= e1000_82544) &&
> - (adapter->hw.mac_type != e1000_82547))
> + (adapter->hw.mac_type != e1000_82547) &&
> + adapter->tso)
> netdev->features |= NETIF_F_TSO;
> #endif
>
> @@ -469,7 +472,6 @@
>
> printk(KERN_INFO "%s: Intel(R) PRO/1000 Network Connection\n",
> netdev->name);
> - e1000_check_options(adapter);
>
> /* Initial Wake on LAN setting
> * If APM wake is enabled in the EEPROM,
> diff -Nru a/drivers/net/e1000/e1000_param.c
> b/drivers/net/e1000/e1000_param.c
> --- a/drivers/net/e1000/e1000_param.c Mon Jul 14 17:29:30 2003
> +++ b/drivers/net/e1000/e1000_param.c Mon Jul 14 17:29:30 2003
> @@ -192,6 +192,16 @@
>
> E1000_PARAM(InterruptThrottleRate, "Interrupt Throttling Rate");
>
> +/* Enable TSO - TCP Segmentation Offload Enable/Disable
> + *
> + * Valid Range: 0, 1
> + * - 0 - disables TSO
> + * - 1 - enables TSO on NICs that are TSO capable
> + *
> + * Default Value: 1
> + */
> +E1000_PARAM(TSO, "Disable or enable TSO");
> +
> #define AUTONEG_ADV_DEFAULT 0x2F
> #define AUTONEG_ADV_MASK 0x2F
> #define FLOW_CONTROL_DEFAULT FLOW_CONTROL_FULL
> @@ -454,6 +464,18 @@
> } else {
> e1000_validate_option(&adapter->itr, &opt);
> }
> + }
> + { /* TSO Enable/Disable */
> + struct e1000_option opt = {
> + .type = enable_option,
> + .name = "TSO",
> + .err = "defaulting to Enabled",
> + .def = OPTION_ENABLED
> + };
> +
> + int tso = TSO[bd];
> + e1000_validate_option(&tso, &opt);
> + adapter->tso = tso;
> }
>
> switch(adapter->hw.media_type) {
>
On Mon, 14 Jul 2003 21:42:40 -0700
"Feldman, Scott" <[email protected]> wrote:
> > Note that I had to move the e1000_check_options() call to a
> > slighly earlier place. You may want to double-check that
> > it's really OK.
>
> I'm not too keen on adding another module parameter. Maybe a
> CONFIG_E1000_TSO option?
Extend ethtool please :-)
>>>>> On Mon, 14 Jul 2003 21:45:10 -0700, "David S. Miller" <[email protected]> said:
DaveM> On Mon, 14 Jul 2003 21:42:40 -0700 "Feldman, Scott"
DaveM> <[email protected]> wrote:
>> > Note that I had to move the e1000_check_options() call to a >
>> slighly earlier place. You may want to double-check that > it's
>> really OK.
>> I'm not too keen on adding another module parameter. Maybe a
>> CONFIG_E1000_TSO option?
DaveM> Extend ethtool please :-)
ethtool would be ideal, agreed.
I absolutely think that this should be a runtime option, not a
compile-time option.
--david
[Scott, somehow I never received your original response so I'm
replying based on what I saw in the linux.kernel newgroup...]
Scott> Do you have any data to share?
Sure, I don't see why not. Here are the number I got:
TSO disabled:
$ modprobe InterruptThrottleRate=0,0,0,0 TSO=0,0,0,0
$ netperf -l 30 -c -C -H foobar -- -s64K -S64K
TCP STREAM TEST to foobar
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
131070 131072 131072 30.00 897.16 34.07 35.00 3.111 3.196
TSO enabled:
$ modprobe InterruptThrottleRate=0,0,0,0 TSO=1,1,1,1
$ netperf -l 30 -c -C -H foobar -- -s64K -S64K
TCP STREAM TEST to foobar
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
131070 131072 131072 30.00 894.09 11.65 34.48 1.068 3.159
This looks roughly like you'd expect: with TSO, slightly lower
throughput but much less CPU overhead.
With ftp, things get stranger: fetching a 2GByte file via ftp get
(from the remote end):
TSO disabled:
ftp> get big.iso /dev/null
local: /dev/null remote: big.iso
200 PORT command successful.
150 Opening BINARY mode data connection for 'big.iso' (2038628352 bytes).
226 Transfer complete.
2038628352 bytes received in 18.17 secs (109554.5 kB/s)
ftp server CPU utilization: ~ 40%
With TSO enabled:
ftp> get big.iso /dev/null
local: /dev/null remote: big.iso
200 PORT command successful.
150 Opening BINARY mode data connection for 'big.iso' (2038628352 bytes).
226 Transfer complete.
2038628352 bytes received in 21.16 secs (94070.2 kB/s)
ftp server CPU utilization: ~ 15%
So we get almost 15% of throughput drop. This was with plain "netkit
fptd". AFAIK, it does a simple read/write loop (not sendfile()).
--david
On Mon, 14 Jul 2003 22:31:00 -0700
David Mosberger <[email protected]> wrote:
> With TSO enabled:
>
> ftp> get big.iso /dev/null
> local: /dev/null remote: big.iso
> 200 PORT command successful.
> 150 Opening BINARY mode data connection for 'big.iso' (2038628352 bytes).
> 226 Transfer complete.
> 2038628352 bytes received in 21.16 secs (94070.2 kB/s)
>
> ftp server CPU utilization: ~ 15%
>
> So we get almost 15% of throughput drop. This was with plain "netkit
> fptd". AFAIK, it does a simple read/write loop (not sendfile()).
When we use TSO for non-sendfile() applications it really
stresses memory allocations. We do these 64K+ kmalloc()'s
for each packet we construct.
But I don't think that's what is happening here, rather the PCI
controller is "talking" to the CPU's L2 cache with coherency
transactions on all the data of every packet going to the chip.
Whereas with a sendfile() type setup, the PCI controller is going
straight to main memory for the data part of the packets since the
CPU is unlikely to have each page cache page in it's L2 caches. In
the sendmsg() case, it's virtually guarenteed that the cpu will have
all the packet data in it's L2 cache in an unshared-modified state.
I know how this can be fixed, can you use L2-bypassing stores in
your csum_and_copy_from_user() and copy_from_user() implementations
like we do on sparc64? That would exactly eliminate this situation
where the card is talking to the cpu's L2 cache for all the data
during the PCI DMA transation on the send side.
>>>>> On Mon, 14 Jul 2003 22:38:22 -0700, "David S. Miller" <[email protected]> said:
DaveM> But I don't think that's what is happening here, rather the
DaveM> PCI controller is "talking" to the CPU's L2 cache with
DaveM> coherency transactions on all the data of every packet going
DaveM> to the chip.
That's true. But shouldn't it be true for both the TSO and non-TSO
case?
DaveM> Whereas with a sendfile() type setup, the PCI controller is
DaveM> going straight to main memory for the data part of the
DaveM> packets since the CPU is unlikely to have each page cache
DaveM> page in it's L2 caches.
But sendfile() was _not_ used in any of the tests. The ftp server
installed no the machine doesn't use it (not to my knowledge, at
least) and netperf only uses it for the SENDFILE test.
DaveM> I know how this can be fixed, can you use L2-bypassing stores
DaveM> in your csum_and_copy_from_user() and copy_from_user()
DaveM> implementations like we do on sparc64? That would exactly
DaveM> eliminate this situation where the card is talking to the
DaveM> cpu's L2 cache for all the data during the PCI DMA transation
DaveM> on the send side.
We could, but would it always be a win? Especially for
copy_from_user(). Most of the time, that data remains cached, so I
don't think we'd want to use non-temporal stores on those (in
general). csum_and_copy_from_user() isn't well optimized yet. Let's
see if I can find a volunteer... ;-)
--david
On Tue, 15 Jul 2003 16:01:55 -0700
David Mosberger <[email protected]> wrote:
> >>>>> On Mon, 14 Jul 2003 22:38:22 -0700, "David S. Miller" <[email protected]> said:
>
> DaveM> But I don't think that's what is happening here, rather the
> DaveM> PCI controller is "talking" to the CPU's L2 cache with
> DaveM> coherency transactions on all the data of every packet going
> DaveM> to the chip.
>
> That's true. But shouldn't it be true for both the TSO and non-TSO
> case?
The transfers are each longer in the TSO case, so need more
to transfer more data from the bus just to get _one_ of
the sub-packets of the large TSO frame out. It thus makes it
more likely they'll be a delay.
> DaveM> I know how this can be fixed, can you use L2-bypassing stores
> DaveM> in your csum_and_copy_from_user() and copy_from_user()
> DaveM> implementations like we do on sparc64? That would exactly
> DaveM> eliminate this situation where the card is talking to the
> DaveM> cpu's L2 cache for all the data during the PCI DMA transation
> DaveM> on the send side.
>
> We could, but would it always be a win? Especially for
> copy_from_user(). Most of the time, that data remains cached, so I
> don't think we'd want to use non-temporal stores on those (in
> general). csum_and_copy_from_user() isn't well optimized yet. Let's
> see if I can find a volunteer... ;-)
No, I mean "bypass L2 cache on miss" for stores. Don't
tell me IA64 doesn't have that? 8) I certainly didn't mean
"always bypass L2 cache" for stores :-)
>>>>> On Tue, 15 Jul 2003 18:39:11 -0700, "David S. Miller" <[email protected]> said:
>> We could, but would it always be a win? Especially for
>> copy_from_user(). Most of the time, that data remains cached, so
>> I don't think we'd want to use non-temporal stores on those (in
>> general). csum_and_copy_from_user() isn't well optimized yet.
>> Let's see if I can find a volunteer... ;-)
DaveM> No, I mean "bypass L2 cache on miss" for stores. Don't tell
DaveM> me IA64 doesn't have that? 8) I certainly didn't mean "always
DaveM> bypass L2 cache" for stores :-)
What I'm saying is that I almost always want copy_user() to put the
destination data in the cache, even if it isn't cached yet. Many
copy_user() calls are for for data structures that easily fit in the
cache and the data is usually used quickly afterwards.
As for cache-hints supported by IA64: the architecture supports
various non-temporal hints (non-temporal in 1st, 2nd, or all
cache-levels). How these hints are implemented depends on the chip.
On McKinley, non-temporal hints are generally implemented by storing
the data in the cache without updating the LRU info. So if the data
is already there, it will stay cached (until a victim is needed).
--david
On Tue, 15 Jul 2003 23:32:48 -0700
David Mosberger <[email protected]> wrote:
> DaveM> No, I mean "bypass L2 cache on miss" for stores. Don't tell
> DaveM> me IA64 doesn't have that? 8) I certainly didn't mean "always
> DaveM> bypass L2 cache" for stores :-)
>
> What I'm saying is that I almost always want copy_user() to put the
> destination data in the cache, even if it isn't cached yet.
No you don't :-)
If you miss, you do a bypass to main memory. Then when the
app asks for the data (if it even does at all, consider that)
it get's a clean copy in it's L2 cache.
Overall it's more efficient this way.
> Many copy_user() calls are for for data structures that
> easily fit in the cache and the data is usually used quickly afterwards.
Absolutely correct. We can't use the cache bypass-on-miss stores on
sparc64 unless the copy is at least a couple of cachelines in size.
It all works out, don't worry :-)
Hi,
> > So we get almost 15% of throughput drop. This was with plain "netkit
> > fptd". AFAIK, it does a simple read/write loop (not sendfile()).
We've been seeing rather variable results for TSO as well. With TSO off
netperf TCP_STREAM will hit line speed and stay there. With TSO on
some runs will hit line speed and others will be about 100Mbit/sec
slower.
> When we use TSO for non-sendfile() applications it really
> stresses memory allocations. We do these 64K+ kmalloc()'s
> for each packet we construct.
Yep we definitely noticed much more higher allocations when watching
/proc/slab. Playing around with slab tuning didnt seem to help.
Anton