(Added linux-kernel)
Ravikiran G Thirumalai <[email protected]> wrote:
>
> Hi Andrew,
> This patchset contains alloc_percpu + bigrefs + bigrefs for netdevice
> refcount. This patchset improves tbench lo performance by 6% on an 8-way IBM
> x445.
I think we'd need more comprehensive benchmarks than this before adding
large amounts of complex core code.
We'd also need to know how much of any performance improvement was due to
alloc_percpu versus bigrefs, please.
Bigrefs look reasonably sane to me, but a whole new memory allocator is a
big deal. Given that alloc_percpu() is already numa-aware, is that extra
cross-node fetch and pointer hop really worth all that new code? The new
version will have to do a tlb load (including a cross-node fetch)
approximately as often as the old version will get a CPU cache miss on the
percpu array, maybe?
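To make the pointer hop concrete: the existing dynamic per-cpu lookup is
essentially an array of per-node pointers indexed by cpu, while the proposed
allocator wants a pure offset calculation. A simplified sketch, in the spirit
of 2.6's include/linux/percpu.h (details approximate; per_cpu_ptr_off is a
made-up name for the offset-based variant):

/* Existing scheme: every access hops through pdata->ptrs[cpu], which can
 * itself be a cross-node cache miss. */
struct percpu_data {
	void *ptrs[NR_CPUS];
};

#define per_cpu_ptr(ptr, cpu)						\
({									\
	struct percpu_data *__p = (struct percpu_data *)(ptr);		\
	(__typeof__(ptr))__p->ptrs[(cpu)];				\
})

/* Offset-based scheme: cookie plus per-cpu offset gives the copy directly,
 * with no intermediate pointer array to miss on -- but the resulting
 * address may sit in a vmalloc-style mapping and cost a TLB load. */
#define per_cpu_ptr_off(ptr, cpu)					\
	((__typeof__(ptr))((char *)(ptr) + __per_cpu_offset[(cpu)]))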
From: Andrew Morton <[email protected]>
Date: Fri, 23 Sep 2005 00:10:13 -0700
> Ravikiran G Thirumalai <[email protected]> wrote:
> >
> > Hi Andrew,
> > This patchset contains alloc_percpu + bigrefs + bigrefs for netdevice
> > refcount. This patchset improves tbench lo performance by 6% on an 8-way IBM
> > x445.
>
> I think we'd need more comprehensive benchmarks than this before adding
> large amounts of complex core code.
I agree. tbench over loopback is about as far from real life
as it gets.
I'm still against expanding these networking datastructures with
bigrefs just for this stuff. Some people have per-cpu and per-node on
the brain, and it's starting to bloat things up a little bit too much.
On Fri, 2005-09-23 at 00:17 -0700, David S. Miller wrote:
> I'm still against expanding these networking datastructures with
> bigrefs just for this stuff. Some people have per-cpu and per-node on
> the brain, and it's starting to bloat things up a little bit too much.
I think for net devices it actually makes sense; most of the time we are
not trying to remove them, so the refcounting is simply overhead. We
also don't alloc and free them very often. The size issue is not really
an issue since we only map for each CPU, and even better: if a bigref
allocation can't get per-cpu data it just degrades beautifully into a
test and an atomic.
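The shape Rusty describes might look roughly like this (a sketch only;
struct bigref, its fields and bigref_get() are illustrative names, not
necessarily what the patchset actually calls them):

#include <linux/percpu.h>
#include <linux/smp.h>
#include <asm/atomic.h>

struct bigref {
	atomic_t	 slow_ref;	/* fallback / teardown counter */
	int		*ctrs;		/* per-cpu counters from alloc_percpu(), may be NULL */
};

static inline void bigref_get(struct bigref *ref)
{
	if (ref->ctrs) {
		/* fast path: bump this CPU's private counter; no shared
		 * cacheline is written and no locked operation is needed */
		int cpu = get_cpu();
		(*per_cpu_ptr(ref->ctrs, cpu))++;
		put_cpu();
	} else {
		/* per-cpu allocation failed at init time: degrade to a
		 * test plus an ordinary atomic increment */
		atomic_inc(&ref->slow_ref);
	}
}

The put side would presumably mirror this, with the release path switching
everyone onto slow_ref before the final free.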
Now, that said, I wanted (and wrote, way back when) a far simpler
allocator which only worked for GFP_KERNEL and used the same
__per_cpu_offset[] to fixup dynamic per-cpu ptrs as static ones. Maybe
not as "complete" as this one, but maybe less offensive.
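Presumably something along these lines (a minimal sketch, assuming a fixed
reserve is carved out of each CPU's per-cpu area at the same offset as the
static variables; everything here except __per_cpu_offset[] is a
hypothetical name):

#include <linux/spinlock.h>
#include <linux/kernel.h>
#include <linux/percpu.h>

/* Dynamic reserve assumed to follow the static per-cpu data in every
 * CPU's per-cpu area (hypothetical). */
#define PERCPU_DYN_RESERVE	(8 * 1024)

extern char __per_cpu_dyn_start[];	/* hypothetical start of that reserve */

static DEFINE_SPINLOCK(percpu_dyn_lock);
static size_t percpu_dyn_used;

/* Same fixup as a static per-cpu variable: cookie + this CPU's offset. */
#define dyn_per_cpu_ptr(ptr, cpu) \
	((__typeof__(ptr))((char *)(ptr) + __per_cpu_offset[(cpu)]))

void *simple_alloc_percpu(size_t size, size_t align)
{
	void *p = NULL;

	spin_lock(&percpu_dyn_lock);
	percpu_dyn_used = ALIGN(percpu_dyn_used, align);
	if (percpu_dyn_used + size <= PERCPU_DYN_RESERVE) {
		p = __per_cpu_dyn_start + percpu_dyn_used;
		percpu_dyn_used += size;
	}
	spin_unlock(&percpu_dyn_lock);
	return p;	/* NULL once the reserve is exhausted */
}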
Rusty.
--
A bad analogy is like a leaky screwdriver -- Richard Braakman
--- linux-2.6.14-rc2/include/linux/netdevice.h 2005-09-20 05:00:41.000000000 +0200
+++ linux-2.6.14-rc2-ed/include/linux/netdevice.h 2005-09-23 10:52:43.000000000 +0200
@@ -265,6 +265,8 @@
* the interface.
*/
char name[IFNAMSIZ];
+ /* device name hash chain */
+ struct hlist_node name_hlist;
/*
* I/O specific fields
@@ -292,6 +294,21 @@
/* ------- Fields preinitialized in Space.c finish here ------- */
+ /* Net device features */
+ unsigned long features;
+#define NETIF_F_SG 1 /* Scatter/gather IO. */
+#define NETIF_F_IP_CSUM 2 /* Can checksum only TCP/UDP over IPv4. */
+#define NETIF_F_NO_CSUM 4 /* Does not require checksum. F.e. loopack. */
+#define NETIF_F_HW_CSUM 8 /* Can checksum all the packets. */
+#define NETIF_F_HIGHDMA 32 /* Can DMA to high memory. */
+#define NETIF_F_FRAGLIST 64 /* Scatter/gather IO. */
+#define NETIF_F_HW_VLAN_TX 128 /* Transmit VLAN hw acceleration */
+#define NETIF_F_HW_VLAN_RX 256 /* Receive VLAN hw acceleration */
+#define NETIF_F_HW_VLAN_FILTER 512 /* Receive filtering on VLAN */
+#define NETIF_F_VLAN_CHALLENGED 1024 /* Device cannot handle VLAN packets */
+#define NETIF_F_TSO 2048 /* Can offload TCP/IP segmentation */
+#define NETIF_F_LLTX 4096 /* LockLess TX */
+
struct net_device *next_sched;
/* Interface index. Unique device identifier */
@@ -316,9 +333,6 @@
* will (read: may be cleaned up at will).
*/
- /* These may be needed for future network-power-down code. */
- unsigned long trans_start; /* Time (in jiffies) of last Tx */
- unsigned long last_rx; /* Time of last Rx */
unsigned short flags; /* interface flags (a la BSD) */
unsigned short gflags;
@@ -328,15 +342,12 @@
unsigned mtu; /* interface MTU value */
unsigned short type; /* interface hardware type */
unsigned short hard_header_len; /* hardware hdr length */
- void *priv; /* pointer to private data */
struct net_device *master; /* Pointer to master device of a group,
* which this device is member of.
*/
/* Interface address info. */
- unsigned char broadcast[MAX_ADDR_LEN]; /* hw bcast add */
- unsigned char dev_addr[MAX_ADDR_LEN]; /* hw address */
unsigned char perm_addr[MAX_ADDR_LEN]; /* permanent hw address */
unsigned char addr_len; /* hardware address length */
unsigned short dev_id; /* for shared network cards */
@@ -346,8 +357,6 @@
int promiscuity;
int allmulti;
- int watchdog_timeo;
- struct timer_list watchdog_timer;
/* Protocol specific pointers */
@@ -358,32 +367,62 @@
void *ec_ptr; /* Econet specific data */
void *ax25_ptr; /* AX.25 specific data */
- struct list_head poll_list; /* Link to poll list */
+/*
+ * Cache line mostly used on receive path (including eth_type_trans())
+ */
+ struct list_head poll_list ____cacheline_aligned_in_smp;
+ /* Link to poll list */
+
+ int (*poll) (struct net_device *dev, int *quota);
int quota;
int weight;
+ unsigned long last_rx; /* Time of last Rx */
+ /* Interface address info used in eth_type_trans() */
+ unsigned char dev_addr[MAX_ADDR_LEN]; /* hw address, (before bcast
+ because most packets are unicast) */
+
+ unsigned char broadcast[MAX_ADDR_LEN]; /* hw bcast add */
+/*
+ * Cache line mostly used on queue transmit path (qdisc)
+ */
+ /* device queue lock */
+ spinlock_t queue_lock ____cacheline_aligned_in_smp;
struct Qdisc *qdisc;
struct Qdisc *qdisc_sleeping;
- struct Qdisc *qdisc_ingress;
struct list_head qdisc_list;
unsigned long tx_queue_len; /* Max frames per queue allowed */
/* ingress path synchronizer */
spinlock_t ingress_lock;
+ struct Qdisc *qdisc_ingress;
+
+/*
+ * One part is mostly used on xmit path (device)
+ */
/* hard_start_xmit synchronizer */
- spinlock_t xmit_lock;
+ spinlock_t xmit_lock ____cacheline_aligned_in_smp;
/* cpu id of processor entered to hard_start_xmit or -1,
if nobody entered there.
*/
int xmit_lock_owner;
- /* device queue lock */
- spinlock_t queue_lock;
+ void *priv; /* pointer to private data */
+ int (*hard_start_xmit) (struct sk_buff *skb,
+ struct net_device *dev);
+ /* These may be needed for future network-power-down code. */
+ unsigned long trans_start; /* Time (in jiffies) of last Tx */
+
+ int watchdog_timeo; /* used by dev_watchdog() */
+ struct timer_list watchdog_timer;
+
+/*
+ * refcnt is a very hot point, so align it on SMP
+ */
/* Number of references to this device */
- atomic_t refcnt;
+ atomic_t refcnt ____cacheline_aligned_in_smp;
+
/* delayed register/unregister */
struct list_head todo_list;
- /* device name hash chain */
- struct hlist_node name_hlist;
/* device index hash chain */
struct hlist_node index_hlist;
@@ -396,21 +435,6 @@
NETREG_RELEASED, /* called free_netdev */
} reg_state;
- /* Net device features */
- unsigned long features;
-#define NETIF_F_SG 1 /* Scatter/gather IO. */
-#define NETIF_F_IP_CSUM 2 /* Can checksum only TCP/UDP over IPv4. */
-#define NETIF_F_NO_CSUM 4 /* Does not require checksum. F.e. loopack. */
-#define NETIF_F_HW_CSUM 8 /* Can checksum all the packets. */
-#define NETIF_F_HIGHDMA 32 /* Can DMA to high memory. */
-#define NETIF_F_FRAGLIST 64 /* Scatter/gather IO. */
-#define NETIF_F_HW_VLAN_TX 128 /* Transmit VLAN hw acceleration */
-#define NETIF_F_HW_VLAN_RX 256 /* Receive VLAN hw acceleration */
-#define NETIF_F_HW_VLAN_FILTER 512 /* Receive filtering on VLAN */
-#define NETIF_F_VLAN_CHALLENGED 1024 /* Device cannot handle VLAN packets */
-#define NETIF_F_TSO 2048 /* Can offload TCP/IP segmentation */
-#define NETIF_F_LLTX 4096 /* LockLess TX */
-
/* Called after device is detached from network. */
void (*uninit)(struct net_device *dev);
/* Called after last user reference disappears. */
@@ -419,10 +443,7 @@
/* Pointers to interface service routines. */
int (*open)(struct net_device *dev);
int (*stop)(struct net_device *dev);
- int (*hard_start_xmit) (struct sk_buff *skb,
- struct net_device *dev);
#define HAVE_NETDEV_POLL
- int (*poll) (struct net_device *dev, int *quota);
int (*hard_header) (struct sk_buff *skb,
struct net_device *dev,
unsigned short type,
Andrew Morton wrote:
> Eric Dumazet <[email protected]> wrote:
>
>> Please (re)consider this patch since it really reduces CPU load and/or memory
>> bus traffic between CPUs.
>
>
> Did you actually measure this? If so, you should publish the results.
>
>
Hi Andrew
Well yes I did.
Here is part of the mail I sent on netdev one month ago:
Hi David
I played with this and have very good results with the following patches.
# tc -s -d qdisc show dev eth0 ; sleep 10 ; tc -s -d qdisc show dev eth0
qdisc pfifo_fast 0: bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
Sent 440173135511 bytes 3211378293 pkt (dropped 240817, overlimits 0
requeues 27028035)
backlog 0b 0p requeues 27028035
qdisc pfifo_fast 0: bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
Sent 440216655667 bytes 3211730812 pkt (dropped 240904, overlimits 0
requeues 27031668)
backlog 0b 0p requeues 27031668
(So about 360 requeues per second, much better than before (12000 / second))
oprofile results give
0.6257 % qdisc_restart (instead of 2.6452 %)
thread is archived here :
http://marc.theaimsgroup.com/?l=linux-netdev&m=112521684415443&w=2
Thank you
Eric
On Fri, Sep 23, 2005 at 06:11:30PM +1000, Rusty Russell wrote:
> On Fri, 2005-09-23 at 00:17 -0700, David S. Miller wrote:
> > I'm still against expanding these networking datastructures with
> > bigrefs just for this stuff. Some people have per-cpu and per-node on
> > the brain, and it's starting to bloat things up a little bit too much.
>
> I think for net devices it actually makes sense; most of the time we are
> not trying to remove them, so the refcounting is simply overhead. We
> also don't alloc and free them very often. The size issue is not really
> an issue since we only map for each CPU, and even better: if a bigref
> allocation can't get per-cpu data it just degrades beautifully into a
> test and an atomic.
I agree, given that it is for a much smaller number of refcounted
objects.
>
> Now, that said, I wanted (and wrote, way back when) a far simpler
> allocator which only worked for GFP_KERNEL and used the same
> __per_cpu_offset[] to fixup dynamic per-cpu ptrs as static ones. Maybe
> not as "complete" as this one, but maybe less offensive.
The GFP_ATOMIC support stuff is needed only for dst entries. However,
using per-cpu refcounters in objects like dentries and dst entries is
problematic, and that is why I hadn't tried changing those.
Some of the earlier versions of the allocator were simpler, and I think
we need to roll this back and do some more analysis. No GFP_ATOMIC
support, no early use. I haven't gotten around to looking at this
for a while, but I will try.
Thanks
Dipankar
On Fri, Sep 23, 2005 at 12:10:13AM -0700, Andrew Morton wrote:
>
> (Added linux-kernel)
>
> Ravikiran G Thirumalai <[email protected]> wrote:
> >
> > Hi Andrew,
> > This patchset contains alloc_percpu + bigrefs + bigrefs for netdevice
> > refcount. This patchset improves tbench lo performance by 6% on an 8-way IBM
> > x445.
>
> I think we'd need more comprehensive benchmarks than this before adding
> large amounts of complex core code.
>
> We'd also need to know how much of any performance improvement was due to
> alloc_percpu versus bigrefs, please.
>
> Bigrefs look reasonably sane to me, but a whole new memory allocator is a
> big deal. Given that alloc_percpu() is already numa-aware, is that extra
> cross-node fetch and pointer hop really worth all that new code? The new
> version will have to do a tlb load (including a cross-node fetch)
> approximately as often as the old version will get a CPU cache miss on the
> percpu array, maybe?
>
>
Avoiding the extra pointer dereference with the re-implementation saved 10%
with the dst changes. With just the net_device refcount, there was a
difference of 2% against the existing implementation. So one extra
dereference does matter.
That said, the re-implementation does have other advantages:
1. It avoids the memory bloat of the existing implementation, since it does
not pad every object out to cacheline size. This helps small objects like
bigrefs. It also gives better cacheline utilization, as many per-cpu
objects can now share the same cacheline (see the sketch after this list).
2. It is hotplug friendly. In case you want to offline a node, you can
easily free the per-cpu memory corresponding to that node. You cannot do
that with the current scheme.
3. It can work early. We experimented with the struct vfsmount.mnt_count
refcounter (it is per mount point, so there are not too many objects), and
the old alloc_percpu refused to work early. Yes, there are workarounds to
set cpu_possible_mask bits for NR_CPUS early, but wouldn't that result in
memory being allocated for cpus which can never be present?
4. With the re-implementation, it might be possible to make often-used
structures like cpu_pda truly node local with alloc_percpu.
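A rough illustration of point 1, with assumed sizes rather than measured
ones (128-byte lines, 8 cpus; the struct names are made up):

#define EXAMPLE_LINE	128	/* assumed cache line size */
#define EXAMPLE_CPUS	8	/* e.g. the 8-way x445 above */

/* Padded layout: each per-cpu copy of a 4-byte counter is blown up to a
 * full line, so one counter costs EXAMPLE_LINE * EXAMPLE_CPUS = 1024 bytes. */
struct padded_ctr {
	int count;
	char pad[EXAMPLE_LINE - sizeof(int)];
};

/* Packed layout: counters belonging to different objects sit back to back
 * inside each CPU's block, so one counter costs only
 * sizeof(int) * EXAMPLE_CPUS = 32 bytes, and 32 counters share a line. */
struct packed_ctrs {
	int count[EXAMPLE_LINE / sizeof(int)];	/* 32 counters per line */
};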
As for the benchmarks, tbench on lo was used indicatively. lo performance
does matter for quite a few benchmarks. There are apps which do use lo
extensively. The dst and netdevice changes were made after profiling such
real world apps. Agreed, per-cpufication of objects whose numbers can grow
large is not the right approach in hindsight, but netdevice.refcount is not
one of those. I can try running a standard MPI benchmark or some
other indicative benchmark if that would help?
Thanks,
Kiran
On Fri, Sep 23, 2005 at 05:06:36PM +0530, Dipankar Sarma wrote:
> On Fri, Sep 23, 2005 at 06:11:30PM +1000, Rusty Russell wrote:
> > On Fri, 2005-09-23 at 00:17 -0700, David S. Miller wrote:
> > Now, that said, I wanted (and wrote, way back when) a far simpler
> > allocator which only worked for GFP_KERNEL and used the same
> > __per_cpu_offset[] to fixup dynamic per-cpu ptrs as static ones. Maybe
> > not as "complete" as this one, but maybe less offensive.
>
> The GFP_ATOMIC support stuff is needed only for dst entries. However,
> using per-cpu refcounters in objects like dentries and dst entries is
> problematic, and that is why I hadn't tried changing those.
> Some of the earlier versions of the allocator were simpler, and I think
> we need to roll this back and do some more analysis. No GFP_ATOMIC
> support, no early use. I haven't gotten around to looking at this
The patches I have submitted this time do not have GFP_ATOMIC support. The
early-use enablement stuff is in separate patches (patches 5 and 6). The
patchset works without those two patches too.
Thanks,
Kiran
From: Ravikiran G Thirumalai <[email protected]>
Date: Fri, 23 Sep 2005 11:40:52 -0700
> As for the benchmarks, tbench on lo was used indicatively. lo performance
> does matter for quite a few benchmarks. There are apps which do use lo
> extensively. The dst and netdevice changes were made after profiling such
> real world apps. Agreed, per-cpufication of objects whose numbers can grow
> large is not the right approach in hindsight, but netdevice.refcount is not
> one of those. I can try running a standard MPI benchmark or some
> other indicative benchmark if that would help?
I worry about real life sites, such as a big web server, that will by
definition have hundreds of thousands of routing cache (and thus
'dst') entries active.
The memory usage will increase, and that's particularly bad in this
kind of case because unlike the 'lo' benchmarks you won't have nodes
and cpus fighting over access to the same routes. In such a load
the bigrefs are just wasted memory and aren't actually needed.
I really would like to encourage a move away from this fascination
with optimizing loopback interface performance on enormous
systems, yes, even if it is hit hard by the benchmarks. It just
means the benchmarks are wrong, not that we should optimize for
them.
On Fri, 23 Sep 2005, Andrew Morton wrote:
> big deal. Given that alloc_percpu() is already numa-aware, is that extra
> cross-node fetch and pointer hop really worth all that new code? The new
> version will have to do a tlb load (including a cross-node fetch)
> approximately as often as the old version will get a CPU cache miss on the
> percpu array, maybe?
The current alloc_percpu() is problematic because it has to allocate
entries for all cpus, even those that are not online yet. There is no way to
track alloc_percpu entries, and therefore no way to add an entry for a
processor when it comes online or to remove one when a processor goes
away.
An additional complication is the allocation of per-cpu entries for
processors whose memory node is not online. The current
implementation will fall back on a node-unspecific allocation, but this
means that the per-cpu entries will not be placed on the node of the
processor!
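For reference, the allocation side of the implementation Christoph is
describing looks roughly like this (a simplified sketch in the spirit of
mm/slab.c of that era, with an invented function name; details approximate):

#include <linux/slab.h>
#include <linux/string.h>
#include <linux/cpumask.h>

struct percpu_data {
	void *ptrs[NR_CPUS];
};

void *alloc_percpu_sketch(size_t size)
{
	int i;
	struct percpu_data *pdata = kmalloc(sizeof(*pdata), GFP_KERNEL);

	if (!pdata)
		return NULL;

	for (i = 0; i < NR_CPUS; i++) {
		if (!cpu_possible(i))
			continue;
		/* One copy for every *possible* cpu, online or not, and
		 * nothing is recorded that would let a hotplug notifier
		 * add or free copies later. */
		pdata->ptrs[i] = kmalloc_node(size, GFP_KERNEL,
					      cpu_to_node(i));
		/* If cpu i's node has no memory of its own, this lands
		 * wherever the allocator falls back to, i.e. not local
		 * to the processor. */
		if (!pdata->ptrs[i])
			goto unwind;
		memset(pdata->ptrs[i], 0, size);
	}
	return pdata;	/* the real code hides this behind an obfuscated cookie */

unwind:
	while (--i >= 0)
		if (cpu_possible(i))
			kfree(pdata->ptrs[i]);
	kfree(pdata);
	return NULL;
}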
On Fri, Sep 23, 2005 at 11:50:58AM -0700, David S. Miller wrote:
> From: Ravikiran G Thirumalai <[email protected]>
> Date: Fri, 23 Sep 2005 11:40:52 -0700
>
>
> I worry about real life sites, such as a big web server, that will by
> definition have hundreds of thousands of routing cache (and thus
> 'dst') entries active.
>
> The memory usage will increase, and that's particularly bad in this
> kind of case because unlike the 'lo' benchmarks you won't have nodes
> and cpus fighting over access to the same routes. In such a load
> the bigrefs are just wasted memory and aren't actually needed.
Point taken. That is the reason I have excluded the dst patches from this
patchset. The problem with dst.__refcount remains, and we should probably
look for some other approach there rather than per-cpu/per-node counters.
But the patch in question now is the net_device refcount. Surely that
doesn't affect webservers with many dst entries.
>
> I really would like to encourage a move away from this fascination
> with optimizating the loopback interface performance on enormous
> systems, yes even if it is hit hard by the benchmarks. It just
> means the benchmarks are wrong, not that we should optimize for
> them.
Benchmarks could be wrong, but we don't control what people run.
And there are apps which use lo (for whatever reason) :(.
Thanks,
Kiran