2005-09-20 21:45:09

by Eric Dumazet

Subject: [PATCH] Adds sys_set_mempolicy() in include/linux/syscalls.h , Re: [PATCH, netfilter] NUMA aware ipv4/netfilter/ip_tables.c

Andi Kleen wrote:
> On Tuesday 20 September 2005 11:47, Eric Dumazet wrote:
>
>
>
> I would prefer if random code didn't mess with mempolicy internals
> like this. Better just call sys_set_mempolicy()
>
> -Andi
>
>


OK, but this prior patch seems necessary:

- Adds sys_set_mempolicy() in include/linux/syscalls.h

Signed-off-by: Eric Dumazet <[email protected]>


2005-09-20 21:46:16

by Eric Dumazet

Subject: [PATCH] Adds sys_set_mempolicy() in include/linux/syscalls.h

--- linux-2.6/include/linux/syscalls.h 2005-09-06 01:20:21.000000000 +0200
+++ linux-2.6-ed/include/linux/syscalls.h 2005-09-20 23:43:07.000000000 +0200
@@ -508,5 +508,7 @@

asmlinkage long sys_ioprio_set(int which, int who, int ioprio);
asmlinkage long sys_ioprio_get(int which, int who);
+asmlinkage long sys_set_mempolicy(int mode, unsigned long __user *nmask,
+ unsigned long maxnode);

#endif


Attachments:
patch_add_sys_set_mempolicy (407.00 B)

2005-09-21 21:25:05

by Eric Dumazet

Subject: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance

Hi

I have reworked net/ipv4/ip_tables.c to boost its performance, and am posting
three patches.

In oprofile results, ipt_do_table() used to be in first position.
It is now in 6th position, using 1/3 of the CPU it was using before.
(Tests done on a dual Xeon i386 and a dual Opteron x86_64.)

Short description:

1) No more central rwlock protecting each table (filter, nat, mangle, raw),
but one lock per CPU. This avoids cache line ping-pong for each packet.

2) Loop unrolling and various code optimizations to reduce branch
mispredictions.

3) NUMA-aware allocation of memory (this part was posted earlier, but had some
polishing problems).


Long description:

1) No more a single rwlock_t protecting the 'curtain'

One major bottleneck on SMP machines is the fact that one rwlock
is taken at ipt_do_table() entry and exit. Those two atomic operations are
the killer, and even if multiple readers can work together on the same table
(using read_lock_bh()), the cache line containing the rwlock must still be
taken exclusively by each CPU at entry/exit of ipt_do_table().

As the existing code already uses separate copies of the data for each CPU, it
is very easy to convert the central rwlock to separate rwlocks, allocated
percpu (and NUMA aware).

When a CPU enters ipt_do_table(), it locks its local copy, touching a
cache line that is not used by other CPUs. Further operations are done on
the 'local' copy of the data.

When a sum of all counters must be done, we can write_lock one part at a
time, instead of locking all the parts at once, reducing lock contention.
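
A reduced sketch of the pattern (the locking calls are the real kernel ones,
but the surrounding structure is simplified; the actual change is in patch 1/3
below):

/* Needs <linux/spinlock.h>, <linux/percpu.h>, <linux/smp.h>. */

/* Fast path: only touch the rwlock that belongs to the local CPU. */
static void packet_fast_path(struct ipt_table *table)
{
	rwlock_t *lock_p;

	/* read_lock_bh() expanded by hand, because we need the address
	 * of this CPU's lock */
	local_bh_disable();
	preempt_disable();
	lock_p = per_cpu_ptr(table->lock_p, smp_processor_id());
	_raw_read_lock(lock_p);		/* cache line stays local to this CPU */

	/* ... match the packet against this CPU's copy of the rules ... */

	read_unlock_bh(lock_p);		/* also re-enables BHs and preemption */
}

/* Slow path: sum the counters, taking one per-CPU lock at a time. */
static void sum_counters(struct ipt_table *table)
{
	int cpu;

	for_each_cpu(cpu) {
		write_lock_bh(per_cpu_ptr(table->lock_p, cpu));
		/* ... accumulate this CPU's counters ... */
		write_unlock_bh(per_cpu_ptr(table->lock_p, cpu));
	}
}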

2) Loop unrolling

It seems that with current compilers and CFLAGS, the code generated for
ip_packet_match() is very bad, using a lot of mispredicted conditional
branches. With these changes, the generated code on i386 and x86_64
is much better.
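
As a reduced illustration (taken from the ip_packet_match() change in patch
2/3 below), the rewrite turns the short-circuit test:

	/* before: the || forces one conditional branch per test */
	if (FWINV((ip->saddr & ipinfo->smsk.s_addr) != ipinfo->src.s_addr,
		  IPT_INV_SRCIP)
	    || FWINV((ip->daddr & ipinfo->dmsk.s_addr) != ipinfo->dst.s_addr,
		     IPT_INV_DSTIP))
		return 0;

into two unconditionally computed booleans and a single test:

	bool1 = ((ip->saddr & ipinfo->smsk.s_addr) != ipinfo->src.s_addr);
	bool1 ^= !!(ipinfo->invflags & IPT_INV_SRCIP);

	bool2 = ((ip->daddr & ipinfo->dmsk.s_addr) != ipinfo->dst.s_addr);
	bool2 ^= !!(ipinfo->invflags & IPT_INV_DSTIP);

	if ((bool1 | bool2) != 0)	/* only one conditional branch left */
		return 0;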

3) NUMA allocation.

Part of the performance problem we have with netfilter is that memory
allocation is not NUMA aware, but 'only' SMP aware (i.e. each CPU normally
touches separate cache lines).

Even with small iptables rules, the cost of this misplacement can be high
on common workloads.

Instead of using one vmalloc() area (located in the node of the iptables
process), we now allocate an area for each possible CPU, using NUMA policy
(MPOL_PREFERRED) so that memory should be allocated in the CPU's node
if possible.

If the size of the ipt_table is small enough (less than one page), we use
kmalloc_node() instead of vmalloc(), to use less memory and fewer TLB entries
in small setups.
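
A condensed sketch of the per-CPU allocation (the full version, including the
temporary mempolicy switch for the vmalloc() case, is alloc_table_info() in
patch 3/3 below; the _sketch name is only for illustration):

static struct ipt_table_info *alloc_table_info_sketch(unsigned int size)
{
	struct ipt_table_info *newinfo;
	int cpu;

	newinfo = kzalloc(sizeof(struct ipt_table_info), GFP_KERNEL);
	if (!newinfo)
		return NULL;
	newinfo->size = size;

	for_each_cpu(cpu) {
		if (size <= PAGE_SIZE)
			/* small table: node-local slab memory,
			 * no extra TLB entries */
			newinfo->entries[cpu] = kmalloc_node(size, GFP_KERNEL,
							     cpu_to_node(cpu));
		else
			/* large table: vmalloc() while an MPOL_PREFERRED
			 * policy for cpu_to_node(cpu) is installed (elided
			 * here), so pages come from that node if possible */
			newinfo->entries[cpu] = vmalloc(size);

		if (!newinfo->entries[cpu]) {
			/* frees the copies already allocated */
			free_table_info(newinfo);
			return NULL;
		}
	}
	return newinfo;
}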

Note 1: I also optimized get_counters(), using SET_COUNTER() for the first
CPU (avoiding a memset()) and ADD_COUNTER() on the other CPUs if SMP.

Note 2: This patch depends on another patch that declares sys_set_mempolicy()
in include/linux/syscalls.h
( http://marc.theaimsgroup.com/?l=linux-kernel&m=112725288622984&w=2 )


Thank you
Eric Dumazet

2005-09-21 21:29:17

by Eric Dumazet

Subject: [PATCH 1/3] netfilter : 3 patches to boost ip_tables performance

--- linux-2.6/include/linux/netfilter_ipv4/ip_tables.h 2005-09-13 05:12:09.000000000 +0200
+++ linux-2.6-ed/include/linux/netfilter_ipv4/ip_tables.h 2005-09-21 17:42:07.000000000 +0200
@@ -445,7 +445,11 @@
unsigned int valid_hooks;

/* Lock for the curtain */
+#ifdef CONFIG_SMP
+ rwlock_t *lock_p; /* one lock per cpu */
+#else
rwlock_t lock;
+#endif

/* Man behind the curtain... */
struct ipt_table_info *private;
--- linux-2.6.14-rc1.p1/net/ipv4/netfilter/ip_nat_rule.c 2005-09-13 05:12:09.000000000 +0200
+++ linux-2.6.14-rc1/net/ipv4/netfilter/ip_nat_rule.c 2005-09-21 17:44:01.000000000 +0200
@@ -93,7 +93,6 @@
static struct ipt_table nat_table = {
.name = "nat",
.valid_hooks = NAT_VALID_HOOKS,
- .lock = RW_LOCK_UNLOCKED,
.me = THIS_MODULE,
};

--- linux-2.6.14-rc1.p1/net/ipv4/netfilter/iptable_filter.c 2005-09-13 05:12:09.000000000 +0200
+++ linux-2.6.14-rc1/net/ipv4/netfilter/iptable_filter.c 2005-09-21 17:45:20.000000000 +0200
@@ -77,7 +77,6 @@
static struct ipt_table packet_filter = {
.name = "filter",
.valid_hooks = FILTER_VALID_HOOKS,
- .lock = RW_LOCK_UNLOCKED,
.me = THIS_MODULE
};

--- linux-2.6.14-rc1.p1/net/ipv4/netfilter/iptable_mangle.c 2005-09-13 05:12:09.000000000 +0200
+++ linux-2.6.14-rc1/net/ipv4/netfilter/iptable_mangle.c 2005-09-21 17:45:20.000000000 +0200
@@ -107,7 +107,6 @@
static struct ipt_table packet_mangler = {
.name = "mangle",
.valid_hooks = MANGLE_VALID_HOOKS,
- .lock = RW_LOCK_UNLOCKED,
.me = THIS_MODULE,
};

--- linux-2.6.14-rc1.p1/net/ipv4/netfilter/iptable_raw.c 2005-09-13 05:12:09.000000000 +0200
+++ linux-2.6.14-rc1/net/ipv4/netfilter/iptable_raw.c 2005-09-21 17:45:20.000000000 +0200
@@ -82,7 +82,6 @@
static struct ipt_table packet_raw = {
.name = "raw",
.valid_hooks = RAW_VALID_HOOKS,
- .lock = RW_LOCK_UNLOCKED,
.me = THIS_MODULE
};

--- linux-2.6.14-rc1.p1/net/ipv4/netfilter/ip_tables.c 2005-09-19 19:56:12.000000000 +0200
+++ linux-2.6.14-rc1/net/ipv4/netfilter/ip_tables.c 2005-09-21 23:49:13.000000000 +0200
@@ -110,6 +110,7 @@
static LIST_HEAD(ipt_target);
static LIST_HEAD(ipt_match);
static LIST_HEAD(ipt_tables);
+#define SET_COUNTER(c,b,p) do { (c).bcnt = (b); (c).pcnt = (p); } while(0)
#define ADD_COUNTER(c,b,p) do { (c).bcnt += (b); (c).pcnt += (p); } while(0)

#ifdef CONFIG_SMP
@@ -273,6 +274,9 @@
const char *indev, *outdev;
void *table_base;
struct ipt_entry *e, *back;
+#ifdef CONFIG_SMP
+ rwlock_t *lock_p;
+#endif

/* Initialization */
ip = (*pskb)->nh.iph;
@@ -287,7 +291,18 @@
* match it. */
offset = ntohs(ip->frag_off) & IP_OFFSET;

+#ifdef CONFIG_SMP
+ /*
+ * We expand read_lock_bh() here because we need to get
+ * the address of this cpu rwlock_t
+ */
+ local_bh_disable();
+ preempt_disable();
+ lock_p = per_cpu_ptr(table->lock_p, smp_processor_id());
+ _raw_read_lock(lock_p);
+#else
read_lock_bh(&table->lock);
+#endif
IP_NF_ASSERT(table->valid_hooks & (1 << hook));
table_base = (void *)table->private->entries
+ TABLE_OFFSET(table->private, smp_processor_id());
@@ -396,7 +411,11 @@
#ifdef CONFIG_NETFILTER_DEBUG
((struct ipt_entry *)table_base)->comefrom = 0xdead57ac;
#endif
+#ifdef CONFIG_SMP
+ read_unlock_bh(lock_p);
+#else
read_unlock_bh(&table->lock);
+#endif

#ifdef DEBUG_ALLOW_ALL
return NF_ACCEPT;
@@ -930,6 +949,30 @@
return ret;
}

+static void ipt_table_global_lock(struct ipt_table *table)
+{
+#ifdef CONFIG_SMP
+ int cpu;
+ for_each_cpu(cpu) {
+ write_lock_bh(per_cpu_ptr(table->lock_p, cpu));
+ }
+#else
+ write_lock_bh(&table->lock);
+#endif
+}
+
+static void ipt_table_global_unlock(struct ipt_table *table)
+{
+#ifdef CONFIG_SMP
+ int cpu;
+ for_each_cpu(cpu) {
+ write_unlock_bh(per_cpu_ptr(table->lock_p, cpu));
+ }
+#else
+ write_unlock_bh(&table->lock);
+#endif
+}
+
static struct ipt_table_info *
replace_table(struct ipt_table *table,
unsigned int num_counters,
@@ -954,24 +997,25 @@
#endif

/* Do the substitution. */
- write_lock_bh(&table->lock);
+ ipt_table_global_lock(table);
/* Check inside lock: is the old number correct? */
if (num_counters != table->private->number) {
duprintf("num_counters != table->private->number (%u/%u)\n",
num_counters, table->private->number);
- write_unlock_bh(&table->lock);
+ ipt_table_global_unlock(table);
*error = -EAGAIN;
return NULL;
}
oldinfo = table->private;
table->private = newinfo;
newinfo->initial_entries = oldinfo->initial_entries;
- write_unlock_bh(&table->lock);
+ ipt_table_global_unlock(table);

return oldinfo;
}

/* Gets counters. */
+#ifdef CONFIG_SMP
static inline int
add_entry_to_counter(const struct ipt_entry *e,
struct ipt_counters total[],
@@ -982,22 +1026,71 @@
(*i)++;
return 0;
}
+#endif
+static inline int
+set_entry_to_counter(const struct ipt_entry *e,
+ struct ipt_counters total[],
+ unsigned int *i)
+{
+ SET_COUNTER(total[*i], e->counters.bcnt, e->counters.pcnt);
+
+ (*i)++;
+ return 0;
+}

+/*
+ * if table is NULL, no locking should be done
+ */
static void
-get_counters(const struct ipt_table_info *t,
+get_counters(struct ipt_table *table,
+ const struct ipt_table_info *t,
struct ipt_counters counters[])
{
- unsigned int cpu;
unsigned int i;
+#ifdef CONFIG_SMP
+ unsigned int cpu;
+ unsigned int curcpu;
+
+ curcpu = get_cpu();
+
+ if (table)
+ write_lock_bh(per_cpu_ptr(table->lock_p, curcpu));
+ i = 0;
+ IPT_ENTRY_ITERATE(t->entries + TABLE_OFFSET(t, curcpu),
+ t->size,
+ set_entry_to_counter,
+ counters,
+ &i);
+ if (table)
+ write_unlock_bh(per_cpu_ptr(table->lock_p, curcpu));

- for (cpu = 0; cpu < num_possible_cpus(); cpu++) {
+ for_each_cpu(cpu) {
+ if (cpu == curcpu)
+ continue;
+ if (table)
+ write_lock_bh(per_cpu_ptr(table->lock_p, cpu));
i = 0;
IPT_ENTRY_ITERATE(t->entries + TABLE_OFFSET(t, cpu),
t->size,
add_entry_to_counter,
counters,
&i);
+ if (table)
+ write_unlock_bh(per_cpu_ptr(table->lock_p, cpu));
}
+ put_cpu();
+#else
+ if (table)
+ write_lock_bh(&table->lock);
+ i = 0;
+ IPT_ENTRY_ITERATE(t->entries,
+ t->size,
+ set_entry_to_counter,
+ counters,
+ &i);
+ if (table)
+ write_unlock_bh(&table->lock);
+#endif
}

static int
@@ -1020,10 +1113,7 @@
return -ENOMEM;

/* First, sum counters... */
- memset(counters, 0, countersize);
- write_lock_bh(&table->lock);
- get_counters(table->private, counters);
- write_unlock_bh(&table->lock);
+ get_counters(table, table->private, counters);

/* ... then copy entire thing from CPU 0... */
if (copy_to_user(userptr, table->private->entries, total_size) != 0) {
@@ -1183,7 +1273,7 @@
module_put(t->me);

/* Get the old counters. */
- get_counters(oldinfo, counters);
+ get_counters(NULL, oldinfo, counters);
/* Decrease module usage counts and free resource */
IPT_ENTRY_ITERATE(oldinfo->entries, oldinfo->size, cleanup_entry,NULL);
vfree(oldinfo);
@@ -1235,6 +1325,9 @@
struct ipt_counters_info tmp, *paddc;
struct ipt_table *t;
int ret = 0;
+#ifdef CONFIG_SMP
+ rwlock_t *lockp;
+#endif

if (copy_from_user(&tmp, user, sizeof(tmp)) != 0)
return -EFAULT;
@@ -1257,7 +1350,12 @@
goto free;
}

+#ifdef CONFIG_SMP
+ lockp = per_cpu_ptr(t->lock_p, get_cpu());
+ write_lock_bh(lockp);
+#else
write_lock_bh(&t->lock);
+#endif
if (t->private->number != paddc->num_counters) {
ret = -EINVAL;
goto unlock_up_free;
@@ -1270,7 +1368,12 @@
paddc->counters,
&i);
unlock_up_free:
+#ifdef CONFIG_SMP
+ write_unlock_bh(lockp);
+ put_cpu();
+#else
write_unlock_bh(&t->lock);
+#endif
up(&ipt_mutex);
module_put(t->me);
free:
@@ -1456,7 +1559,17 @@
struct ipt_table_info *newinfo;
static struct ipt_table_info bootstrap
= { 0, 0, 0, { 0 }, { 0 }, { } };
-
+#ifdef CONFIG_SMP
+ int cpu;
+ if (!table->lock_p) {
+ if ((table->lock_p = alloc_percpu(rwlock_t)) == NULL)
+ return -ENOMEM;
+ }
+ for_each_cpu(cpu)
+ rwlock_init(per_cpu_ptr(table->lock_p, cpu));
+#else
+ rwlock_init(&table->lock);
+#endif
newinfo = vmalloc(sizeof(struct ipt_table_info)
+ SMP_ALIGN(repl->size) * num_possible_cpus());
if (!newinfo)
@@ -1497,7 +1610,6 @@
/* save number of initial entries */
table->private->initial_entries = table->private->number;

- rwlock_init(&table->lock);
list_prepend(&ipt_tables, table);

unlock:
@@ -1519,6 +1631,10 @@
IPT_ENTRY_ITERATE(table->private->entries, table->private->size,
cleanup_entry, NULL);
vfree(table->private);
+#ifdef CONFIG_SMP
+ free_percpu(table->lock_p);
+ table->lock_p = NULL;
+#endif
}

/* Returns 1 if the port is matched by the range, 0 otherwise */


Attachments:
patch_ip_tables_numa.1 (8.45 kB)

2005-09-21 21:32:31

by Eric Dumazet

Subject: [PATCH 2/3] netfilter : 3 patches to boost ip_tables performance

--- linux-2.6/net/ipv4/netfilter/ip_tables.c 2005-09-21 23:55:30.000000000 +0200
+++ linux-2.6-ed/net/ipv4/netfilter/ip_tables.c 2005-09-22 00:39:29.000000000 +0200
@@ -125,6 +125,27 @@
#define up(x) do { printk("UP:%u:" #x "\n", __LINE__); up(x); } while(0)
#endif

+/*
+ * Special macro to compare IFNAMSIZ 'strings', with mask.
+ * Best results if fed with 3 args pointing to 'unsigned long' types
+ * Loop is manually unrolled for performance reasons.
+ * gcc generates code without branches, IFNAMSIZ being a constant
+ */
+#define compare_if_lstrings(ret, devname, expect, expect_mask) \
+ ret = ((devname)[0] ^ (expect)[0]) & (expect_mask)[0]; \
+ if (IFNAMSIZ > sizeof(*devname)) \
+ ret |= ((devname)[1] ^ (expect)[1]) & (expect_mask)[1]; \
+ if (IFNAMSIZ > 2 * sizeof(*devname)) \
+ ret |= ((devname)[2] ^ (expect)[2]) & (expect_mask)[2]; \
+ if (IFNAMSIZ > 3 * sizeof(*devname)) \
+ ret |= ((devname)[3] ^ (expect)[3]) & (expect_mask)[3]; \
+ /* just in case IFNAMSIZ is enlarged */ \
+ if (IFNAMSIZ > 4 * sizeof(*devname)) { \
+ int i; \
+ for (i = 4 ; (i < IFNAMSIZ/sizeof(*devname)); i++) \
+ ret |= ((devname)[i] ^ (expect)[i]) & (expect_mask)[i]; \
+ }
+
/* Returns whether matches rule or not. */
static inline int
ip_packet_match(const struct iphdr *ip,
@@ -133,16 +154,22 @@
const struct ipt_ip *ipinfo,
int isfrag)
{
- size_t i;
+ int bool1, bool2;
unsigned long ret;

#define FWINV(bool,invflg) ((bool) ^ !!(ipinfo->invflags & invflg))

- if (FWINV((ip->saddr&ipinfo->smsk.s_addr) != ipinfo->src.s_addr,
- IPT_INV_SRCIP)
- || FWINV((ip->daddr&ipinfo->dmsk.s_addr) != ipinfo->dst.s_addr,
- IPT_INV_DSTIP)) {
- dprintf("Source or dest mismatch.\n");
+ bool1 = ((ip->saddr&ipinfo->smsk.s_addr) != ipinfo->src.s_addr);
+ bool1 ^= !!(ipinfo->invflags & IPT_INV_SRCIP);
+
+ bool2 = ((ip->daddr&ipinfo->dmsk.s_addr) != ipinfo->dst.s_addr);
+ bool2 ^= !!(ipinfo->invflags & IPT_INV_DSTIP);
+
+ if ((bool1 | bool2) != 0) {
+ if (bool1)
+ dprintf("Source%s mismatch.\n", bool2 ? " and Dest":"");
+ else
+ dprintf("Dest mismatch.\n");

dprintf("SRC: %u.%u.%u.%u. Mask: %u.%u.%u.%u. Target: %u.%u.%u.%u.%s\n",
NIPQUAD(ip->saddr),
@@ -157,27 +184,26 @@
return 0;
}

- /* Look for ifname matches; this should unroll nicely. */
- for (i = 0, ret = 0; i < IFNAMSIZ/sizeof(unsigned long); i++) {
- ret |= (((const unsigned long *)indev)[i]
- ^ ((const unsigned long *)ipinfo->iniface)[i])
- & ((const unsigned long *)ipinfo->iniface_mask)[i];
- }
+ compare_if_lstrings(ret,
+ (const unsigned long *)indev,
+ (const unsigned long *)ipinfo->iniface,
+ (const unsigned long *)ipinfo->iniface_mask);

- if (FWINV(ret != 0, IPT_INV_VIA_IN)) {
+ bool1 = FWINV(ret != 0, IPT_INV_VIA_IN);
+ if (bool1) {
dprintf("VIA in mismatch (%s vs %s).%s\n",
indev, ipinfo->iniface,
ipinfo->invflags&IPT_INV_VIA_IN ?" (INV)":"");
return 0;
}

- for (i = 0, ret = 0; i < IFNAMSIZ/sizeof(unsigned long); i++) {
- ret |= (((const unsigned long *)outdev)[i]
- ^ ((const unsigned long *)ipinfo->outiface)[i])
- & ((const unsigned long *)ipinfo->outiface_mask)[i];
- }
+ compare_if_lstrings(ret,
+ (const unsigned long *)outdev,
+ (const unsigned long *)ipinfo->outiface,
+ (const unsigned long *)ipinfo->outiface_mask);

- if (FWINV(ret != 0, IPT_INV_VIA_OUT)) {
+ bool1 = FWINV(ret != 0, IPT_INV_VIA_OUT);
+ if (bool1) {
dprintf("VIA out mismatch (%s vs %s).%s\n",
outdev, ipinfo->outiface,
ipinfo->invflags&IPT_INV_VIA_OUT ?" (INV)":"");
@@ -185,8 +211,9 @@
}

/* Check specific protocol */
- if (ipinfo->proto
- && FWINV(ip->protocol != ipinfo->proto, IPT_INV_PROTO)) {
+ bool1 = FWINV(ip->protocol != ipinfo->proto, IPT_INV_PROTO) ;
+ bool1 &= (ipinfo->proto != 0);
+ if (bool1) {
dprintf("Packet protocol %hi does not match %hi.%s\n",
ip->protocol, ipinfo->proto,
ipinfo->invflags&IPT_INV_PROTO ? " (INV)":"");


Attachments:
patch_ip_tables_numa.2 (3.87 kB)

2005-09-21 21:37:27

by Eric Dumazet

Subject: [PATCH 3/3] netfilter : 3 patches to boost ip_tables performance

--- linux-2.6/net/ipv4/netfilter/ip_tables.c 2005-09-22 00:44:34.000000000 +0200
+++ linux-2.6-ed/net/ipv4/netfilter/ip_tables.c 2005-09-22 00:57:15.000000000 +0200
@@ -17,6 +17,8 @@
#include <linux/skbuff.h>
#include <linux/kmod.h>
#include <linux/vmalloc.h>
+#include <linux/mempolicy.h>
+#include <linux/syscalls.h>
#include <linux/netdevice.h>
#include <linux/module.h>
#include <linux/tcp.h>
@@ -82,11 +84,6 @@
context stops packets coming through and allows user context to read
the counters or update the rules.

- To be cache friendly on SMP, we arrange them like so:
- [ n-entries ]
- ... cache-align padding ...
- [ n-entries ]
-
Hence the start of any table is given by get_table() below. */

/* The table itself */
@@ -104,7 +101,7 @@
unsigned int underflow[NF_IP_NUMHOOKS];

/* ipt_entry tables: one per CPU */
- char entries[0] ____cacheline_aligned;
+ void *entries[NR_CPUS];
};

static LIST_HEAD(ipt_target);
@@ -113,12 +110,6 @@
#define SET_COUNTER(c,b,p) do { (c).bcnt = (b); (c).pcnt = (p); } while(0)
#define ADD_COUNTER(c,b,p) do { (c).bcnt += (b); (c).pcnt += (p); } while(0)

-#ifdef CONFIG_SMP
-#define TABLE_OFFSET(t,p) (SMP_ALIGN((t)->size)*(p))
-#else
-#define TABLE_OFFSET(t,p) 0
-#endif
-
#if 0
#define down(x) do { printk("DOWN:%u:" #x "\n", __LINE__); down(x); } while(0)
#define down_interruptible(x) ({ int __r; printk("DOWNi:%u:" #x "\n", __LINE__); __r = down_interruptible(x); if (__r != 0) printk("ABORT-DOWNi:%u\n", __LINE__); __r; })
@@ -331,8 +322,7 @@
read_lock_bh(&table->lock);
#endif
IP_NF_ASSERT(table->valid_hooks & (1 << hook));
- table_base = (void *)table->private->entries
- + TABLE_OFFSET(table->private, smp_processor_id());
+ table_base = (void *)table->private->entries[smp_processor_id()];
e = get_entry(table_base, table->private->hook_entry[hook]);

#ifdef CONFIG_NETFILTER_DEBUG
@@ -608,7 +598,7 @@
/* Figures out from what hook each rule can be called: returns 0 if
there are loops. Puts hook bitmask in comefrom. */
static int
-mark_source_chains(struct ipt_table_info *newinfo, unsigned int valid_hooks)
+mark_source_chains(struct ipt_table_info *newinfo, unsigned int valid_hooks, void *entry0)
{
unsigned int hook;

@@ -617,7 +607,7 @@
for (hook = 0; hook < NF_IP_NUMHOOKS; hook++) {
unsigned int pos = newinfo->hook_entry[hook];
struct ipt_entry *e
- = (struct ipt_entry *)(newinfo->entries + pos);
+ = (struct ipt_entry *)(entry0 + pos);

if (!(valid_hooks & (1 << hook)))
continue;
@@ -667,13 +657,13 @@
goto next;

e = (struct ipt_entry *)
- (newinfo->entries + pos);
+ (entry0 + pos);
} while (oldpos == pos + e->next_offset);

/* Move along one */
size = e->next_offset;
e = (struct ipt_entry *)
- (newinfo->entries + pos + size);
+ (entry0 + pos + size);
e->counters.pcnt = pos;
pos += size;
} else {
@@ -690,7 +680,7 @@
newpos = pos + e->next_offset;
}
e = (struct ipt_entry *)
- (newinfo->entries + newpos);
+ (entry0 + newpos);
e->counters.pcnt = pos;
pos = newpos;
}
@@ -900,6 +890,7 @@
translate_table(const char *name,
unsigned int valid_hooks,
struct ipt_table_info *newinfo,
+ void *entry0,
unsigned int size,
unsigned int number,
const unsigned int *hook_entries,
@@ -920,11 +911,11 @@
duprintf("translate_table: size %u\n", newinfo->size);
i = 0;
/* Walk through entries, checking offsets. */
- ret = IPT_ENTRY_ITERATE(newinfo->entries, newinfo->size,
+ ret = IPT_ENTRY_ITERATE(entry0, newinfo->size,
check_entry_size_and_hooks,
newinfo,
- newinfo->entries,
- newinfo->entries + size,
+ entry0,
+ entry0 + size,
hook_entries, underflows, &i);
if (ret != 0)
return ret;
@@ -952,25 +943,24 @@
}
}

- if (!mark_source_chains(newinfo, valid_hooks))
+ if (!mark_source_chains(newinfo, valid_hooks, entry0))
return -ELOOP;

/* Finally, each sanity check must pass */
i = 0;
- ret = IPT_ENTRY_ITERATE(newinfo->entries, newinfo->size,
+ ret = IPT_ENTRY_ITERATE(entry0, newinfo->size,
check_entry, name, size, &i);

if (ret != 0) {
- IPT_ENTRY_ITERATE(newinfo->entries, newinfo->size,
+ IPT_ENTRY_ITERATE(entry0, newinfo->size,
cleanup_entry, &i);
return ret;
}

/* And one copy for every other CPU */
- for (i = 1; i < num_possible_cpus(); i++) {
- memcpy(newinfo->entries + SMP_ALIGN(newinfo->size)*i,
- newinfo->entries,
- SMP_ALIGN(newinfo->size));
+ for_each_cpu(i) {
+ if (newinfo->entries[i] && newinfo->entries[i] != entry0)
+ memcpy(newinfo->entries[i], entry0, newinfo->size);
}

return ret;
@@ -1010,15 +1000,12 @@

#ifdef CONFIG_NETFILTER_DEBUG
{
- struct ipt_entry *table_base;
- unsigned int i;
+ int cpu;

- for (i = 0; i < num_possible_cpus(); i++) {
- table_base =
- (void *)newinfo->entries
- + TABLE_OFFSET(newinfo, i);
-
- table_base->comefrom = 0xdead57ac;
+ for_each_cpu(cpu) {
+ struct ipt_entry *table_base = newinfo->entries[cpu];
+ if (table_base)
+ table_base->comefrom = 0xdead57ac;
}
}
#endif
@@ -1083,7 +1070,7 @@
if (table)
write_lock_bh(per_cpu_ptr(table->lock_p, curcpu));
i = 0;
- IPT_ENTRY_ITERATE(t->entries + TABLE_OFFSET(t, curcpu),
+ IPT_ENTRY_ITERATE(t->entries[curcpu],
t->size,
set_entry_to_counter,
counters,
@@ -1097,7 +1084,7 @@
if (table)
write_lock_bh(per_cpu_ptr(table->lock_p, cpu));
i = 0;
- IPT_ENTRY_ITERATE(t->entries + TABLE_OFFSET(t, cpu),
+ IPT_ENTRY_ITERATE(t->entries[cpu],
t->size,
add_entry_to_counter,
counters,
@@ -1110,7 +1097,7 @@
if (table)
write_lock_bh(&table->lock);
i = 0;
- IPT_ENTRY_ITERATE(t->entries,
+ IPT_ENTRY_ITERATE(t->entries[0],
t->size,
set_entry_to_counter,
counters,
@@ -1129,6 +1116,7 @@
struct ipt_entry *e;
struct ipt_counters *counters;
int ret = 0;
+ void *loc_cpu_entry;

/* We need atomic snapshot of counters: rest doesn't change
(other than comefrom, which userspace doesn't care
@@ -1142,8 +1130,12 @@
/* First, sum counters... */
get_counters(table, table->private, counters);

- /* ... then copy entire thing from CPU 0... */
- if (copy_to_user(userptr, table->private->entries, total_size) != 0) {
+ /*
+ * choose the copy that is on our node/cpu,
+ */
+ loc_cpu_entry = table->private->entries[get_cpu()];
+ /* ... then copy entire thing ... */
+ if (copy_to_user(userptr, loc_cpu_entry, total_size) != 0) {
ret = -EFAULT;
goto free_counters;
}
@@ -1155,7 +1147,7 @@
struct ipt_entry_match *m;
struct ipt_entry_target *t;

- e = (struct ipt_entry *)(table->private->entries + off);
+ e = (struct ipt_entry *)(loc_cpu_entry + off);
if (copy_to_user(userptr + off
+ offsetof(struct ipt_entry, counters),
&counters[num],
@@ -1192,6 +1184,7 @@
}

free_counters:
+ put_cpu();
vfree(counters);
return ret;
}
@@ -1224,6 +1217,60 @@
return ret;
}

+static void free_table_info(struct ipt_table_info *info)
+{
+ int cpu;
+ for_each_cpu(cpu) {
+ if (info->size <= PAGE_SIZE)
+ kfree(info->entries[cpu]);
+ else
+ vfree(info->entries[cpu]);
+ }
+ kfree(info);
+}
+
+static struct ipt_table_info *alloc_table_info(unsigned int size)
+{
+struct ipt_table_info *newinfo;
+int cpu;
+ newinfo = kzalloc(sizeof(struct ipt_table_info), GFP_KERNEL);
+ if (!newinfo)
+ return NULL;
+ newinfo->size = size;
+ for_each_cpu(cpu) {
+ if (size <= PAGE_SIZE) {
+ newinfo->entries[cpu] = kmalloc_node(size,
+ GFP_KERNEL,
+ cpu_to_node(cpu));
+ } else {
+#ifdef CONFIG_NUMA
+ struct mempolicy *oldpol;
+ mm_segment_t oldfs = get_fs();
+ DECLARE_BITMAP(mynode, MAX_NUMNODES);
+
+ oldpol = current->mempolicy;
+ mpol_get(oldpol);
+ bitmap_zero(mynode, MAX_NUMNODES);
+ set_bit(cpu_to_node(cpu), mynode);
+ set_fs(KERNEL_DS);
+ sys_set_mempolicy(MPOL_PREFERRED, mynode, MAX_NUMNODES);
+ set_fs(oldfs);
+#endif
+ newinfo->entries[cpu] = vmalloc(size);
+#ifdef CONFIG_NUMA
+ mpol_free(current->mempolicy);
+ current->mempolicy = oldpol;
+#endif
+ }
+ if (newinfo->entries[cpu] == 0) {
+ free_table_info(newinfo);
+ return NULL;
+ }
+ }
+ return newinfo;
+}
+
+
static int
do_replace(void __user *user, unsigned int len)
{
@@ -1232,6 +1279,7 @@
struct ipt_table *t;
struct ipt_table_info *newinfo, *oldinfo;
struct ipt_counters *counters;
+ void *loc_cpu_entry, *loc_cpu_old_entry;

if (copy_from_user(&tmp, user, sizeof(tmp)) != 0)
return -EFAULT;
@@ -1244,12 +1292,14 @@
if ((SMP_ALIGN(tmp.size) >> PAGE_SHIFT) + 2 > num_physpages)
return -ENOMEM;

- newinfo = vmalloc(sizeof(struct ipt_table_info)
- + SMP_ALIGN(tmp.size) * num_possible_cpus());
+ newinfo = alloc_table_info(tmp.size);
if (!newinfo)
return -ENOMEM;
-
- if (copy_from_user(newinfo->entries, user + sizeof(tmp),
+ /*
+ * choose the copy that is on our node/cpu
+ */
+ loc_cpu_entry = newinfo->entries[get_cpu()];
+ if (copy_from_user(loc_cpu_entry, user + sizeof(tmp),
tmp.size) != 0) {
ret = -EFAULT;
goto free_newinfo;
@@ -1260,10 +1310,9 @@
ret = -ENOMEM;
goto free_newinfo;
}
- memset(counters, 0, tmp.num_counters * sizeof(struct ipt_counters));

ret = translate_table(tmp.name, tmp.valid_hooks,
- newinfo, tmp.size, tmp.num_entries,
+ newinfo, loc_cpu_entry, tmp.size, tmp.num_entries,
tmp.hook_entry, tmp.underflow);
if (ret != 0)
goto free_newinfo_counters;
@@ -1302,8 +1351,10 @@
/* Get the old counters. */
get_counters(NULL, oldinfo, counters);
/* Decrease module usage counts and free resource */
- IPT_ENTRY_ITERATE(oldinfo->entries, oldinfo->size, cleanup_entry,NULL);
- vfree(oldinfo);
+ loc_cpu_old_entry = oldinfo->entries[smp_processor_id()];
+ IPT_ENTRY_ITERATE(loc_cpu_old_entry, oldinfo->size, cleanup_entry,NULL);
+ put_cpu();
+ free_table_info(oldinfo);
if (copy_to_user(tmp.counters, counters,
sizeof(struct ipt_counters) * tmp.num_counters) != 0)
ret = -EFAULT;
@@ -1315,11 +1366,12 @@
module_put(t->me);
up(&ipt_mutex);
free_newinfo_counters_untrans:
- IPT_ENTRY_ITERATE(newinfo->entries, newinfo->size, cleanup_entry,NULL);
+ IPT_ENTRY_ITERATE(loc_cpu_entry, newinfo->size, cleanup_entry,NULL);
free_newinfo_counters:
vfree(counters);
free_newinfo:
- vfree(newinfo);
+ put_cpu();
+ free_table_info(newinfo);
return ret;
}

@@ -1352,6 +1404,7 @@
struct ipt_counters_info tmp, *paddc;
struct ipt_table *t;
int ret = 0;
+ void *loc_cpu_entry;
#ifdef CONFIG_SMP
rwlock_t *lockp;
#endif
@@ -1389,7 +1442,11 @@
}

i = 0;
- IPT_ENTRY_ITERATE(t->private->entries,
+ /*
+ * choose the copy that is on our node,
+ */
+ loc_cpu_entry = t->private->entries[smp_processor_id()];
+ IPT_ENTRY_ITERATE(loc_cpu_entry,
t->private->size,
add_counter_to_entry,
paddc->counters,
@@ -1584,8 +1641,15 @@
{
int ret;
struct ipt_table_info *newinfo;
- static struct ipt_table_info bootstrap
- = { 0, 0, 0, { 0 }, { 0 }, { } };
+ static struct ipt_table_info bootstrap = {
+ .size = 0,
+ .number = 0,
+ .initial_entries = 0,
+ .hook_entry = { 0 },
+ .underflow = { 0 },
+ .entries = {NULL }
+ };
+ void *loc_cpu_entry;
#ifdef CONFIG_SMP
int cpu;
if (!table->lock_p) {
@@ -1597,26 +1661,30 @@
#else
rwlock_init(&table->lock);
#endif
- newinfo = vmalloc(sizeof(struct ipt_table_info)
- + SMP_ALIGN(repl->size) * num_possible_cpus());
+
+ newinfo = alloc_table_info(repl->size);
if (!newinfo)
return -ENOMEM;
-
- memcpy(newinfo->entries, repl->entries, repl->size);
+ /*
+ * choose the copy that is on our node/cpu
+ */
+ loc_cpu_entry = newinfo->entries[get_cpu()];
+ memcpy(loc_cpu_entry, repl->entries, repl->size);

ret = translate_table(table->name, table->valid_hooks,
- newinfo, repl->size,
+ newinfo, loc_cpu_entry, repl->size,
repl->num_entries,
repl->hook_entry,
repl->underflow);
+ put_cpu();
if (ret != 0) {
- vfree(newinfo);
+ free_table_info(newinfo);
return ret;
}

ret = down_interruptible(&ipt_mutex);
if (ret != 0) {
- vfree(newinfo);
+ free_table_info(newinfo);
return ret;
}

@@ -1644,20 +1712,25 @@
return ret;

free_unlock:
- vfree(newinfo);
+ free_table_info(newinfo);
goto unlock;
}

void ipt_unregister_table(struct ipt_table *table)
{
+ void *loc_cpu_entry;
down(&ipt_mutex);
LIST_DELETE(&ipt_tables, table);
up(&ipt_mutex);

- /* Decrease module usage counts and free resources */
- IPT_ENTRY_ITERATE(table->private->entries, table->private->size,
+ /* Decrease module usage counts and free resources
+ * choose the copy that is on our node/cpu
+ */
+ loc_cpu_entry = table->private->entries[get_cpu()];
+ IPT_ENTRY_ITERATE(loc_cpu_entry, table->private->size,
cleanup_entry, NULL);
- vfree(table->private);
+ put_cpu();
+ free_table_info(table->private);
#ifdef CONFIG_SMP
free_percpu(table->lock_p);
table->lock_p = NULL;


Attachments:
patch_ip_tables_numa.3 (12.68 kB)

2005-09-21 22:43:53

by Christoph Lameter

Subject: Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance

On Wed, 21 Sep 2005, Eric Dumazet wrote:

> Instead of using one vmalloc() area (located in the node of the iptables
> process), we now allocate an area for each possible CPU, using NUMA policy
> (MPOL_PREFERRED) so that memory should be allocated in the CPU's node
> if possible.
>
> If the size of ipt_table is small enough (less than one page), we use
> kmalloc_node() instead of vmalloc(), to use less memory and less TLB entries)
> in small setups.

Maybe we better introduce vmalloc_node() instead of improvising this for
several subsystems? The e1000 driver has similar issues.

2005-09-22 00:34:42

by David Miller

Subject: Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance

From: Christoph Lameter <[email protected]>
Date: Wed, 21 Sep 2005 15:43:29 -0700 (PDT)

> Maybe we better introduce vmalloc_node() instead of improvising this for
> several subsystems? The e1000 driver has similar issues.

I agree.

2005-09-22 01:45:19

by Christoph Lameter

Subject: Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance

On Wed, 21 Sep 2005, David S. Miller wrote:

> From: Christoph Lameter <[email protected]>
> Date: Wed, 21 Sep 2005 15:43:29 -0700 (PDT)
>
> > Maybe we better introduce vmalloc_node() instead of improvising this for
> > several subsystems? The e1000 driver has similar issues.
>
> I agree.

I did an implementation in June.

See http://marc.theaimsgroup.com/?l=linux-mm&m=111766643127530&w=2

Not sure if this will fit the bill. Never really tested it.

2005-09-22 04:18:10

by James Morris

Subject: Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance

On Wed, 21 Sep 2005, Eric Dumazet wrote:

> I have reworked net/ipv4/ip_tables.c to boost its performance, and post three
> patches.

Do you have any performance measurements?


- James
--
James Morris
<[email protected]>

2005-09-22 05:08:03

by Eric Dumazet

Subject: Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance

James Morris wrote:
>
> Do you have any performance measurements?

Yes, as I said in the first mail:

>In oprofile results, ipt_do_table() was at the first position.
>It is now at 6th position, using 1/3 of the CPU it was using before.
>(Tests done on a dual Xeon i386 and a dual Opteron x86_64)

On the dual Opteron machine, with 40,000 packets coming in per second and
35,000 sent per second, the numbers were: 12.8% before the patches, 4.4% after
the patches.

I don't have separate perf measurements for each patch.

Considering that I inlined the read_lock_bh() call (not displayed in
oprofile results, probably because of the special .spinlock.text section),
which should have increased the profile of ipt_do_table(), that's a lot of CPU
cycles and memory bandwidth made available for other jobs.

Eric

2005-09-22 12:11:50

by Eric Dumazet

Subject: Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance

--- linux-2.6.14-rc2/include/linux/vmalloc.h 2005-09-20 05:00:41.000000000 +0200
+++ linux-2.6.14-rc2-ed/include/linux/vmalloc.h 2005-09-22 11:34:55.000000000 +0200
@@ -1,6 +1,7 @@
#ifndef _LINUX_VMALLOC_H
#define _LINUX_VMALLOC_H

+#include <linux/config.h> /* vmalloc_node() needs CONFIG_ options */
#include <linux/spinlock.h>
#include <asm/page.h> /* pgprot_t */

@@ -32,6 +33,14 @@
* Highlevel APIs for driver use
*/
extern void *vmalloc(unsigned long size);
+#ifdef CONFIG_NUMA
+extern void *vmalloc_node(unsigned long size, int node);
+#else
+static inline void *vmalloc_node(unsigned long size, int node)
+{
+ return vmalloc(size);
+}
+#endif
extern void *vmalloc_exec(unsigned long size);
extern void *vmalloc_32(unsigned long size);
extern void *__vmalloc(unsigned long size, unsigned int __nocast gfp_mask, pgprot_t prot);
--- linux-2.6.14-rc2/mm/vmalloc.c 2005-09-20 05:00:41.000000000 +0200
+++ linux-2.6.14-rc2-ed/mm/vmalloc.c 2005-09-22 11:55:19.000000000 +0200
@@ -19,6 +19,9 @@
#include <asm/uaccess.h>
#include <asm/tlbflush.h>

+#ifdef CONFIG_NUMA
+#include <linux/mempolicy.h>
+#endif

DEFINE_RWLOCK(vmlist_lock);
struct vm_struct *vmlist;
@@ -471,7 +474,7 @@
* Allocate enough pages to cover @size from the page level
* allocator and map them into contiguous kernel virtual space.
*
- * For tight cotrol over page level allocator and protection flags
+ * For tight control over page level allocator and protection flags
* use __vmalloc() instead.
*/
void *vmalloc(unsigned long size)
@@ -481,6 +484,40 @@

EXPORT_SYMBOL(vmalloc);

+#ifdef CONFIG_NUMA
+/**
+ * vmalloc_node - allocate virtually contiguous memory
+ *
+ * @size: allocation size
+ * @node: preferred node
+ *
+ * This vmalloc variant try to allocate memory from a preferred node.
+ */
+void *vmalloc_node(unsigned long size, int node)
+{
+ void *result;
+ struct mempolicy *oldpol = current->mempolicy;
+ mm_segment_t oldfs = get_fs();
+ DECLARE_BITMAP(prefnode, MAX_NUMNODES);
+
+ mpol_get(oldpol);
+ bitmap_zero(prefnode, MAX_NUMNODES);
+ set_bit(node, prefnode);
+
+ set_fs(KERNEL_DS);
+ sys_set_mempolicy(MPOL_PREFERRED, prefnode, MAX_NUMNODES);
+ set_fs(oldfs);
+
+ result = vmalloc(size);
+
+ mpol_free(current->mempolicy);
+ current->mempolicy = oldpol;
+ return result;
+}
+
+EXPORT_SYMBOL(vmalloc_node);
+#endif
+
#ifndef PAGE_KERNEL_EXEC
# define PAGE_KERNEL_EXEC PAGE_KERNEL
#endif


Attachments:
vmalloc_node (2.36 kB)

2005-09-22 12:48:12

by Harald Welte

Subject: Re: [PATCH 2/3] netfilter : 3 patches to boost ip_tables performance

On Wed, Sep 21, 2005 at 11:32:24PM +0200, Eric Dumazet wrote:
> Patch 2/3 (please apply after Patch 1/3)
>
> 2) Loop unrolling

First of all, thanks for your performance analysis and patches. I'm
very inclined to merge them.

> It seems that with current compilers and CFLAGS, the code from
> ip_packet_match() is very bad, using lot of mispredicted conditional branches I made some patches
> and generated code on i386 and x86_64
> is much better.

This only describes your "compare_if_lstrings()" changes, and I'm happy
to merge them.

However, you also removed the use of the FWINV macro and replaced it with
explicit code (including the bool1/bool2 variables, which are not really
named intuitively). Why was this necessary?

--
- Harald Welte <[email protected]> http://netfilter.org/
============================================================================
"Fragmentation is like classful addressing -- an interesting early
architectural error that shows how much experimentation was going
on while IP was being designed." -- Paul Vixie



2005-09-22 12:50:08

by Harald Welte

Subject: Re: [PATCH 3/3] netfilter : 3 patches to boost ip_tables performance

On Wed, Sep 21, 2005 at 11:37:20PM +0200, Eric Dumazet wrote:
> Patch 3/3 (please apply after Patch 2/3)

Thanks for your patches, again.

> 3) NUMA allocation.

I'll hold off on applying this patch until a decision has been made on
merging something like vmalloc_node() into mainline.

--
- Harald Welte <[email protected]> http://netfilter.org/
============================================================================
"Fragmentation is like classful addressing -- an interesting early
architectural error that shows how much experimentation was going
on while IP was being designed." -- Paul Vixie



2005-09-22 12:49:49

by Christoph Hellwig

Subject: Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance

On Thu, Sep 22, 2005 at 02:11:26PM +0200, Eric Dumazet wrote:
> +#ifdef CONFIG_NUMA
> +/**
> + * vmalloc_node - allocate virtually contiguous memory
> + *
> + * @size: allocation size
> + * @node: preferred node
> + *
> + * This vmalloc variant try to allocate memory from a preferred node.
> + */
> +void *vmalloc_node(unsigned long size, int node)
> +{
> + void *result;
> + struct mempolicy *oldpol = current->mempolicy;
> + mm_segment_t oldfs = get_fs();
> + DECLARE_BITMAP(prefnode, MAX_NUMNODES);
> +
> + mpol_get(oldpol);
> + bitmap_zero(prefnode, MAX_NUMNODES);
> + set_bit(node, prefnode);
> +
> + set_fs(KERNEL_DS);
> + sys_set_mempolicy(MPOL_PREFERRED, prefnode, MAX_NUMNODES);
> + set_fs(oldfs);
> +
> + result = vmalloc(size);
> +
> + mpol_free(current->mempolicy);
> + current->mempolicy = oldpol;
> + return result;
> +}

No way, sorry. If you want a vmalloc node do it right.

2005-09-22 12:54:46

by Andi Kleen

Subject: Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance

On Thursday 22 September 2005 14:49, Christoph Hellwig wrote:
> On Thu, Sep 22, 2005 at 02:11:26PM +0200, Eric Dumazet wrote:
> > +#ifdef CONFIG_NUMA
> > +/**
> > + * vmalloc_node - allocate virtually contiguous memory
> > + *
> > + * @size: allocation size
> > + * @node: preferred node
> > + *
> > + * This vmalloc variant try to allocate memory from a preferred node.
> > + */
> > +void *vmalloc_node(unsigned long size, int node)
> > +{
> > + void *result;
> > + struct mempolicy *oldpol = current->mempolicy;
> > + mm_segment_t oldfs = get_fs();
> > + DECLARE_BITMAP(prefnode, MAX_NUMNODES);
> > +
> > + mpol_get(oldpol);
> > + bitmap_zero(prefnode, MAX_NUMNODES);
> > + set_bit(node, prefnode);
> > +
> > + set_fs(KERNEL_DS);
> > + sys_set_mempolicy(MPOL_PREFERRED, prefnode, MAX_NUMNODES);
> > + set_fs(oldfs);
> > +
> > + result = vmalloc(size);
> > +
> > + mpol_free(current->mempolicy);
> > + current->mempolicy = oldpol;
> > + return result;
> > +}
>
> No way, sorry. If you want a vmalloc node do it right.

The implementation looks fine to me, so I think it's already right.

-Andi

2005-09-22 12:57:29

by Harald Welte

Subject: Re: [PATCH 1/3] netfilter : 3 patches to boost ip_tables performance

On Wed, Sep 21, 2005 at 11:29:13PM +0200, Eric Dumazet wrote:
> Patch 1/3
>
> 1) No more one rwlock_t protecting the 'curtain'

I have no problem with this change "per se", but with the
implementation.

As of now, we live without any ugly #ifdef CONFIG_SMP / #endif sections
in the code - and if possible, I would continue this good tradition.

For example the get_counters() function. Wouldn't all the smp specific
code (for_each_cpu(), ...) be #defined to nothing anyway?

And if we really need the #ifdefs, I would appreciate it if those
sections were as small as possible. In get_counters() the section can
definitely be smaller, rather than basically having the whole function
body separate for the SMP and non-SMP cases.

Also, how much would we lose in runtime performance if we were using a
"rwlock_t *" even in the UP case? I mean, it's just one more pointer
dereference of something that is expected to be in cache anyway, isn't
it? This gets rid of another huge set of #ifdefs that make the code
unreadable and prone to errors being introduced later on.

--
- Harald Welte <[email protected]> http://netfilter.org/
============================================================================
"Fragmentation is like classful addressing -- an interesting early
architectural error that shows how much experimentation was going
on while IP was being designed." -- Paul Vixie



2005-09-22 12:58:56

by Christoph Hellwig

Subject: Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance

On Thu, Sep 22, 2005 at 02:54:22PM +0200, Andi Kleen wrote:
> > > +void *vmalloc_node(unsigned long size, int node)
> > > +{
> > > + void *result;
> > > + struct mempolicy *oldpol = current->mempolicy;
> > > + mm_segment_t oldfs = get_fs();
> > > + DECLARE_BITMAP(prefnode, MAX_NUMNODES);
> > > +
> > > + mpol_get(oldpol);
> > > + bitmap_zero(prefnode, MAX_NUMNODES);
> > > + set_bit(node, prefnode);
> > > +
> > > + set_fs(KERNEL_DS);
> > > + sys_set_mempolicy(MPOL_PREFERRED, prefnode, MAX_NUMNODES);
> > > + set_fs(oldfs);
> > > +
> > > + result = vmalloc(size);
> > > +
> > > + mpol_free(current->mempolicy);
> > > + current->mempolicy = oldpol;
> > > + return result;
> > > +}
> >
> > No way, sorry. If you want a vmalloc node do it right.
>
> The implementation looks fine to me, so I think it's already right.

Umm, no - adding a set_fs/get_fs mess for things like that is not right.
If we want to go down the mempolicy-based route, we need to add a proper
kernel entry point for setting a mempolicy.

2005-09-22 13:03:23

by Andi Kleen

Subject: Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance


> 1) No more central rwlock protecting each table (filter, nat, mangle, raw),
> but one lock per CPU. It avoids cache line ping pongs for each packet.

Another useful change would be to not take the lock when there are no
rules. Currently just loading iptables has a large overhead.

-Andi

2005-09-22 13:05:55

by Eric Dumazet

Subject: Re: [PATCH 2/3] netfilter : 3 patches to boost ip_tables performance

Harald Welte wrote:
> On Wed, Sep 21, 2005 at 11:32:24PM +0200, Eric Dumazet wrote:
>
>>Patch 2/3 (please apply after Patch 1/3)
>>
>>2) Loop unrolling
>
>
> First of all, thanks for your performance analysis and patches. I'm
> very inclined of merging them.
>
>
>>It seems that with current compilers and CFLAGS, the code from
>>ip_packet_match() is very bad, using lot of mispredicted conditional branches I made some patches
>>and generated code on i386 and x86_64
>>is much better.
>
>
> This only describes your "compare_if_lstrings()" changes, and I'm happy
> to merge them.
>
> However, you also removed the use of the FWINV macro and replaced it by
> explicit code (including the bool1/bool2 variables, which are not really
> named intuitively). Why was this neccessary?
>

It was necessary to get the best code with gcc-3.4.4 on i386 and gcc-4.0.1 on
x86_64.

For example :

bool1 = FWINV(ret != 0, IPT_INV_VIA_OUT);
if (bool1) {

gives better code than:

if (FWINV(ret != 0, IPT_INV_VIA_OUT)) {

(one less conditional branch)

Don't ask me why; it is shocking but true :(

Eric

2005-09-22 13:05:13

by Andi Kleen

Subject: Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance

On Thursday 22 September 2005 14:58, Christoph Hellwig wrote:

> Umm, no - adding set_fs/get_fs mess for things like that is not right.

I think it's fine. We're using it for various other interfaces too. In fact
sys_set_mempolicy is already used elsewhere in the kernel too.

-Andi

2005-09-22 13:17:56

by Eric Dumazet

Subject: Re: [PATCH 1/3] netfilter : 3 patches to boost ip_tables performance

Harald Welte wrote:
> On Wed, Sep 21, 2005 at 11:29:13PM +0200, Eric Dumazet wrote:
>
>>Patch 1/3
>>
>>1) No more one rwlock_t protecting the 'curtain'
>
>
> I have no problem with this change "per se", but with the
> implementation.
>
> As of now, we live without any ugly #ifdef CONFIG_SMP / #endif sections
> in the code - and if possible, I would continue this good tradition.
>
> For example the get_counters() function. Wouldn't all the smp specific
> code (for_each_cpu(), ...) be #defined to nothing anyway?

Well... not exactly, but you are right that only the first loop (SET_COUNTER)
will really do something. The 'if (cpu == curcpu)' test will be true, but the
compiler won't know that; cpu and curcpu are still C variables.


>
> And if we really need the #ifdef's, I would appreciate if those
> sectionas are as small as possible. in get_counters() the section can
> definitely be smaller, rather than basically having the whole function
> body separate for smp and non-smp cases.

get_counters() is not critical, so I agree with you that we can stick with the
general version (not the UP-optimized one).

>
> Also, how much would we loose in runtime performance if we were using a
> "rwlock_t *" even in the UP case?. I mean, it's just one more pointer
> dereference of something that is expected to be in cache anyway, isn't
> it? This gets rid of another huge set of #ifdefs that make the code
> unreadable and prone to errors being introduced later on.
>

Well, in the UP case, the rwlock_t is an empty definition anyway.

I was inspired by another use of percpu data in include/linux/genhd.h:
#ifdef CONFIG_SMP
struct disk_stats *dkstats;
#else
struct disk_stats dkstats;
#endif

But if you dislike this, we can use a pointer in all cases.

Eric

2005-09-22 13:30:43

by Eric Dumazet

Subject: Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance

Andi Kleen wrote:
>>1) No more central rwlock protecting each table (filter, nat, mangle, raw),
>> but one lock per CPU. It avoids cache line ping pongs for each packet.
>
>
> Another useful change would be to not take the lock when there are no
> rules. Currently just loading iptables has a large overhead.
>

Unfortunately there are always rules after iptables is loaded, at least
for the "packet_filter" table.

Eric

2005-09-22 15:38:04

by Christoph Lameter

Subject: Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance

On Thu, 22 Sep 2005, Andi Kleen wrote:

> On Thursday 22 September 2005 14:58, Christoph Hellwig wrote:
>
> > Umm, no - adding set_fs/get_fs mess for things like that is not right.
>
> I think it's fine. We're using it for various other interfaces too. In fact
> sys_set_mempolicy is already used elsewhere in the kernel too.

It should really be do_set_mempolicy instead, to be cleaner. I have a patch
here that fixes the policy layer.

But I still agree with Christoph that a real vmalloc_node is better. There
will be no fuzzing around with memory policies etc., and it's certainly
better performance-wise.

2005-09-22 15:51:25

by Eric Dumazet

Subject: Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance

Christoph Lameter wrote:
>
> It should really be do_set_mempolicy instead to be cleaner. I got a patch
> here that fixes the policy layer.
>
> But still I agree with Christoph that a real vmalloc_node is better. There
> will be no fuzzing around with memory policies etc and its certainly
> better performance wise.

vmalloc_node() should be seldom used: at driver init, or when a new ip_tables
is loaded. If it happens to be a performance problem, then we can optimize it.
Why should we spend days of work on a function that is yet to be used?

Eric

2005-09-22 15:55:33

by Christoph Lameter

Subject: Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance

On Thu, 22 Sep 2005, Eric Dumazet wrote:

> vmalloc_node() should be seldom used, at driver init, or when a new ip_tables
> is loaded. If it happens to be a performance problem, then we can optimize it.

Alright. However, there are a couple of uses of vmalloc_node that I can
see right now, and they will likely increase in the future.

> Why should we spend days of work for a function that is yet to be used ?

I already did a vmalloc_node patch, so there is no need to spend days of work.
The patch can wait until it becomes performance critical.


2005-09-23 04:05:07

by Willy Tarreau

Subject: Re: [PATCH 2/3] netfilter : 3 patches to boost ip_tables performance

On Thu, Sep 22, 2005 at 03:05:50PM +0200, Eric Dumazet wrote:
(...)
> It was necessary to get the best code with gcc-3.4.4 on i386 and
> gcc-4.0.1 on x86_64
>
> For example :
>
> bool1 = FWINV(ret != 0, IPT_INV_VIA_OUT);
> if (bool1) {
>
> gives a better code than :
>
> if (FWINV(ret != 0, IPT_INV_VIA_OUT)) {
>
> (one less conditional branch)
>
> Dont ask me why, it is shocking but true :(

I have also noticed many times that gcc's optimization of "if (complex condition)"
is rather poor, and it's often better to put it in a variable beforehand. I even
remember that if you use an intermediate variable, gcc can often generate a
CMOV instruction on processors which support it, while it produces conditional
tests and jumps without the variable. Generally speaking, if you want fast code,
you have to write it as a long sequence of small instructions, just as if
you were writing assembly. As you said, shocking but true.

BTW, cheers for your optimizations !

Regards,
Willy

2005-09-23 05:14:41

by Eric Dumazet

Subject: Re: [PATCH 2/3] netfilter : 3 patches to boost ip_tables performance

Willy Tarreau wrote:
> On Thu, Sep 22, 2005 at 03:05:50PM +0200, Eric Dumazet wrote:
> (...)
>
>>It was necessary to get the best code with gcc-3.4.4 on i386 and
>>gcc-4.0.1 on x86_64
>>
>>For example :
>>
>>bool1 = FWINV(ret != 0, IPT_INV_VIA_OUT);
>>if (bool1) {
>>
>>gives a better code than :
>>
>>if (FWINV(ret != 0, IPT_INV_VIA_OUT)) {
>>
>>(one less conditional branch)
>>
>>Dont ask me why, it is shocking but true :(
>
>
> I also noticed many times that gcc's optimization of "if (complex condition)"
> is rather poor and it's often better to put it in a variable before. I even
> remember that if you use an intermediate variable, it can often generate a
> CMOV instruction on processors which support it, while it produces cond tests
> and jumps without the variable. Generally speaking, if you want fast code,
> you have to write it as a long sequence of small instructions, just as if
> you were writing assembly. As you said, shocking but true.

Even without CMOV support, the suggested patch helps:

Here is the code generated with gcc-3.4.4 on a Pentium 4 (i686) for:

/********************/
bool1 = ((ip->saddr&ipinfo->smsk.s_addr) != ipinfo->src.s_addr);
bool1 ^= !!(ipinfo->invflags & IPT_INV_SRCIP);

bool2 = ((ip->daddr&ipinfo->dmsk.s_addr) != ipinfo->dst.s_addr);
bool2 ^= !!(ipinfo->invflags & IPT_INV_DSTIP);

if ((bool1 | bool2) != 0) {

/********************/
cb: 0f b6 56 53 movzbl 0x53(%esi),%edx
cf: 8b 46 08 mov 0x8(%esi),%eax #ip->saddr
d2: 23 47 0c and 0xc(%edi),%eax #ipinfo->smsk.s_addr
d5: 0f b6 da movzbl %dl,%ebx
d8: 3b 06 cmp (%esi),%eax #ipinfo->src.s_addr
da: 88 55 cf mov %dl,0xffffffcf(%ebp)
dd: 89 da mov %ebx,%edx
df: 0f 95 c0 setne %al
e2: c1 ea 03 shr $0x3,%edx
e5: 31 c2 xor %eax,%edx
e7: 8b 46 0c mov 0xc(%esi),%eax #ip->daddr&ipinfo
ea: 23 47 10 and 0x10(%edi),%eax #ipinfo->dmsk.s_addr
ed: 3b 46 04 cmp 0x4(%esi),%eax #ipinfo->dst.s_addr
f0: 89 d8 mov %ebx,%eax
f2: 0f 95 c1 setne %cl
f5: c1 e8 04 shr $0x4,%eax
f8: 31 c8 xor %ecx,%eax
fa: 09 d0 or %edx,%eax
fc: a8 01 test $0x1,%al
fe: 0f 85 95 00 00 00 jne dest // only one conditional branch

As you can see, the whole sequence is rather good: only one conditional branch.
(No CMOV instructions, so even on an i486 the code should be
roughly the same.)

Now here is the code generated for the original version:
/********************/
if (FWINV((ip->saddr&ipinfo->smsk.s_addr) != ipinfo->src.s_addr,
IPT_INV_SRCIP)
|| FWINV((ip->daddr&ipinfo->dmsk.s_addr) != ipinfo->dst.s_addr,
IPT_INV_DSTIP)) {
/********************/
cb: 0f b6 4e 53 movzbl 0x53(%esi),%ecx
cf: f6 c1 08 test $0x8,%cl
d2: 0f 84 af 01 00 00 je 287 <ipt_do_table+0x25d>
d8: 8b 46 08 mov 0x8(%esi),%eax
db: 23 47 0c and 0xc(%edi),%eax
de: 3b 06 cmp (%esi),%eax
e0: 0f 84 b0 01 00 00 je 296 <ipt_do_table+0x26c>
e6: f6 c1 10 test $0x10,%cl
e9: 0f 84 d4 01 00 00 je 2c3 <ipt_do_table+0x299>
ef: 8b 46 0c mov 0xc(%esi),%eax
f2: 23 47 10 and 0x10(%edi),%eax
f5: 3b 46 04 cmp 0x4(%esi),%eax
f8: 0f 84 98 01 00 00 je 296 <ipt_do_table+0x26c>

...

287: 8b 46 08 mov 0x8(%esi),%eax
28a: 23 47 0c and 0xc(%edi),%eax
28d: 3b 06 cmp (%esi),%eax
28f: 2e 0f 84 50 fe ff ff je,pn e6 <ipt_do_table+0xbc>
296: 0f b7 46 5a movzwl 0x5a(%esi),%eax
29a: 01 c6 add %eax,%esi
29c: 8b 4d f0 mov 0xfffffff0(%ebp),%ecx
29f: 85 c9 test %ecx,%ecx
2a1: 0f 84 24 fe ff ff je cb <ipt_do_table+0xa1>

...

2c3: 8b 46 0c mov 0xc(%esi),%eax
2c6: 23 47 10 and 0x10(%edi),%eax
2c9: 3b 46 04 cmp 0x4(%esi),%eax
2cc: 75 c8 jne 296 <ipt_do_table+0x26c>
2ce: e9 2b fe ff ff jmp fe <ipt_do_table+0xd4>


/******************/
As you can see, that is a lot of conditional branches, which cannot be predicted
correctly by the CPU unless consecutive iptables rules generate the same flow.


Eric

2005-09-23 11:36:12

by Willy Tarreau

Subject: Re: [PATCH 2/3] netfilter : 3 patches to boost ip_tables performance

On Fri, Sep 23, 2005 at 07:14:24AM +0200, Eric Dumazet wrote:

> Even without CMOV support, the suggested patch helps :
>
> Here is the code generated with gcc-3.4.4 on a pentium4 (i686) for :
>
> /********************/
> bool1 = ((ip->saddr&ipinfo->smsk.s_addr) != ipinfo->src.s_addr);
> bool1 ^= !!(ipinfo->invflags & IPT_INV_SRCIP);
>
> bool2 = ((ip->daddr&ipinfo->dmsk.s_addr) != ipinfo->dst.s_addr);
> bool2 ^= !!(ipinfo->invflags & IPT_INV_DSTIP);
>
> if ((bool1 | bool2) != 0) {
>
> /********************/

(...)

I totally agree with your demonstration. It would be interesting to compare
the same code on an architecture with more registers (e.g. sparc). One of
the reasons for the bad optimization of such constructs on x86 seems to be the
lack of registers for the number of variables and intermediate results.
When you write the code like above, you show the workflow to the compiler
(and I often use the same technique).

Regards,
Willy

2005-09-23 14:00:30

by Tim Mattox

Subject: Re: [PATCH 2/3] netfilter : 3 patches to boost ip_tables performance

AFAIK, the usual reason gcc can generate better code when you write
bool = complex conditional
if (bool) ...
is because of C's short-circuit conditional evaluation rules for || and &&. By
moving the complex expression to an assignment statement, you are telling
gcc that it is okay and safe to evaluate every part of the expression.
If gcc were smart enough to prove that all parts of the complex
expression were safe (no possible null pointer dereferences) and had no
side effects, then it could generate the same code with
if (complex conditional) ...

However, there are many situations where even a magical compiler
couldn't prove that there were no possible side effects, etc., and would
have to generate multiple conditional branches to properly meet the
short-circuit conditional evaluation rules for && and ||.

However, in the specific cases in this thread using FWINV without
|| and && operators, an optimizing compiler "should" be smart enough
to generate more linear code for today's heavily pipelined CPUs.
For now, I guess it's still the duty of the programmer to use coding
style to force the compiler to generate more linear machine code.
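
A toy example of the point (the variable names are purely illustrative):

	/* With ||, the short-circuit rule forbids evaluating the second test
	 * when the first one already decides the outcome, so gcc must emit
	 * one conditional branch per test: */
	if ((a & mask_a) != want_a || (b & mask_b) != want_b)
		return 0;

	/* Hoisting each test into an assignment forces both to be evaluated
	 * unconditionally (they are side-effect free), so they can be merged
	 * with a bitwise | and checked with a single branch: */
	fail_a = ((a & mask_a) != want_a);
	fail_b = ((b & mask_b) != want_b);
	if (fail_a | fail_b)
		return 0;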

On 9/23/05, Willy Tarreau <[email protected]> wrote:
> On Thu, Sep 22, 2005 at 03:05:50PM +0200, Eric Dumazet wrote:
> (...)
> > It was necessary to get the best code with gcc-3.4.4 on i386 and
> > gcc-4.0.1 on x86_64
> >
> > For example :
> >
> > bool1 = FWINV(ret != 0, IPT_INV_VIA_OUT);
> > if (bool1) {
> >
> > gives a better code than :
> >
> > if (FWINV(ret != 0, IPT_INV_VIA_OUT)) {
> >
> > (one less conditional branch)
> >
> > Dont ask me why, it is shocking but true :(
>
> I also noticed many times that gcc's optimization of "if (complex condition)"
> is rather poor and it's often better to put it in a variable before. I even
> remember that if you use an intermediate variable, it can often generate a
> CMOV instruction on processors which support it, while it produces cond tests
> and jumps without the variable. Generally speaking, if you want fast code,
> you have to write it as a long sequence of small instructions, just as if
> you were writing assembly. As you said, shocking but true.
>
> BTW, cheers for your optimizations !
>
> Regards,
> Willy
>
> -
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

--
Tim Mattox - [email protected]
http://homepage.mac.com/tmattox/
I'm a bright... http://www.the-brights.net/

2005-09-23 17:09:15

by Harald Welte

Subject: Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance

On Thu, Sep 22, 2005 at 03:03:21PM +0200, Andi Kleen wrote:
>
> > 1) No more central rwlock protecting each table (filter, nat, mangle, raw),
> > but one lock per CPU. It avoids cache line ping pongs for each packet.
>
> Another useful change would be to not take the lock when there are no
> rules. Currently just loading iptables has a large overhead.

This is partially due to the netfilter hooks that are registered (so we
always take nf_hook_slow() in the NF_HOOK() macro).

The default policy of each built-in iptables chain is internally implemented
as a rule. Thus, policies, being built-in rules, have packet/byte counters.

Therefore, without making a semantic change, we cannot do any of the
following optimizations:

1) not take a lock when the chain is empty
2) not register at the netfilter hook when the chain is empty.

This is well-known, but I don't think we can change the semantics for
the user during a stable kernel series. That's one point where not
having 2.7.x really hurts.

--
- Harald Welte <[email protected]> http://netfilter.org/
============================================================================
"Fragmentation is like classful addressing -- an interesting early
architectural error that shows how much experimentation was going
on while IP was being designed." -- Paul Vixie


2005-09-23 17:11:27

by Harald Welte

[permalink] [raw]
Subject: Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance

On Thu, Sep 22, 2005 at 05:50:49PM +0200, Eric Dumazet wrote:
> Christoph Lameter a écrit :
> >It should really be do_set_mempolicy instead to be cleaner. I got a patch here that fixes the
> >policy layer.
> >But still I agree with Christoph that a real vmalloc_node is better. There will be no fuzzing
> >around with memory policies etc and its certainly better performance wise.
>
> vmalloc_node() should be seldom used, at driver init, or when a new
> ip_tables is loaded. If it happens to be a performance problem, then
> we can optimize it. Why should we spend days of work for a function
> that is yet to be used ?

I see a contradiction in your sentence: "a new ip_tables is loaded"
every time a user changes a single rule. There are numerous setups that
dynamically change the ruleset (e.g. at interface up/down time, or even
think of your typical wlan hotspot, where once a user is authorized,
he'll get different rules).

--
- Harald Welte <[email protected]> http://netfilter.org/
============================================================================
"Fragmentation is like classful addressing -- an interesting early
architectural error that shows how much experimentation was going
on while IP was being designed." -- Paul Vixie


2005-09-23 17:45:21

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance

Here is an updated version of the vmalloc_node patch:

This patch adds

vmalloc_node(size, node) -> Allocate necessary memory on the specified node

and

get_vm_area_node(size, flags, node)

and the other functions that it depends on.

Index: linux-2.6.14-rc2/include/linux/vmalloc.h
===================================================================
--- linux-2.6.14-rc2.orig/include/linux/vmalloc.h 2005-09-19 20:00:41.000000000 -0700
+++ linux-2.6.14-rc2/include/linux/vmalloc.h 2005-09-23 10:28:37.000000000 -0700
@@ -32,22 +32,35 @@ struct vm_struct {
* Highlevel APIs for driver use
*/
extern void *vmalloc(unsigned long size);
+extern void *vmalloc_node(unsigned long size, int node);
extern void *vmalloc_exec(unsigned long size);
extern void *vmalloc_32(unsigned long size);
-extern void *__vmalloc(unsigned long size, unsigned int __nocast gfp_mask, pgprot_t prot);
-extern void *__vmalloc_area(struct vm_struct *area, unsigned int __nocast gfp_mask, pgprot_t prot);
-extern void vfree(void *addr);
+extern void *__vmalloc(unsigned long size, unsigned int __nocast gfp_mask,
+ pgprot_t prot, int node);
+extern void *__vmalloc_area(struct vm_struct *area, unsigned int __nocast gfp_mask,
+ pgprot_t prot, int node);

+extern void vfree(void *addr);
extern void *vmap(struct page **pages, unsigned int count,
unsigned long flags, pgprot_t prot);
extern void vunmap(void *addr);
-
-/*
- * Lowlevel-APIs (not for driver use!)
+
+/**
+ * get_vm_area - reserve a contingous kernel virtual area
+ *
+ * @size: size of the area
+ * @flags: %VM_IOREMAP for I/O mappings or VM_ALLOC
+ *
+ * Search an area of @size in the kernel virtual mapping area,
+ * and reserved it for out purposes. Returns the area descriptor
+ * on success or %NULL on failure.
*/
-extern struct vm_struct *get_vm_area(unsigned long size, unsigned long flags);
extern struct vm_struct *__get_vm_area(unsigned long size, unsigned long flags,
- unsigned long start, unsigned long end);
+ unsigned long start, unsigned long end, int node);
+#define get_vm_area(__size, __flags) __get_vm_area((__size), (__flags), VMALLOC_START, \
+ VMALLOC_END, -1)
+#define get_vm_area_node(__size, __flags, __node) __get_vm_area((__size), (__flags), \
+ VMALLOC_START, VMALLOC_END, __node)
extern struct vm_struct *remove_vm_area(void *addr);
extern struct vm_struct *__remove_vm_area(void *addr);
extern int map_vm_area(struct vm_struct *area, pgprot_t prot,
Index: linux-2.6.14-rc2/mm/vmalloc.c
===================================================================
--- linux-2.6.14-rc2.orig/mm/vmalloc.c 2005-09-19 20:00:41.000000000 -0700
+++ linux-2.6.14-rc2/mm/vmalloc.c 2005-09-23 10:43:07.000000000 -0700
@@ -5,6 +5,7 @@
* Support of BIGMEM added by Gerhard Wichert, Siemens AG, July 1999
* SMP-safe vmalloc/vfree/ioremap, Tigran Aivazian <[email protected]>, May 2000
* Major rework to support vmap/vunmap, Christoph Hellwig, SGI, August 2002
+ * Numa awareness, Christoph Lameter, SGI, June 2005
*/

#include <linux/mm.h>
@@ -159,7 +160,7 @@ int map_vm_area(struct vm_struct *area,
}

struct vm_struct *__get_vm_area(unsigned long size, unsigned long flags,
- unsigned long start, unsigned long end)
+ unsigned long start, unsigned long end, int node)
{
struct vm_struct **p, *tmp, *area;
unsigned long align = 1;
@@ -178,7 +179,7 @@ struct vm_struct *__get_vm_area(unsigned
addr = ALIGN(start, align);
size = PAGE_ALIGN(size);

- area = kmalloc(sizeof(*area), GFP_KERNEL);
+ area = kmalloc_node(sizeof(*area), GFP_KERNEL, node);
if (unlikely(!area))
return NULL;

@@ -231,21 +232,6 @@ out:
return NULL;
}

-/**
- * get_vm_area - reserve a contingous kernel virtual area
- *
- * @size: size of the area
- * @flags: %VM_IOREMAP for I/O mappings or VM_ALLOC
- *
- * Search an area of @size in the kernel virtual mapping area,
- * and reserved it for out purposes. Returns the area descriptor
- * on success or %NULL on failure.
- */
-struct vm_struct *get_vm_area(unsigned long size, unsigned long flags)
-{
- return __get_vm_area(size, flags, VMALLOC_START, VMALLOC_END);
-}
-
/* Caller must hold vmlist_lock */
struct vm_struct *__remove_vm_area(void *addr)
{
@@ -395,7 +381,8 @@ void *vmap(struct page **pages, unsigned

EXPORT_SYMBOL(vmap);

-void *__vmalloc_area(struct vm_struct *area, unsigned int __nocast gfp_mask, pgprot_t prot)
+void *__vmalloc_area(struct vm_struct *area, unsigned int __nocast gfp_mask,
+ pgprot_t prot, int node)
{
struct page **pages;
unsigned int nr_pages, array_size, i;
@@ -406,9 +393,9 @@ void *__vmalloc_area(struct vm_struct *a
area->nr_pages = nr_pages;
/* Please note that the recursion is strictly bounded. */
if (array_size > PAGE_SIZE)
- pages = __vmalloc(array_size, gfp_mask, PAGE_KERNEL);
+ pages = __vmalloc(array_size, gfp_mask, PAGE_KERNEL, node);
else
- pages = kmalloc(array_size, (gfp_mask & ~__GFP_HIGHMEM));
+ pages = kmalloc_node(array_size, (gfp_mask & ~__GFP_HIGHMEM), node);
area->pages = pages;
if (!area->pages) {
remove_vm_area(area->addr);
@@ -418,7 +405,10 @@ void *__vmalloc_area(struct vm_struct *a
memset(area->pages, 0, array_size);

for (i = 0; i < area->nr_pages; i++) {
- area->pages[i] = alloc_page(gfp_mask);
+ if (node < 0)
+ area->pages[i] = alloc_page(gfp_mask);
+ else
+ area->pages[i] = alloc_pages_node(node, gfp_mask, 0);
if (unlikely(!area->pages[i])) {
/* Successfully allocated i pages, free them in __vunmap() */
area->nr_pages = i;
@@ -446,7 +436,7 @@ fail:
* allocator with @gfp_mask flags. Map them into contiguous
* kernel virtual space, using a pagetable protection of @prot.
*/
-void *__vmalloc(unsigned long size, unsigned int __nocast gfp_mask, pgprot_t prot)
+void *__vmalloc(unsigned long size, unsigned int __nocast gfp_mask, pgprot_t prot, int node)
{
struct vm_struct *area;

@@ -454,13 +444,12 @@ void *__vmalloc(unsigned long size, unsi
if (!size || (size >> PAGE_SHIFT) > num_physpages)
return NULL;

- area = get_vm_area(size, VM_ALLOC);
+ area = get_vm_area_node(size, VM_ALLOC, node);
if (!area)
return NULL;

- return __vmalloc_area(area, gfp_mask, prot);
+ return __vmalloc_area(area, gfp_mask, prot, node);
}
-
EXPORT_SYMBOL(__vmalloc);

/**
@@ -476,11 +465,30 @@ EXPORT_SYMBOL(__vmalloc);
*/
void *vmalloc(unsigned long size)
{
- return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL);
+ return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL, -1);
}

EXPORT_SYMBOL(vmalloc);

+/**
+ * vmalloc_node - allocate memory on a specific node
+ *
+ * @size: allocation size
+ * @node; numa node
+ *
+ * Allocate enough pages to cover @size from the page level
+ * allocator and map them into contiguous kernel virtual space.
+ *
+ * For tight cotrol over page level allocator and protection flags
+ * use __vmalloc() instead.
+ */
+void *vmalloc_node(unsigned long size, int node)
+{
+ return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL, node);
+}
+
+EXPORT_SYMBOL(vmalloc_node);
+
#ifndef PAGE_KERNEL_EXEC
# define PAGE_KERNEL_EXEC PAGE_KERNEL
#endif
@@ -500,7 +508,7 @@ EXPORT_SYMBOL(vmalloc);

void *vmalloc_exec(unsigned long size)
{
- return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL_EXEC);
+ return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL_EXEC, -1);
}

/**
@@ -513,7 +521,7 @@ void *vmalloc_exec(unsigned long size)
*/
void *vmalloc_32(unsigned long size)
{
- return __vmalloc(size, GFP_KERNEL, PAGE_KERNEL);
+ return __vmalloc(size, GFP_KERNEL, PAGE_KERNEL, -1);
}

EXPORT_SYMBOL(vmalloc_32);
Index: linux-2.6.14-rc2/fs/xfs/linux-2.6/kmem.c
===================================================================
--- linux-2.6.14-rc2.orig/fs/xfs/linux-2.6/kmem.c 2005-09-19 20:00:41.000000000 -0700
+++ linux-2.6.14-rc2/fs/xfs/linux-2.6/kmem.c 2005-09-23 10:17:20.000000000 -0700
@@ -55,7 +55,7 @@ kmem_alloc(size_t size, unsigned int __n
if (size < MAX_SLAB_SIZE || retries > MAX_VMALLOCS)
ptr = kmalloc(size, lflags);
else
- ptr = __vmalloc(size, lflags, PAGE_KERNEL);
+ ptr = __vmalloc(size, lflags, PAGE_KERNEL, -1);
if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
return ptr;
if (!(++retries % 100))
Index: linux-2.6.14-rc2/mm/page_alloc.c
===================================================================
--- linux-2.6.14-rc2.orig/mm/page_alloc.c 2005-09-19 20:00:41.000000000 -0700
+++ linux-2.6.14-rc2/mm/page_alloc.c 2005-09-23 10:17:20.000000000 -0700
@@ -2542,7 +2542,7 @@ void *__init alloc_large_system_hash(con
if (flags & HASH_EARLY)
table = alloc_bootmem(size);
else if (hashdist)
- table = __vmalloc(size, GFP_ATOMIC, PAGE_KERNEL);
+ table = __vmalloc(size, GFP_ATOMIC, PAGE_KERNEL, -1);
else {
unsigned long order;
for (order = 0; ((1UL << order) << PAGE_SHIFT) < size; order++)
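
For illustration only, here is a minimal sketch (not part of the patch, and
not the actual ip_tables code) of how a per-CPU, NUMA-aware table could be
allocated with the new helper; the table_entries array, the size argument and
the cleanup path are invented for the example:

#include <linux/vmalloc.h>
#include <linux/cpumask.h>
#include <linux/topology.h>
#include <linux/errno.h>

static void *table_entries[NR_CPUS];

static int alloc_percpu_tables(unsigned long table_size)
{
        int cpu;

        for (cpu = 0; cpu < NR_CPUS; cpu++) {
                if (!cpu_possible(cpu))
                        continue;
                /* place each CPU's private copy on that CPU's home node */
                table_entries[cpu] = vmalloc_node(table_size,
                                                  cpu_to_node(cpu));
                if (!table_entries[cpu])
                        goto out_free;
        }
        return 0;

out_free:
        for (cpu = 0; cpu < NR_CPUS; cpu++) {
                if (table_entries[cpu])
                        vfree(table_entries[cpu]);
        }
        return -ENOMEM;
}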

2005-09-23 17:48:01

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance

Harald Welte a écrit :
> On Thu, Sep 22, 2005 at 05:50:49PM +0200, Eric Dumazet wrote:
>
>>Christoph Lameter a écrit :
>>
>>>It should really be do_set_mempolicy instead to be cleaner. I got a patch here that fixes the
>>>policy layer.
>>>But still I agree with Christoph that a real vmalloc_node is better. There will be no fuzzing
>>>around with memory policies etc and its certainly better performance wise.
>>
>>vmalloc_node() should be seldom used, at driver init, or when a new
>>ip_tables is loaded. If it happens to be a performance problem, then
>>we can optimize it. Why should we spend days of work for a function
>>that is yet to be used ?
>
>
> I see a contradiction in your sentence. "a new ip_tables is loaded"
> every time a user changes a single rule. There are numerous setups that
> dynamically change the ruleset (e.g. at interface up/down point, or even
> think of your typical wlan hotspot, where once a user is authorized,
> he'll get different rules.
>

But a user changing a single rule usually runs (fork()/exec()) a program
called iptables. The underlying cost of all this, plus copying the rules to
user space so that iptables can change them and reload them into the kernel,
is far greater than a hypothetical vmalloc_node() performance problem.

Eric

2005-09-23 18:01:28

by Kyle Moffett

[permalink] [raw]
Subject: Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance

On Sep 23, 2005, at 13:47:53, Eric Dumazet wrote:
> Harald Welte a écrit :
>> I see a contradiction in your sentence. "a new ip_tables is
>> loaded" every time a user changes a single rule. There are
>> numerous setups that dynamically change the ruleset (e.g. at
>> interface up/down point, or even think of your typical wlan
>> hotspot, where once a user is authorized, he'll get different rules.
>
> But a user changing a single rule usually calls (fork()/exec()) a
> program called iptables. The underlying cost of all this, plus
> copying the rules to user space, so that iptables change them and
> reload them in the kernel is far more important than an
> hypothetical vmalloc_node() performance problem.

Yeah, if you're really worried about the cost of iptables
manipulations, you should probably write your own happy little C
program to atomically load, update, and store the rules. Even then,
the cost of copying the whole ruleset to userspace for modification
is probably greater than that of memory allocation issues, especially
if the ruleset is large enough that memory allocation issues cause
problems :-D
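
To make that copy cost concrete, here is a rough userspace sketch of the
"load" half (assumptions: the IPT_SO_GET_* socket options from
linux/netfilter_ipv4/ip_tables.h, running as root; the write-back via
IPT_SO_SET_REPLACE is deliberately omitted):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/netfilter_ipv4/ip_tables.h>

int main(void)
{
        struct ipt_getinfo info;
        struct ipt_get_entries *entries;
        socklen_t len;
        int s = socket(AF_INET, SOCK_RAW, IPPROTO_RAW);

        if (s < 0)
                return 1;

        memset(&info, 0, sizeof(info));
        strcpy(info.name, "filter");
        len = sizeof(info);
        if (getsockopt(s, IPPROTO_IP, IPT_SO_GET_INFO, &info, &len) < 0)
                return 1;

        /* the whole ruleset is copied to user space as one blob */
        entries = malloc(sizeof(*entries) + info.size);
        if (!entries)
                return 1;
        memset(entries, 0, sizeof(*entries));
        strcpy(entries->name, "filter");
        entries->size = info.size;
        len = sizeof(*entries) + info.size;
        if (getsockopt(s, IPPROTO_IP, IPT_SO_GET_ENTRIES, entries, &len) < 0)
                return 1;

        printf("%u rules, %u bytes copied\n", info.num_entries, info.size);
        free(entries);
        return 0;
}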

Cheers,
Kyle Moffett

-----BEGIN GEEK CODE BLOCK-----
Version: 3.12
GCM/CS/IT/U d- s++: a18 C++++>$ UB/L/X/*++++(+)>$ P+++(++++)>$ L++++(+
++) E W++(+) N+++(++) o? K? w--- O? M++ V? PS+() PE+(-) Y+ PGP+++ t+(+
++) 5 X R? tv-(--) b++++(++) DI+ D+ G e->++++$ h!*()>++$ r !y?(-)
------END GEEK CODE BLOCK------


2005-09-23 18:05:09

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance

On Fri, 2005-09-23 at 10:44 -0700, Christoph Lameter wrote:
> for (i = 0; i < area->nr_pages; i++) {
> - area->pages[i] = alloc_page(gfp_mask);
> + if (node < 0)
> + area->pages[i] = alloc_page(gfp_mask);
> + else
> + area->pages[i] = alloc_pages_node(node, gfp_mask, 0);
> if (unlikely(!area->pages[i])) {
> /* Successfully allocated i pages, free them in __vunmap() */
> area->nr_pages = i;
...
> void *vmalloc_exec(unsigned long size)
> {
> - return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL_EXEC);
> + return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL_EXEC, -1);
> }

Instead of hard-coding all of those -1's for the node to specify a
default allocation, and changing all of those callers, why not:

void *__vmalloc_node(unsigned long size, unsigned int __nocast gfp_mask,
                     pgprot_t prot, int node)
{
        ... existing vmalloc code here
}

void *__vmalloc(unsigned long size, unsigned int __nocast gfp_mask,
                pgprot_t prot)
{
        return __vmalloc_node(size, gfp_mask, prot, -1);
}

A named macro is probably better than -1, but if it is only used in one
place, it is hard to complain.

-- Dave

2005-09-26 17:59:03

by Christoph Lameter

[permalink] [raw]
Subject: vmalloc_node

On Fri, 23 Sep 2005, Dave Hansen wrote:

> Instead of hard-coding all of those -1's for the node to specify a
> default allocation, and changing all of those callers, why not:

Done.

> __vmalloc_node(size, gfp_mask, prot, -1);
> A named macro is probably better than -1, but if it is only used in one
> place, it is hard to complain.

-1 is used consistently in the *_node functions to indicate that the node
is not specified. Should I replace -1 throughout the kernel with a
constant?

Here is the updated vmalloc_node patch:

--

This patch adds

vmalloc_node(size, node) -> Allocate necessary memory on the specified node

and

get_vm_area_node(size, flags, node)

and the other functions that it depends on.

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.14-rc2/include/linux/vmalloc.h
===================================================================
--- linux-2.6.14-rc2.orig/include/linux/vmalloc.h 2005-09-19 20:00:41.000000000 -0700
+++ linux-2.6.14-rc2/include/linux/vmalloc.h 2005-09-26 10:40:57.000000000 -0700
@@ -32,22 +32,37 @@ struct vm_struct {
* Highlevel APIs for driver use
*/
extern void *vmalloc(unsigned long size);
+extern void *vmalloc_node(unsigned long size, int node);
extern void *vmalloc_exec(unsigned long size);
extern void *vmalloc_32(unsigned long size);
-extern void *__vmalloc(unsigned long size, unsigned int __nocast gfp_mask, pgprot_t prot);
-extern void *__vmalloc_area(struct vm_struct *area, unsigned int __nocast gfp_mask, pgprot_t prot);
-extern void vfree(void *addr);
+extern void *__vmalloc(unsigned long size, unsigned int __nocast gfp_mask,
+ pgprot_t prot);
+extern void *__vmalloc_node(unsigned long size, unsigned int __nocast gfp_mask,
+ pgprot_t prot, int node);
+extern void *__vmalloc_area(struct vm_struct *area, unsigned int __nocast gfp_mask,
+ pgprot_t prot, int node);

+extern void vfree(void *addr);
extern void *vmap(struct page **pages, unsigned int count,
unsigned long flags, pgprot_t prot);
extern void vunmap(void *addr);
-
-/*
- * Lowlevel-APIs (not for driver use!)
+
+/**
+ * get_vm_area - reserve a contingous kernel virtual area
+ *
+ * @size: size of the area
+ * @flags: %VM_IOREMAP for I/O mappings or VM_ALLOC
+ *
+ * Search an area of @size in the kernel virtual mapping area,
+ * and reserved it for out purposes. Returns the area descriptor
+ * on success or %NULL on failure.
*/
-extern struct vm_struct *get_vm_area(unsigned long size, unsigned long flags);
extern struct vm_struct *__get_vm_area(unsigned long size, unsigned long flags,
- unsigned long start, unsigned long end);
+ unsigned long start, unsigned long end, int node);
+#define get_vm_area(__size, __flags) __get_vm_area((__size), (__flags), VMALLOC_START, \
+ VMALLOC_END, -1)
+#define get_vm_area_node(__size, __flags, __node) __get_vm_area((__size), (__flags), \
+ VMALLOC_START, VMALLOC_END, __node)
extern struct vm_struct *remove_vm_area(void *addr);
extern struct vm_struct *__remove_vm_area(void *addr);
extern int map_vm_area(struct vm_struct *area, pgprot_t prot,
Index: linux-2.6.14-rc2/mm/vmalloc.c
===================================================================
--- linux-2.6.14-rc2.orig/mm/vmalloc.c 2005-09-19 20:00:41.000000000 -0700
+++ linux-2.6.14-rc2/mm/vmalloc.c 2005-09-26 10:46:14.000000000 -0700
@@ -5,6 +5,7 @@
* Support of BIGMEM added by Gerhard Wichert, Siemens AG, July 1999
* SMP-safe vmalloc/vfree/ioremap, Tigran Aivazian <[email protected]>, May 2000
* Major rework to support vmap/vunmap, Christoph Hellwig, SGI, August 2002
+ * Numa awareness, Christoph Lameter, SGI, June 2005
*/

#include <linux/mm.h>
@@ -159,7 +160,7 @@ int map_vm_area(struct vm_struct *area,
}

struct vm_struct *__get_vm_area(unsigned long size, unsigned long flags,
- unsigned long start, unsigned long end)
+ unsigned long start, unsigned long end, int node)
{
struct vm_struct **p, *tmp, *area;
unsigned long align = 1;
@@ -178,7 +179,7 @@ struct vm_struct *__get_vm_area(unsigned
addr = ALIGN(start, align);
size = PAGE_ALIGN(size);

- area = kmalloc(sizeof(*area), GFP_KERNEL);
+ area = kmalloc_node(sizeof(*area), GFP_KERNEL, node);
if (unlikely(!area))
return NULL;

@@ -231,21 +232,6 @@ out:
return NULL;
}

-/**
- * get_vm_area - reserve a contingous kernel virtual area
- *
- * @size: size of the area
- * @flags: %VM_IOREMAP for I/O mappings or VM_ALLOC
- *
- * Search an area of @size in the kernel virtual mapping area,
- * and reserved it for out purposes. Returns the area descriptor
- * on success or %NULL on failure.
- */
-struct vm_struct *get_vm_area(unsigned long size, unsigned long flags)
-{
- return __get_vm_area(size, flags, VMALLOC_START, VMALLOC_END);
-}
-
/* Caller must hold vmlist_lock */
struct vm_struct *__remove_vm_area(void *addr)
{
@@ -395,7 +381,8 @@ void *vmap(struct page **pages, unsigned

EXPORT_SYMBOL(vmap);

-void *__vmalloc_area(struct vm_struct *area, unsigned int __nocast gfp_mask, pgprot_t prot)
+void *__vmalloc_area(struct vm_struct *area, unsigned int __nocast gfp_mask,
+ pgprot_t prot, int node)
{
struct page **pages;
unsigned int nr_pages, array_size, i;
@@ -406,9 +393,9 @@ void *__vmalloc_area(struct vm_struct *a
area->nr_pages = nr_pages;
/* Please note that the recursion is strictly bounded. */
if (array_size > PAGE_SIZE)
- pages = __vmalloc(array_size, gfp_mask, PAGE_KERNEL);
+ pages = __vmalloc_node(array_size, gfp_mask, PAGE_KERNEL, node);
else
- pages = kmalloc(array_size, (gfp_mask & ~__GFP_HIGHMEM));
+ pages = kmalloc_node(array_size, (gfp_mask & ~__GFP_HIGHMEM), node);
area->pages = pages;
if (!area->pages) {
remove_vm_area(area->addr);
@@ -418,7 +405,10 @@ void *__vmalloc_area(struct vm_struct *a
memset(area->pages, 0, array_size);

for (i = 0; i < area->nr_pages; i++) {
- area->pages[i] = alloc_page(gfp_mask);
+ if (node < 0)
+ area->pages[i] = alloc_page(gfp_mask);
+ else
+ area->pages[i] = alloc_pages_node(node, gfp_mask, 0);
if (unlikely(!area->pages[i])) {
/* Successfully allocated i pages, free them in __vunmap() */
area->nr_pages = i;
@@ -436,17 +426,18 @@ fail:
}

/**
- * __vmalloc - allocate virtually contiguous memory
+ * __vmalloc_node - allocate virtually contiguous memory
*
* @size: allocation size
* @gfp_mask: flags for the page level allocator
* @prot: protection mask for the allocated pages
+ * @node node to use for allocation or -1
*
* Allocate enough pages to cover @size from the page level
* allocator with @gfp_mask flags. Map them into contiguous
* kernel virtual space, using a pagetable protection of @prot.
*/
-void *__vmalloc(unsigned long size, unsigned int __nocast gfp_mask, pgprot_t prot)
+void *__vmalloc_node(unsigned long size, unsigned int __nocast gfp_mask, pgprot_t prot, int node)
{
struct vm_struct *area;

@@ -454,13 +445,18 @@ void *__vmalloc(unsigned long size, unsi
if (!size || (size >> PAGE_SHIFT) > num_physpages)
return NULL;

- area = get_vm_area(size, VM_ALLOC);
+ area = get_vm_area_node(size, VM_ALLOC, node);
if (!area)
return NULL;

- return __vmalloc_area(area, gfp_mask, prot);
+ return __vmalloc_area(area, gfp_mask, prot, node);
}
+EXPORT_SYMBOL(__vmalloc_node);

+void *__vmalloc(unsigned long size, unsigned int __nocast gfp_mask, pgprot_t prot)
+{
+ return __vmalloc_node(size, gfp_mask, prot, -1);
+}
EXPORT_SYMBOL(__vmalloc);

/**
@@ -481,6 +477,25 @@ void *vmalloc(unsigned long size)

EXPORT_SYMBOL(vmalloc);

+/**
+ * vmalloc_node - allocate memory on a specific node
+ *
+ * @size: allocation size
+ * @node; numa node
+ *
+ * Allocate enough pages to cover @size from the page level
+ * allocator and map them into contiguous kernel virtual space.
+ *
+ * For tight cotrol over page level allocator and protection flags
+ * use __vmalloc() instead.
+ */
+void *vmalloc_node(unsigned long size, int node)
+{
+ return __vmalloc_node(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL, node);
+}
+
+EXPORT_SYMBOL(vmalloc_node);
+
#ifndef PAGE_KERNEL_EXEC
# define PAGE_KERNEL_EXEC PAGE_KERNEL
#endif

2005-09-26 18:10:47

by Dave Hansen

[permalink] [raw]
Subject: Re: vmalloc_node

On Mon, 2005-09-26 at 10:58 -0700, Christoph Lameter wrote:
> On Fri, 23 Sep 2005, Dave Hansen wrote:
> > Instead of hard-coding all of those -1's for the node to specify a
> > default allocation, and changing all of those callers, why not:
>
> Done.

That looks much nicer. Thanks!

> > __vmalloc_node(size, gfp_mask, prot, -1);
> > A named macro is probably better than -1, but if it is only used in one
> > place, it is hard to complain.
>
> -1 is used consistently in the *_node functions to indicate that the node
> is not specified. Should I replace -1 throughout the kernel with a
> constant?

I certainly wouldn't mind. Giving it a name like NODE_ANY or
NODE_UNSPECIFIED would keep anyone from having to go dig into
the allocator functions to figure out what it actually means.
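
For instance (the name below is purely illustrative, not an existing kernel
constant):

/* hypothetical name for "no node preference" in the *_node helpers */
#define NODE_ANY        (-1)

void *__vmalloc(unsigned long size, unsigned int __nocast gfp_mask,
                pgprot_t prot)
{
        return __vmalloc_node(size, gfp_mask, prot, NODE_ANY);
}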

-- Dave

2005-09-27 16:23:28

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance

On Friday 23 September 2005 19:09, Harald Welte wrote:
> On Thu, Sep 22, 2005 at 03:03:21PM +0200, Andi Kleen wrote:
> > > 1) No more central rwlock protecting each table (filter, nat, mangle,
> > > raw), but one lock per CPU. It avoids cache line ping pongs for each
> > > packet.
> >
> > Another useful change would be to not take the lock when there are no
> > rules. Currently just loading iptables has a large overhead.
>
> This is partially due to the netfilter hooks that are registered (so we
> always take nf_hook_slow() in the NF_HOOK() macro).

Not sure it's that. nf_hook_slow uses RCU, so it should be quite
fast.

> The default policies inside an iptables chain are internally implemented
> as a rule. Thus, policies as built-in rules have packet/byte counters.

That could be special cased and done lockless, with the counting
done per CPU.
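
A minimal sketch of what such per-CPU, lockless counting could look like
(the structure and function names are invented here, this is not existing
netfilter code):

#include <linux/percpu.h>
#include <linux/cpumask.h>
#include <linux/types.h>

struct policy_counter {
        u64 pcnt;
        u64 bcnt;
};

static DEFINE_PER_CPU(struct policy_counter, policy_counters);

/* Called on the packet path (softirq context): only this CPU touches its
 * own copy, so neither a lock nor an atomic operation is needed. */
static inline void policy_account(unsigned int len)
{
        struct policy_counter *c = &__get_cpu_var(policy_counters);

        c->pcnt++;
        c->bcnt += len;
}

/* A reader sums all CPUs' copies; the result is only approximately
 * coherent, which is fine for accounting. */
static u64 policy_packets(void)
{
        u64 sum = 0;
        int cpu;

        for (cpu = 0; cpu < NR_CPUS; cpu++)
                if (cpu_possible(cpu))
                        sum += per_cpu(policy_counters, cpu).pcnt;
        return sum;
}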

-Andi

2005-09-28 00:25:47

by Henrik Nordstrom

[permalink] [raw]
Subject: Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance

On Tue, 27 Sep 2005, Andi Kleen wrote:

> That could be special cased and done lockless, with the counting
> done per CPU.

It's also not very hard for iptables, when verifying the table, to conclude
that there really aren't any "real" rules for a certain hook and then
delete that hook registration (only the policy ACCEPT rule is found). That
would allow you to have as many ip_tables modules as you like in the kernel,
but only use the hooks where you have rules. The drawback is that you lose
the packet counters on the policy.

Exception: iptable_nat. It needs the hooks for other purposes as well, not
just the iptable, so here the hooks cannot be deactivated when there are no
rules.

Regards
Henrik

2005-09-28 08:32:55

by Harald Welte

[permalink] [raw]
Subject: Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance

On Wed, Sep 28, 2005 at 02:25:16AM +0200, Henrik Nordstrom wrote:

> >That could be special cased and done lockless, with the counting
> >done per CPU.
>
> It's also not very hard for iptables when verifying the table to
> conclude that there really isn't any "real" rules for a certain hook

The detection should be quite straightforward, yes.

> Allowing you to have as many ip tables modules you like in the kernel,
> but only using the hooks where you have rules.

I totally agree that, from a current perspective, the concept of
just loading a module (that has usage count 0) having a severe impact on
system performance is just wrong. But then, users have been used to the
current behaviour for almost five years now. Every decent script and/or
piece of documentation should by now refer to the fact that loading a
module implicitly registers it at the hooks, and that you should only load
those modules that you really need.

In an ideal world, you would load iptables, and userspace would
configure a couple of chains, and then explicitly attach those chains to
the netfilter hooks.

Therefore: Let's do this right next time, but live with that fact for
now.

> Drawback is that you loose the packet counters on the policy.

That's the big problem. People rely on that fact, because they're used
to it. If we introduce some new tool, or something that somehow works
differently, I don't have a problem with that. But silently changing the
semantics with a kernel upgrade is not something that I'm willing to do
on purpose.

Just imagine all those poor sysadmins who know nothing about current
kernel development, and who upgrade their kernel because their
distributor provides a new one - suddenly their accounting (which might
be relevant for their business) doesn't work anymore :(

> Exception: iptable_nat. Needs the hooks for other purposes as well,
> not just the iptable so here the hooks can not be deactivated when
> there is no rules.

yes, even though we could make the ip_nat / iptable_nat split (that is
introduced in 2.6.14) in a way that would make ip_nat.ko register those
hooks that are implicitly needed, and iptable_nat only the user-visible
chain-related hooks.

--
- Harald Welte <[email protected]> http://netfilter.org/
============================================================================
"Fragmentation is like classful addressing -- an interesting early
architectural error that shows how much experimentation was going
on while IP was being designed." -- Paul Vixie


2005-09-28 08:37:05

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance

On Wednesday 28 September 2005 10:32, Harald Welte wrote:

> I totally agree, that from a current perspective, I think the concept of
> just loading a module (that has usage count 0) having severe impact on
> system performance is just wrong. But then, users are used to the
> current behaviour for almost five years now.

That doesn't mean it cannot be improved - and I think it should.

In a sense it's even getting worse: for example, losing the CONFIG
option to disable local conntrack (Patrick disabled it some time ago
without even a comment on why he did it) has a really bad impact in some cases.

> Therefore: Let's do this right next time, but live with that fact for
> now.

Even with a "quite straight-forward" (quoting you) fix?

> Just imagine all those poor sysadmins who know nothing about current
> kernel development, and who upgrade their kernel because their
> distributor provides a new one - suddenly their accounting (which might
> be relevant for their business) doesn't work anymore :(

Accounting with per CPU counters can be done fully lockless.

-Andi

2005-09-28 10:34:42

by Henrik Nordstrom

[permalink] [raw]
Subject: Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance

On Wed, 28 Sep 2005, Harald Welte wrote:

> Just imagine all those poor sysadmins who know nothing about current
> kernel development, and who upgrade their kernel because their
> distributor provides a new one - suddenly their accounting (which might
> be relevant for their business) doesn't work anymore :(

Do you seriously expect anyone to be using the counters on the policy rule
in otherwise completely empty rulesets? These counters are also available
elsewhere (route and interface statistics), and more intuitively so.

> yes, even though we could make the ip_nat / iptable_nat split (that is
> introduced in 2.6.14) in a way that would make ip_nat.ko register those
> hooks that are implicitly needed, and iptable_nat only the user-visible
> chain-related hooks.

Which adds additional overhead, so this approach is a bit
counter-productive I would say.

Regards
Henrik

2005-10-05 16:37:48

by Patrick McHardy

[permalink] [raw]
Subject: Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance

Andi Kleen wrote:
> In a sense it's even getting worse: For example us losing the CONFIG
> option to disable local conntrack (Patrick has disabled it some time ago
> without even a comment why he did it) has a really bad impact in some cases.

It was necessary to correctly handle locally generated ICMP errors.

2005-10-05 16:57:29

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance

On Tuesday 04 October 2005 19:01, Patrick McHardy wrote:
> Andi Kleen wrote:
> > In a sense it's even getting worse: For example us losing the CONFIG
> > option to disable local conntrack (Patrick has disabled it some time ago
> > without even a comment why he did it) has a really bad impact in some
> > cases.
>
> It was necessary to correctly handle locally generated ICMP errors.

Well you most likely wrecked local performance then when it's enabled.

-Andi

2005-10-06 17:30:56

by Harald Welte

[permalink] [raw]
Subject: Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance

On Wed, Oct 05, 2005 at 06:53:31PM +0200, Andi Kleen wrote:
> On Tuesday 04 October 2005 19:01, Patrick McHardy wrote:
> > Andi Kleen wrote:
> > > In a sense it's even getting worse: For example us losing the CONFIG
> > > option to disable local conntrack (Patrick has disabled it some time ago
> > > without even a comment why he did it) has a really bad impact in some
> > > cases.
> >
> > It was necessary to correctly handle locally generated ICMP errors.
>
> Well you most likely wrecked local performance then when it's enabled.

so you would favour a system that incorrectly deals with ICMP errors but
has higher performance?

--
- Harald Welte <[email protected]> http://netfilter.org/
============================================================================
"Fragmentation is like classful addressing -- an interesting early
architectural error that shows how much experimentation was going
on while IP was being designed." -- Paul Vixie


2005-10-06 18:00:04

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance

On Fri, Oct 07, 2005 at 04:38:02AM +0200, Harald Welte wrote:
> On Wed, Oct 05, 2005 at 06:53:31PM +0200, Andi Kleen wrote:
> > On Tuesday 04 October 2005 19:01, Patrick McHardy wrote:
> > > Andi Kleen wrote:
> > > > In a sense it's even getting worse: For example us losing the CONFIG
> > > > option to disable local conntrack (Patrick has disabled it some time ago
> > > > without even a comment why he did it) has a really bad impact in some
> > > > cases.
> > >
> > > It was necessary to correctly handle locally generated ICMP errors.
> >
> > Well you most likely wrecked local performance then when it's enabled.
>
> so you would favour a system that incorrectly deals with ICMP errors but
> has higher performance?

I would favour a system where development doesn't lose sight of performance.
Perhaps there would be other ways to fix this problem without impacting
performance unduly? Can you describe it in detail?

-Andi

2005-10-07 17:10:27

by Patrick McHardy

[permalink] [raw]
Subject: Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance

Andi Kleen wrote:
> On Fri, Oct 07, 2005 at 04:38:02AM +0200, Harald Welte wrote:
>
>>On Wed, Oct 05, 2005 at 06:53:31PM +0200, Andi Kleen wrote:
>>
>>>Well you most likely wrecked local performance then when it's enabled.

There are lots of other hooks, and conntrack/NAT already have quite a
large negative influence on performance. Do you have numbers
that show that enabling this actually causes more than a slight
decrease in performance? Besides, most distributors enable all
these options anyway, so it only makes a difference for a small
group of users.

>>so you would favour a system that incorrectly deals with ICMP errors but
>>has higher performance?
>
> I would favour a system where development doesn't lose sight of performance.

I don't think we do.

> Perhaps there would be other ways to fix this problem without impacting
> performance unduly? Can you describe it in detail?

When an ICMP error is sent by the firewall itself, the inner
packet needs to be restored to its original state. That means
both DNAT and SNAT, which might have been applied, need to be
reversed. DNAT is reversed at places where we usually do
SNAT (POST_ROUTING), and SNAT is reversed where DNAT is usually
done (PRE_ROUTING/LOCAL_OUT). Since locally generated packets
never go through PRE_ROUTING, it is done in LOCAL_OUT, which
required enabling NAT in LOCAL_OUT unconditionally. It might
be possible to move this to some different hook, but I didn't
investigate it.
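
For readers not familiar with the hook layout, here is a rough sketch of
what "enabling NAT in LOCAL_OUT unconditionally" means in code. This is
illustrative only, not the actual ip_nat implementation; the function and
variable names are made up:

#include <linux/module.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <linux/skbuff.h>

/* Registered at NF_IP_LOCAL_OUT, so it sees every locally generated
 * packet, whether or not any NAT rules exist. */
static unsigned int nat_out_sketch(unsigned int hooknum,
                                   struct sk_buff **pskb,
                                   const struct net_device *in,
                                   const struct net_device *out,
                                   int (*okfn)(struct sk_buff *))
{
        /* here the embedded packet of an outgoing ICMP error could be
         * un-NATed before the error leaves the box */
        return NF_ACCEPT;
}

static struct nf_hook_ops nat_out_ops = {
        .hook     = nat_out_sketch,
        .owner    = THIS_MODULE,
        .pf       = PF_INET,
        .hooknum  = NF_IP_LOCAL_OUT,
        .priority = NF_IP_PRI_NAT_DST,
};

/* registration, e.g. from module init: nf_register_hook(&nat_out_ops); */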

2005-10-07 17:19:59

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance

On Friday 07 October 2005 19:08, Patrick McHardy wrote:

> There are lots of other hooks and conntrack/NAT already have a
> quite large negative influence on performance. Do you have numbers
> that show that enabling this actually causes more than a slight
> decrease in performance? Besides, most distributors enable all
> these options anyway, so it only makes a difference for a small
> group of users.

I don't know about other distributions, but SUSE at some point
found that some web benchmarks dramatically improved in the default
configuration when local conntrack was off. It has been off ever since.


> > Perhaps there would be other ways to fix this problem without impacting
> > performance unduly? Can you describe it in detail?
>
> When an ICMP error is send by the firewall itself, the inner
> packet needs to be restored to its original state. That means
> both DNAT and SNAT which might have been applied need to be
> reversed. DNAT is reversed at places where we usually do
> SNAT (POST_ROUTING), SNAT is reversed where usually DNAT is
> done (PRE_ROUTING/LOCAL_OUT). Since locally generated packets
> never go through PRE_ROUTING, it is done in LOCAL_OUT, which
> required enabling NAT in LOCAL_OUT unconditionally. It might
> be possible to move this to some different hook, I didn't
> investigate it.

This sounds wrong anyway. You shouldn't be touching conntrack state for ICMPs
generated by routers, because they can be temporary errors (e.g. during a
routing flap when the route moves). The only safe way to handle this is to wait
for the timeout, which doesn't need local handling. And the firewall cannot be
an end host here.

-Andi

2005-10-07 18:00:29

by Patrick McHardy

[permalink] [raw]
Subject: Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance

Andi Kleen wrote:
> On Friday 07 October 2005 19:08, Patrick McHardy wrote:
>
> I don't know about other distributions but SUSE at some point
> found that some web benchmarks dramatically improved in the default
> configuration when local conntrack was off. It was off then since ever.

Interesting ..

>>When an ICMP error is send by the firewall itself, the inner
>>packet needs to be restored to its original state. That means
>>both DNAT and SNAT which might have been applied need to be
>>reversed. DNAT is reversed at places where we usually do
>>SNAT (POST_ROUTING), SNAT is reversed where usually DNAT is
>>done (PRE_ROUTING/LOCAL_OUT). Since locally generated packets
>>never go through PRE_ROUTING, it is done in LOCAL_OUT, which
>>required enabling NAT in LOCAL_OUT unconditionally. It might
>>be possible to move this to some different hook, I didn't
>>investigate it.
>
> This sounds wrong anyways. You shouldn't be touching conntrack state for ICMPs
> generated by routers because they can be temporary errors (e.g. during a
> routing flap when the route moves). Only safe way to handle this is to wait
> for the timeout which doesn't need local handling. And the firewall cannot be
> an endhost here.

You misunderstood me. It's not about conntrack state, it's about
the inner packet contained in the ICMP message. If it was
NATed by the box itself before the ICMP message was generated,
it needs to be restored to its original state, otherwise the
recipient will ignore it.