2007-05-12 16:36:59

by Dave Johnson

Subject: [PATCH] improved locking performance in rt_run_flush()


While testing adding/deleting large numbers of interfaces, I found
rt_run_flush() was the #1 cpu user in a kernel profile by far.

The below patch changes rt_run_flush() to only take each spinlock
protecting the rt_hash_table once instead of taking a spinlock for
every hash table bucket (and ending up taking the same small set
of locks over and over).

Deleting 256 interfaces on a 4-way SMP system with 16K buckets reduced
overall cpu-time more than 50% and reduced wall-time about 33%. I
suspect systems with large amounts of memory (and more buckets) will
see an even greater benefit.

Note there is a small change in that rt_free() is called while the
lock is held where before it was called without the lock held. I
don't think this should be an issue.

Signed-off-by: Dave Johnson <[email protected]>

===== net/ipv4/route.c 1.162 vs edited =====
--- 1.162/net/ipv4/route.c 2007-02-14 11:09:54 -05:00
+++ edited/net/ipv4/route.c 2007-05-04 17:53:33 -04:00
@@ -237,9 +237,17 @@
 		for (i = 0; i < RT_HASH_LOCK_SZ; i++) \
 			spin_lock_init(&rt_hash_locks[i]); \
 	}
+# define rt_hash_lock_for_each(lockaddr) \
+	for (lockaddr = rt_hash_lock_addr(RT_HASH_LOCK_SZ-1); lockaddr >= rt_hash_locks; lockaddr--)
+# define rt_hash_lock_for_each_item(locknum, slot) \
+	for (slot = locknum-rt_hash_locks; slot <= rt_hash_mask; slot += RT_HASH_LOCK_SZ)
 #else
 # define rt_hash_lock_addr(slot) NULL
 # define rt_hash_lock_init()
+# define rt_hash_lock_for_each(lockaddr) \
+	lockaddr = NULL;
+# define rt_hash_lock_for_each_item(locknum, slot) \
+	for (slot = rt_hash_mask; slot >= 0; slot--)
 #endif
 
 static struct rt_hash_bucket *rt_hash_table;
@@ -691,23 +699,26 @@
 static void rt_run_flush(unsigned long dummy)
 {
 	int i;
+	spinlock_t *lockaddr;
 	struct rtable *rth, *next;
 
 	rt_deadline = 0;
 
 	get_random_bytes(&rt_hash_rnd, 4);
 
-	for (i = rt_hash_mask; i >= 0; i--) {
-		spin_lock_bh(rt_hash_lock_addr(i));
-		rth = rt_hash_table[i].chain;
-		if (rth)
-			rt_hash_table[i].chain = NULL;
-		spin_unlock_bh(rt_hash_lock_addr(i));
-
-		for (; rth; rth = next) {
-			next = rth->u.dst.rt_next;
-			rt_free(rth);
+	rt_hash_lock_for_each(lockaddr) {
+		spin_lock_bh(lockaddr);
+		rt_hash_lock_for_each_item(lockaddr, i) {
+			rth = rt_hash_table[i].chain;
+			if (rth) {
+				rt_hash_table[i].chain = NULL;
+				for (; rth; rth = next) {
+					next = rth->u.dst.rt_next;
+					rt_free(rth);
+				}
+			}
 		}
+		spin_unlock_bh(lockaddr);
 	}
 }
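
For context, rt_hash_lock_addr() stripes the buckets over a small table of
RT_HASH_LOCK_SZ spinlocks (bucket i is guarded by lock i & (RT_HASH_LOCK_SZ - 1)),
which is why the old per-bucket loop kept taking the same few locks over and
over. A minimal userspace sketch of that striping, using made-up HASH_SZ and
LOCK_SZ stand-ins rather than anything from route.c, shows that the lock-major
walk the new macros perform still visits every bucket exactly once:

/*
 * Standalone model of the striping (illustration only; HASH_SZ and LOCK_SZ
 * are hypothetical stand-ins for rt_hash_mask + 1 and RT_HASH_LOCK_SZ).
 */
#include <assert.h>
#include <stdio.h>

#define HASH_SZ	16
#define LOCK_SZ	4

int main(void)
{
	int visited[HASH_SZ] = { 0 };
	int lock, slot;

	/* lock-major walk: take each lock once, sweep the buckets it covers */
	for (lock = LOCK_SZ - 1; lock >= 0; lock--)
		for (slot = lock; slot < HASH_SZ; slot += LOCK_SZ)
			visited[slot]++;

	/* every bucket seen exactly once, with only LOCK_SZ lock acquisitions */
	for (slot = 0; slot < HASH_SZ; slot++)
		assert(visited[slot] == 1);

	printf("%d buckets covered with %d lock round-trips\n", HASH_SZ, LOCK_SZ);
	return 0;
}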



2007-05-14 10:04:19

by David Miller

Subject: Re: [PATCH] improved locking performance in rt_run_flush()

From: Dave Johnson <[email protected]>
Date: Sat, 12 May 2007 12:36:47 -0400

>
> While testing adding/deleting large numbers of interfaces, I found
> rt_run_flush() was the #1 cpu user in a kernel profile by far.
>
> The below patch changes rt_run_flush() to only take each spinlock
> protecting the rt_hash_table once instead of taking a spinlock for
> every hash table bucket (and ending up taking the same small set
> of locks over and over).
>
> Deleting 256 interfaces on a 4-way SMP system with 16K buckets reduced
> overall cpu-time more than 50% and reduced wall-time about 33%. I
> suspect systems with large amounts of memory (and more buckets) will
> see an even greater benefit.
>
> Note there is a small change in that rt_free() is called while the
> lock is held where before it was called without the lock held. I
> don't think this should be an issue.
>
> Signed-off-by: Dave Johnson <[email protected]>

Thanks for this patch.

I'm not ignoring it, I'm just trying to brainstorm whether there
is a better way to resolve this inefficiency. :-)

2007-05-20 05:12:19

by Herbert Xu

Subject: Re: [PATCH] improved locking performance in rt_run_flush()

David Miller <[email protected]> wrote:
> From: Dave Johnson <[email protected]>
>>
>> The below patch changes rt_run_flush() to only take each spinlock
>> protecting the rt_hash_table once instead of taking a spinlock for
>> every hash table bucket (and ending up taking the same small set
>> of locks over and over).

...

> I'm not ignoring it, I'm just trying to brainstorm whether there
> is a better way to resolve this inefficiency. :-)

The main problem I see with this is having to walk and free each
chain with the lock held. We could avoid this if we had a pointer
in struct rtable to chain them up for freeing later.

I just checked and struct rtable is 236 bytes long on 32-bit but
the slab cache pads it to 256 bytes so we've got some free space.
I suspect 64-bit should be similar.
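
A rough, untested sketch of that idea, reusing the rt_hash_lock_for_each() and
rt_hash_lock_for_each_item() helpers from the patch above; the rt_flush_next
member is hypothetical and does not exist in struct rtable:

	struct rtable *flush_list = NULL;
	struct rtable *rth;
	spinlock_t *lockaddr;
	int i;

	rt_hash_lock_for_each(lockaddr) {
		spin_lock_bh(lockaddr);
		rt_hash_lock_for_each_item(lockaddr, i) {
			/* thread every entry onto a private list while unhooking the chain */
			for (rth = rt_hash_table[i].chain; rth; rth = rth->u.dst.rt_next) {
				rth->rt_flush_next = flush_list;	/* hypothetical new member */
				flush_list = rth;
			}
			rt_hash_table[i].chain = NULL;
		}
		spin_unlock_bh(lockaddr);
	}

	/* no hash lock held from here on */
	while (flush_list) {
		struct rtable *next = flush_list->rt_flush_next;
		rt_free(flush_list);
		flush_list = next;
	}

rt_free() then runs entirely outside the hash locks, which is the part the
original patch gives up, at the cost of one extra pointer per rtable.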

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2007-05-31 23:26:23

by David Miller

Subject: Re: [PATCH] improved locking performance in rt_run_flush()

From: Herbert Xu <[email protected]>
Date: Sun, 20 May 2007 15:11:48 +1000

> David Miller <[email protected]> wrote:
> > From: Dave Johnson <[email protected]>
> >>
> >> The below patch changes rt_run_flush() to only take each spinlock
> >> protecting the rt_hash_table once instead of taking a spinlock for
> >> every hash table bucket (and ending up taking the same small set
> >> of locks over and over).
>
> ...
>
> > I'm not ignoring it, I'm just trying to brainstorm whether there
> > is a better way to resolve this inefficiency. :-)
>
> The main problem I see with this is having to walk and free each
> chain with the lock held. We could avoid this if we had a pointer
> in struct rtable to chain them up for freeing later.
>
> I just checked and struct rtable is 236 bytes long on 32-bit but
> the slab cache pads it to 256 bytes so we've got some free space.
> I suspect 64-bit should be similar.

SLUB I believe packs more aggressively and won't pad things out like
that. Therefore adding a member to rtable is much less attractive.

I've been considering various alternative ways to deal with this.

For 2.6.22 and -stable's sake we could allocate an array of pointers
of size N where N is the number of rtable hash slots per spinlock.
A big lock wraps around rt_run_flush() to protect these slots, and
then the loop is:

grab_lock();
for_each_hash_chain_for_lock(i) {
	rth = rt_hash_table[i].chain;
	if (rth) {
		rt_hash_table[i].chain = NULL;
		flush_chain[i % N] = rth;
	}
}
drop_lock();

for (i = 0; i < N; i++) {
	struct rtable *rth = flush_chain[i];
	flush_chain[i] = NULL;
	while (rth) {
		struct rtable *next = rth->u.dst.rt_next;
		rt_free(rth);
		rth = next;
	}
}

Holding a lock across the entire hash plucking has its own not-so-nice
properties, but it's better than taking the same lock N times in a row.
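
Filled out a little (untested, and the flush_chain indexing below is an
assumption about what the sketch intends): for hash lock k the buckets it
guards are k, k + RT_HASH_LOCK_SZ, k + 2*RT_HASH_LOCK_SZ, and so on, so bucket
i can land in slot i / RT_HASH_LOCK_SZ. The array is what substitutes for an
extra pointer in struct rtable: each bucket chain already ends in NULL, so the
unhooked chains cannot be spliced together without one.

	/* untested sketch; assumes rt_run_flush() is serialized by the outer "big lock" */
	struct rtable **flush_chain;	/* N entries, kcalloc'ed (zeroed) up front */
	struct rtable *rth, *next;
	int N = (rt_hash_mask + 1) / RT_HASH_LOCK_SZ;
	int i, k;

	for (k = 0; k < RT_HASH_LOCK_SZ; k++) {
		spin_lock_bh(&rt_hash_locks[k]);
		for (i = k; i <= rt_hash_mask; i += RT_HASH_LOCK_SZ) {
			rth = rt_hash_table[i].chain;
			if (rth) {
				rt_hash_table[i].chain = NULL;
				flush_chain[i / RT_HASH_LOCK_SZ] = rth;
			}
		}
		spin_unlock_bh(&rt_hash_locks[k]);

		/* lock k is dropped before any of its chains are freed */
		for (i = 0; i < N; i++) {
			rth = flush_chain[i];
			flush_chain[i] = NULL;
			while (rth) {
				next = rth->u.dst.rt_next;
				rt_free(rth);
				rth = next;
			}
		}
	}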

In the longer term, if I resurrect my dynamically sized rtable hash
patches (which I do intend to do), that code protected a lot of this
stuff with a seqlock and it might be possible to use that seqlock
solely to flush the lists in rt_run_flush().

Any better ideas?