2012-05-25 20:16:49

by Arun Sharma

[permalink] [raw]
Subject: [PATCH] net: compute a more reasonable default ip6_rt_max_size

The algorithm is based on ipv4 and alloc_large_system_hash().

The following data is from a x86_64 box I tested:

128MB
$ cat /proc/sys/net/ipv{4,6}/route/max_size
16384
22444

512MB
$ cat /proc/sys/net/ipv{4,6}/route/max_size
65536
99856

1GB
$ cat /proc/sys/net/ipv{4,6}/route/max_size
524288
203068

2GB
$ cat /proc/sys/net/ipv{4,6}/route/max_size
1048576
524288

4GB
$ cat /proc/sys/net/ipv{4,6}/route/max_size
2097152
524288

Signed-off-by: Arun Sharma <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: David Miller <[email protected]>
---
net/ipv6/route.c | 21 ++++++++++++++++++++-
1 files changed, 20 insertions(+), 1 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 49d6ce1..c89ebbb 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2827,6 +2827,16 @@ struct ctl_table * __net_init ipv6_route_sysctl_init(struct net *net)
}
#endif

+static __initdata unsigned long ip6_rt_entries;
+static int __init set_rt_entries(char *str)
+{
+ if (!str)
+ return 0;
+ ip6_rt_entries = simple_strtoul(str, &str, 0);
+ return 1;
+}
+__setup("ip6_rt_entries=", set_rt_entries);
+
static int __net_init ip6_route_net_init(struct net *net)
{
int ret = -ENOMEM;
@@ -2872,8 +2882,17 @@ static int __net_init ip6_route_net_init(struct net *net)
ip6_template_metrics, true);
#endif

+ /* Compute a reasonable default based on what we do for ipv4
+ * total size = 1/16th of total RAM
+ * No more than 512k entries unless overridden on kernel cmdline */
+ if (ip6_rt_entries == 0) {
+ ip6_rt_entries = (totalram_pages << PAGE_SHIFT) >> 4;
+ ip6_rt_entries /= sizeof(struct rt6_info);
+ ip6_rt_entries = min(512 * 1024UL, ip6_rt_entries);
+ }
+
net->ipv6.sysctl.flush_delay = 0;
- net->ipv6.sysctl.ip6_rt_max_size = 4096;
+ net->ipv6.sysctl.ip6_rt_max_size = ip6_rt_entries;
net->ipv6.sysctl.ip6_rt_gc_min_interval = HZ / 2;
net->ipv6.sysctl.ip6_rt_gc_timeout = 60*HZ;
net->ipv6.sysctl.ip6_rt_gc_interval = 30*HZ;
--
1.7.8.4


2012-05-25 20:27:01

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] net: compute a more reasonable default ip6_rt_max_size

From: Arun Sharma <[email protected]>
Date: Fri, 25 May 2012 13:15:34 -0700

> + /* Compute a reasonable default based on what we do for ipv4
> + * total size = 1/16th of total RAM
> + * No more than 512k entries unless overridden on kernel cmdline */

Please format this comment correctly:

/* Compute a reasonable default based on what we do for ipv4
* total size = 1/16th of total RAM
* No more than 512k entries unless overridden on kernel cmdline
*/

2012-05-25 20:47:07

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] net: compute a more reasonable default ip6_rt_max_size

On Fri, 2012-05-25 at 13:15 -0700, Arun Sharma wrote:
> The algorithm is based on ipv4 and alloc_large_system_hash().
>

Why is it needed at all ?

IPv4 has a route cache with potentially millions of entries, not IPv6.

2012-05-25 22:23:17

by Arun Sharma

[permalink] [raw]
Subject: Re: [PATCH] net: compute a more reasonable default ip6_rt_max_size

On 5/25/12 1:47 PM, Eric Dumazet wrote:
> On Fri, 2012-05-25 at 13:15 -0700, Arun Sharma wrote:
>> The algorithm is based on ipv4 and alloc_large_system_hash().
>>
>
> Why is it needed at all ?
>
> IPv4 has a route cache with potentially millions of entries, not IPv6.

With the default size of 4096 for the ipv6 routing table, entries can
get garbage collected and hosts could lose their default route and
therefore lose connectivity.

We actually saw it happen.

-Arun

2012-05-25 22:51:39

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] net: compute a more reasonable default ip6_rt_max_size

From: Arun Sharma <[email protected]>
Date: Fri, 25 May 2012 15:22:54 -0700

> On 5/25/12 1:47 PM, Eric Dumazet wrote:
>> On Fri, 2012-05-25 at 13:15 -0700, Arun Sharma wrote:
>>> The algorithm is based on ipv4 and alloc_large_system_hash().
>>>
>>
>> Why is it needed at all ?
>>
>> IPv4 has a route cache with potentially millions of entries, not IPv6.
>
> With the default size of 4096 for the ipv6 routing table, entries can
> get garbage collected and hosts could lose their default route and
> therefore lose connectivity.
>
> We actually saw it happen.

Under no circumstances should administrator configured ipv6 routes be
garbage collected, that is a bug.

2012-05-26 00:09:15

by Arun Sharma

[permalink] [raw]
Subject: Re: [PATCH] net: compute a more reasonable default ip6_rt_max_size

On 5/25/12 3:51 PM, David Miller wrote:
> From: Arun Sharma<[email protected]>
> Date: Fri, 25 May 2012 15:22:54 -0700
>
>> On 5/25/12 1:47 PM, Eric Dumazet wrote:
>>> On Fri, 2012-05-25 at 13:15 -0700, Arun Sharma wrote:
>>>> The algorithm is based on ipv4 and alloc_large_system_hash().
>>>>
>>>
>>> Why is it needed at all ?
>>>
>>> IPv4 has a route cache with potentially millions of entries, not IPv6.
>>
>> With the default size of 4096 for the ipv6 routing table, entries can
>> get garbage collected and hosts could lose their default route and
>> therefore lose connectivity.
>>
>> We actually saw it happen.
>
> Under no circumstances should administrator configured ipv6 routes be
> garbage collected, that is a bug.

These were not admin configured routes. They were discovered via ipv6
neighbor discovery.

-Arun

2012-05-26 00:11:55

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] net: compute a more reasonable default ip6_rt_max_size

From: Arun Sharma <[email protected]>
Date: Fri, 25 May 2012 17:08:59 -0700

> On 5/25/12 3:51 PM, David Miller wrote:
>> From: Arun Sharma<[email protected]>
>> Date: Fri, 25 May 2012 15:22:54 -0700
>>
>>> On 5/25/12 1:47 PM, Eric Dumazet wrote:
>>>> On Fri, 2012-05-25 at 13:15 -0700, Arun Sharma wrote:
>>>>> The algorithm is based on ipv4 and alloc_large_system_hash().
>>>>>
>>>>
>>>> Why is it needed at all ?
>>>>
>>>> IPv4 has a route cache with potentially millions of entries, not IPv6.
>>>
>>> With the default size of 4096 for the ipv6 routing table, entries can
>>> get garbage collected and hosts could lose their default route and
>>> therefore lose connectivity.
>>>
>>> We actually saw it happen.
>>
>> Under no circumstances should administrator configured ipv6 routes be
>> garbage collected, that is a bug.
>
> These were not admin configured routes. They were discovered via ipv6
> neighbor discovery.

Then such default routes should either be:

1) Passed over by GC

2) Trigger neighbour discovery when GC'd

2012-05-26 00:44:46

by Arun Sharma

[permalink] [raw]
Subject: Re: [PATCH] net: compute a more reasonable default ip6_rt_max_size

On 5/25/12 5:11 PM, David Miller wrote:

>> These were not admin configured routes. They were discovered via ipv6
>> neighbor discovery.
>
> Then such default routes should either be:
>
> 1) Passed over by GC
>
> 2) Trigger neighbour discovery when GC'd

It's possible that there is a bug somewhere - we didn't get a chance to
dig deeper. What we observed is that as we got close to the 4096 limit,
some hosts were becoming unreachable. A modest increase in the routing
table size made things better.

-Arun

2012-05-26 03:39:47

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] net: compute a more reasonable default ip6_rt_max_size

On Fri, 2012-05-25 at 17:44 -0700, Arun Sharma wrote:
> On 5/25/12 5:11 PM, David Miller wrote:
>
> >> These were not admin configured routes. They were discovered via ipv6
> >> neighbor discovery.
> >
> > Then such default routes should either be:
> >
> > 1) Passed over by GC
> >
> > 2) Trigger neighbour discovery when GC'd
>
> It's possible that there is a bug somewhere - we didn't get a chance to
> dig deeper. What we observed is that as we got close to the 4096 limit,
> some hosts were becoming unreachable. A modest increase in the routing
> table size made things better.
>
> -Arun

But your patch is not a "modest increase", so whats the deal ?

A modest increase would be 8192 instead of 4096, regardless of RAM size.


2012-05-26 04:17:55

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] net: compute a more reasonable default ip6_rt_max_size

On Sat, 2012-05-26 at 05:39 +0200, Eric Dumazet wrote:

> But your patch is not a "modest increase", so whats the deal ?
>
> A modest increase would be 8192 instead of 4096, regardless of RAM size.
>
>

More over, a boot parameter to tweak it is absolutely not needed

sysctl -w net.ipv6.route.max_size=16384

or

echo 16384 >/proc/sys/net/ipv6/route/max_size

IPv4 has to allocate a hash table at boot time, and this hash table is
not resized. Thus some really special purpose machines need a boot
param.



2012-05-27 03:54:40

by Arun Sharma

[permalink] [raw]
Subject: Re: [PATCH] net: compute a more reasonable default ip6_rt_max_size

On 5/25/12 8:39 PM, Eric Dumazet wrote:
>
> But your patch is not a "modest increase", so whats the deal ?
>
> A modest increase would be 8192 instead of 4096, regardless of RAM size.
>

Yes - 8192 solves our immediate problem, but I was worrying that the
problem might resurface as ipv6 adoption becomes more widespread.

We were testing a pre-3.0 kernel that didn't have Dave's DST_NOCOUNT
patch. Will retest with that patch applied.

> More over, a boot parameter to tweak it is absolutely not needed

Agreed. Will remove that part.

Still not sure why you'd like to go for one size regardless of
totalram_pages.

-Arun

2012-05-27 13:18:35

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] net: compute a more reasonable default ip6_rt_max_size

On Sat, 2012-05-26 at 20:54 -0700, Arun Sharma wrote:
> On 5/25/12 8:39 PM, Eric Dumazet wrote:
> >
> > But your patch is not a "modest increase", so whats the deal ?
> >
> > A modest increase would be 8192 instead of 4096, regardless of RAM size.
> >
>
> Yes - 8192 solves our immediate problem, but I was worrying that the
> problem might resurface as ipv6 adoption becomes more widespread.
>

Going from 4096 to 8192 is modest increase. If you put 65536, it should
be enough for the next years.

Your patch was increasing 4096 to 524288 (for 2GB of ram), which sounds
not modest at all.

> We were testing a pre-3.0 kernel that didn't have Dave's DST_NOCOUNT
> patch. Will retest with that patch applied.

Good

>
> > More over, a boot parameter to tweak it is absolutely not needed
>
> Agreed. Will remove that part.
>
> Still not sure why you'd like to go for one size regardless of
> totalram_pages.

Because size of IPv6 route table is not depending on RAM size, but on
number or IPv6 routes.

A router runs a piece of software complex enough to be able to adjust
the limit when needed, don't you think so ?

Your patch basically removes the whole idea of having a limit in the
first place. Why do we have a limit if you set it to four order of
magnitudes bigger than necessary ?



2012-05-31 00:00:04

by Lubashev, Igor

[permalink] [raw]
Subject: Re: [PATCH] net: compute a more reasonable default ip6_rt_max_size

>It's possible that there is a bug somewhere - we didn't get a chance to
>dig deeper. What we observed is that as we got close to the 4096 limit,
>some hosts were becoming unreachable. A modest increase in the routing
>table size made things better.

First of all, we have observed the same thing.

While I am not an expert in this area of the routing code, the function fib6_age in net/ipv6/ip6_fib.c puzzles me.


In kernel version 2.7.2.0.3, we have net/ipv6/ip6_fib.c:
static int fib6_age(struct rt6_info *rt, void *arg)
{
unsigned long now = jiffies;

if (rt->rt6i_flags&RTF_EXPIRES && rt->rt6i_expires) {
if (time_after(now, rt->rt6i_expires)) {
RT6_TRACE("expiring %p\n", rt);
return -1;
}
gc_args.more++;
} else if (rt->rt6i_flags & RTF_CACHE) {
if (atomic_read(&rt->dst.__refcnt) == 0 &&
time_after_eq(now, rt->dst.lastuse + gc_args.timeout)) {
RT6_TRACE("aging clone %p\n", rt);
return -1;
} else if ((rt->rt6i_flags & RTF_GATEWAY) &&
(!(rt->rt6i_nexthop->flags & NTF_ROUTER))) {
RT6_TRACE("purging route %p via non-router but gateway\n",
rt);
return -1;
}
gc_args.more++;
}

return 0;
}


In kernel 3.0.32, we have net/ipv6/ip6_fib.c:
static int fib6_age(struct rt6_info *rt, void *arg)
{
unsigned long now = jiffies;

if (rt->rt6i_flags&RTF_EXPIRES && rt->rt6i_expires) {
if (time_after(now, rt->rt6i_expires)) {
RT6_TRACE("expiring %p\n", rt);
return -1;
}
gc_args.more++;
} else if (rt->rt6i_flags & RTF_CACHE) {
if (atomic_read(&rt->dst.__refcnt) == 0 &&
time_after_eq(now, rt->dst.lastuse + gc_args.timeout)) {
RT6_TRACE("aging clone %p\n", rt);
return -1;
} else if ((rt->rt6i_flags & RTF_GATEWAY) &&
(!(dst_get_neighbour_raw(&rt->dst)->flags & NTF_ROUTER))) {
RT6_TRACE("purging route %p via non-router but gateway\n",
rt);
return -1;
}
gc_args.more++;
}

return 0;
}


In kernel 3.4, we have net/ipv6/ip6_fib.c:
static int fib6_age(struct rt6_info *rt, void *arg)
{
unsigned long now = jiffies;

if (rt->rt6i_flags & RTF_EXPIRES && rt->dst.expires) {
if (time_after(now, rt->dst.expires)) {
RT6_TRACE("expiring %p\n", rt);
return -1;
}
gc_args.more++;
} else if (rt->rt6i_flags & RTF_CACHE) {
if (atomic_read(&rt->dst.__refcnt) == 0 &&
time_after_eq(now, rt->dst.lastuse + gc_args.timeout)) {
RT6_TRACE("aging clone %p\n", rt);
return -1;
} else if (rt->rt6i_flags & RTF_GATEWAY) {
struct neighbour *neigh;
__u8 neigh_flags = 0;

neigh = dst_neigh_lookup(&rt->dst, &rt->rt6i_gateway);
if (neigh) {
neigh_flags = neigh->flags;
neigh_release(neigh);
}
if (neigh_flags & NTF_ROUTER) {
RT6_TRACE("purging route %p via non-router but gateway\n",
rt);
return -1;
}
}
gc_args.more++;
}

return 0;
}


Do we have the meaning of the NTF_ROUTER flag reversed in kernel 3.4? Or is the opposite use of that flag a fix for the bug in the previous releases? Or is this a bug in kernel 3.4?

Also, could this remove a Gateway entry, if there is no neighbor entry for it (in any of the version of the code)? Could this try to deference a null pointer in 3.0.32 version of the code (and any version prior to 3.4)? In general, is this the right place to remove a gateway route that has __refcnt > 0?

I wish I had more expertise in this area of the code to answer questions and not only to pose them.

Thank you,

- Igor

2012-06-04 19:04:29

by Lubashev, Igor

[permalink] [raw]
Subject: RE: [PATCH] net: compute a more reasonable default ip6_rt_max_size

David and Eric,

Any news about this? We definitely have many machines that are experiencing abnormal behavior under ipv6 load. The machines are healthier when the ipv6 route cache is increased to 64K, but I am afraid this is a band-aid that is hiding the actual problems.

So, could you address the concerns about the code in fib6_age? I can see three potential problems with it:

1. The meaning/use of NTF_ROUTER flag is inverted in 3.4

2. A potential NULL-pointer exception in pre-3.4 versions. In particular, "rt->rt6i_nexthop" (version 2.6.37) is checked for NULL in (almost?) all cases of referencing that field. I do not know for sure about "dst_get_neighbour_raw(&rt->dst)" in 3.0.32.

3. In all cases, an RTF_GATEWAY entry with __refcount > 0 may be garbage collected. That seems like a wrong thing to do. Is it?

Thank you!

- Igor


-----Original Message-----
From: Lubashev, Igor
Sent: Wednesday, May 30, 2012 7:50 PM
To: David Miller; Arun Sharma
Cc: [email protected]; [email protected]; [email protected]
Subject: Re: [PATCH] net: compute a more reasonable default ip6_rt_max_size

>It's possible that there is a bug somewhere - we didn't get a chance to
>dig deeper. What we observed is that as we got close to the 4096 limit,
>some hosts were becoming unreachable. A modest increase in the routing
>table size made things better.

First of all, we have observed the same thing.

While I am not an expert in this area of the routing code, the function fib6_age in net/ipv6/ip6_fib.c puzzles me.


In kernel version 2.6.37, we have net/ipv6/ip6_fib.c:
static int fib6_age(struct rt6_info *rt, void *arg) {
unsigned long now = jiffies;

if (rt->rt6i_flags&RTF_EXPIRES && rt->rt6i_expires) {
if (time_after(now, rt->rt6i_expires)) {
RT6_TRACE("expiring %p\n", rt);
return -1;
}
gc_args.more++;
} else if (rt->rt6i_flags & RTF_CACHE) {
if (atomic_read(&rt->dst.__refcnt) == 0 &&
time_after_eq(now, rt->dst.lastuse + gc_args.timeout)) {
RT6_TRACE("aging clone %p\n", rt);
return -1;
} else if ((rt->rt6i_flags & RTF_GATEWAY) &&
(!(rt->rt6i_nexthop->flags & NTF_ROUTER))) {
RT6_TRACE("purging route %p via non-router but gateway\n",
rt);
return -1;
}
gc_args.more++;
}

return 0;
}


In kernel 3.0.32, we have net/ipv6/ip6_fib.c:
static int fib6_age(struct rt6_info *rt, void *arg) {
unsigned long now = jiffies;

if (rt->rt6i_flags&RTF_EXPIRES && rt->rt6i_expires) {
if (time_after(now, rt->rt6i_expires)) {
RT6_TRACE("expiring %p\n", rt);
return -1;
}
gc_args.more++;
} else if (rt->rt6i_flags & RTF_CACHE) {
if (atomic_read(&rt->dst.__refcnt) == 0 &&
time_after_eq(now, rt->dst.lastuse + gc_args.timeout)) {
RT6_TRACE("aging clone %p\n", rt);
return -1;
} else if ((rt->rt6i_flags & RTF_GATEWAY) &&
(!(dst_get_neighbour_raw(&rt->dst)->flags & NTF_ROUTER))) {
RT6_TRACE("purging route %p via non-router but gateway\n",
rt);
return -1;
}
gc_args.more++;
}

return 0;
}


In kernel 3.4, we have net/ipv6/ip6_fib.c:
static int fib6_age(struct rt6_info *rt, void *arg) {
unsigned long now = jiffies;

if (rt->rt6i_flags & RTF_EXPIRES && rt->dst.expires) {
if (time_after(now, rt->dst.expires)) {
RT6_TRACE("expiring %p\n", rt);
return -1;
}
gc_args.more++;
} else if (rt->rt6i_flags & RTF_CACHE) {
if (atomic_read(&rt->dst.__refcnt) == 0 &&
time_after_eq(now, rt->dst.lastuse + gc_args.timeout)) {
RT6_TRACE("aging clone %p\n", rt);
return -1;
} else if (rt->rt6i_flags & RTF_GATEWAY) {
struct neighbour *neigh;
__u8 neigh_flags = 0;

neigh = dst_neigh_lookup(&rt->dst, &rt->rt6i_gateway);
if (neigh) {
neigh_flags = neigh->flags;
neigh_release(neigh);
}
if (neigh_flags & NTF_ROUTER) {
RT6_TRACE("purging route %p via non-router but gateway\n",
rt);
return -1;
}
}
gc_args.more++;
}

return 0;
}


Do we have the meaning of the NTF_ROUTER flag reversed in kernel 3.4? Or is the opposite use of that flag a fix for the bug in the previous releases? Or is this a bug in kernel 3.4?

Also, could this remove a Gateway entry, if there is no neighbor entry for it (in any of the version of the code)? Could this try to deference a null pointer in 3.0.32 version of the code (and any version prior to 3.4)? In general, is this the right place to remove a gateway route that has __refcnt > 0?

I wish I had more expertise in this area of the code to answer questions and not only to pose them.

Thank you,

- Igor