The net.ipv6.route.flush system parameter takes a value which specifies
a delay used during the flush operation for aging exception routes. The
written value is however not used in the currently requested flush and
instead utilized only in the next one.

A problem is that ipv6_sysctl_rtcache_flush() first reads the old value
of net->ipv6.sysctl.flush_delay into a local delay variable and then
calls proc_dointvec() which actually updates the sysctl based on the
provided input.

Fix the problem by removing net->ipv6.sysctl.flush_delay because the
value is never actually used after the flush operation and instead use
a temporary ctl_table in ipv6_sysctl_rtcache_flush() pointing directly
to the local delay variable.

Fixes: 4990509f19e8 ("[NETNS][IPV6]: Make sysctls route per namespace.")
Signed-off-by: Petr Pavlu <[email protected]>
---
Note that when testing this fix, I noticed that an aging exception route
(created via ICMP redirect) was not getting removed when triggering the
flush operation unless the associated fib6_info was an expiring route.
It looks like the logic introduced in 5eb902b8e719 ("net/ipv6: Remove
expired routes with a separated list of routes.") otherwise missed
registering the fib6_info with the GC. That is potentially a separate
issue; I am just noting it here in case someone decides to test this
patch and runs into the same problem.

 include/net/netns/ipv6.h |  1 -
 net/ipv6/route.c         | 13 ++++++-------
 2 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h
index 5f2cfd84570a..2ed7659013a4 100644
--- a/include/net/netns/ipv6.h
+++ b/include/net/netns/ipv6.h
@@ -20,7 +20,6 @@ struct netns_sysctl_ipv6 {
 	struct ctl_table_header *frags_hdr;
 	struct ctl_table_header *xfrm6_hdr;
 #endif
-	int flush_delay;
 	int ip6_rt_max_size;
 	int ip6_rt_gc_min_interval;
 	int ip6_rt_gc_timeout;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index bbc2a0dd9314..f07f050003c3 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -6335,15 +6335,17 @@ static int rt6_stats_seq_show(struct seq_file *seq, void *v)
 static int ipv6_sysctl_rtcache_flush(struct ctl_table *ctl, int write,
 			      void *buffer, size_t *lenp, loff_t *ppos)
 {
-	struct net *net;
+	struct net *net = ctl->extra1;
+	struct ctl_table lctl;
 	int delay;
 	int ret;
+
 	if (!write)
 		return -EINVAL;
 
-	net = (struct net *)ctl->extra1;
-	delay = net->ipv6.sysctl.flush_delay;
-	ret = proc_dointvec(ctl, write, buffer, lenp, ppos);
+	lctl = *ctl;
+	lctl.data = &delay;
+	ret = proc_dointvec(&lctl, write, buffer, lenp, ppos);
 	if (ret)
 		return ret;
 
@@ -6368,7 +6370,6 @@ static struct ctl_table ipv6_route_table_template[] = {
 	},
 	{
 		.procname = "flush",
-		.data = &init_net.ipv6.sysctl.flush_delay,
 		.maxlen = sizeof(int),
 		.mode = 0200,
 		.proc_handler = ipv6_sysctl_rtcache_flush
@@ -6444,7 +6445,6 @@ struct ctl_table * __net_init ipv6_route_sysctl_init(struct net *net)
 	if (table) {
 		table[0].data = &net->ipv6.sysctl.ip6_rt_max_size;
 		table[1].data = &net->ipv6.ip6_dst_ops.gc_thresh;
-		table[2].data = &net->ipv6.sysctl.flush_delay;
 		table[2].extra1 = net;
 		table[3].data = &net->ipv6.sysctl.ip6_rt_gc_min_interval;
 		table[4].data = &net->ipv6.sysctl.ip6_rt_gc_timeout;
@@ -6521,7 +6521,6 @@ static int __net_init ip6_route_net_init(struct net *net)
 #endif
 #endif
 
-	net->ipv6.sysctl.flush_delay = 0;
 	net->ipv6.sysctl.ip6_rt_max_size = INT_MAX;
 	net->ipv6.sysctl.ip6_rt_gc_min_interval = HZ / 2;
 	net->ipv6.sysctl.ip6_rt_gc_timeout = 60*HZ;

base-commit: 2bfcfd584ff5ccc8bb7acde19b42570414bf880b
--
2.35.3
[Added back [email protected] and [email protected]
which seem to have been dropped by accident.]

On 5/30/24 17:59, Kuifeng Lee wrote:
> On Wed, May 29, 2024 at 6:53 AM Petr Pavlu <[email protected]> wrote:
>>
>> The net.ipv6.route.flush system parameter takes a value which specifies
>> a delay used during the flush operation for aging exception routes. The
>> written value is however not used in the currently requested flush and
>> instead utilized only in the next one.
>>
>> A problem is that ipv6_sysctl_rtcache_flush() first reads the old value
>> of net->ipv6.sysctl.flush_delay into a local delay variable and then
>> calls proc_dointvec() which actually updates the sysctl based on the
>> provided input.
>
> If the problem we are trying to fix is using the old value, should we move
> the line reading the value to a place after updating it instead of making a
> local copy of the whole ctl_table?

Just moving the read of net->ipv6.sysctl.flush_delay after the
proc_dointvec() call was actually my initial implementation. I then
opted for the proposed version because it looked useful to me to save
memory used to store net->ipv6.sysctl.flush_delay.

Another minor aspect is that these sysctl writes are not serialized. Two
invocations of ipv6_sysctl_rtcache_flush() could in theory occur at the
same time. It can then happen that they both first execute
proc_dointvec(). One of them ends up slower and thus its value gets
stored in net->ipv6.sysctl.flush_delay. Both runs then return to
ipv6_sysctl_rtcache_flush(), read the stored value and execute
fib6_run_gc(). It means one of them calls this function with a value
different from the one it was actually given on input. By having a purely
local variable, each write is independent and fib6_run_gc() is executed
with the right input delay.
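
To illustrate the interleaving described above, here is a minimal
user-space sketch, not kernel code. The names shared_delay, parse_into(),
two_writers_shared() and two_writers_local() are made up for the
illustration; parse_into() merely stands in for proc_dointvec(), and the
two writers are run back to back in the problematic order rather than on
real threads.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for net->ipv6.sysctl.flush_delay. */
static int shared_delay;

/* Hypothetical stand-in for proc_dointvec(): parse input into *dst. */
static void parse_into(int *dst, const char *input)
{
	assert(dst && input);
	*dst = atoi(input);
}

/*
 * Two writers using shared storage, in the interleaving described above:
 * both parse first, then both read the stored value back.
 */
void two_writers_shared(const char *in_a, const char *in_b,
			int *used_a, int *used_b)
{
	parse_into(&shared_delay, in_a);	/* writer A stores first */
	parse_into(&shared_delay, in_b);	/* writer B stores last */
	*used_a = shared_delay;			/* A now sees B's value */
	*used_b = shared_delay;
}

/* Same interleaving, but each writer parses into its own local variable. */
void two_writers_local(const char *in_a, const char *in_b,
		       int *used_a, int *used_b)
{
	int delay_a, delay_b;

	parse_into(&delay_a, in_a);
	parse_into(&delay_b, in_b);
	*used_a = delay_a;
	*used_b = delay_b;
}
```

With inputs "10" and "20", the shared variant hands 20 to both writers,
while the local variant hands each writer its own value.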
The cost of making a copy of ctl_table is a few instructions and this
isn't on any hot path. The same pattern is used, for example, in
net/ipv6/addrconf.c, function addrconf_sysctl_forward().
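
The copy-and-redirect pattern mentioned above can be sketched in user
space roughly as follows. struct table_entry, parse_int_entry() and
handle_write() are hypothetical stand-ins (for struct ctl_table,
proc_dointvec() and the sysctl handler, respectively), so this only
models the idea and is not the kernel API.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, minimal analogue of struct ctl_table. */
struct table_entry {
	const char *procname;
	void *data;		/* where the parsed value gets stored */
	int maxlen;
};

/* Hypothetical stand-in for proc_dointvec(): parse into entry->data. */
int parse_int_entry(const struct table_entry *entry, const char *input)
{
	assert(entry && entry->data && input);
	*(int *)entry->data = atoi(input);
	return 0;
}

/* Handler that redirects the write into a local through a copied entry. */
int handle_write(const struct table_entry *entry, const char *input,
		 int *delay_out)
{
	struct table_entry lentry = *entry;	/* shallow copy of the entry */
	int delay;
	int ret;

	lentry.data = &delay;	/* the registered entry stays untouched */
	ret = parse_int_entry(&lentry, input);
	if (ret)
		return ret;

	*delay_out = delay;	/* value a flush would be run with */
	return 0;
}
```

The registered entry needs no .data pointer at all here, which mirrors
why the patch can drop the flush_delay field.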

So overall, the proposed version looked marginally better to me than
just moving the read of net->ipv6.sysctl.flush_delay later in
ipv6_sysctl_rtcache_flush().

Thanks,
Petr
On Fri, 2024-05-31 at 10:53 +0200, Petr Pavlu wrote:
> Just moving the read of net->ipv6.sysctl.flush_delay after the
> proc_dointvec() call was actually my initial implementation. I then
> opted for the proposed version because it looked useful to me to save
> memory used to store net->ipv6.sysctl.flush_delay.

Note that due to alignment, the size of struct netns_sysctl_ipv6 is not
going to change on 64-bit builds.

And if the layout did change, that could have subtle performance side
effects (moving later fields in netns_sysctl_ipv6 into different
cachelines) that we want to avoid for a net patch.
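
As a rough illustration of both points, consider two hypothetical,
trimmed-down layouts (they are not the real struct netns_sysctl_ipv6, and
the single hdr pointer is an assumption made for the example). Assuming
8-byte pointers and 4-byte ints, dropping one int leaves the padded size
the same, while every later field moves to a lower offset:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout that still carries the flush_delay field. */
struct with_delay {
	void *hdr;		/* 8-byte size and alignment assumed */
	int flush_delay;
	int ip6_rt_max_size;
	int ip6_rt_gc_min_interval;
	int ip6_rt_gc_timeout;
};

/* Same hypothetical layout with flush_delay removed. */
struct without_delay {
	void *hdr;
	int ip6_rt_max_size;
	int ip6_rt_gc_min_interval;
	int ip6_rt_gc_timeout;
};

/*
 * With 8-byte pointers and 4-byte ints: 8 + 4 * 4 = 24 bytes versus
 * 8 + 3 * 4 = 20 bytes, which is padded back up to 24.
 */
static_assert(sizeof(void *) != 8 || sizeof(int) != 4 ||
	      sizeof(struct with_delay) == sizeof(struct without_delay),
	      "size expected to be unchanged with 8-byte pointers");

/* Bytes by which ip6_rt_max_size moves once flush_delay is dropped. */
size_t max_size_shift(void)
{
	return offsetof(struct with_delay, ip6_rt_max_size) -
	       offsetof(struct without_delay, ip6_rt_max_size);
}
```

So no memory is saved in this model, yet the remaining int fields all
shift by sizeof(int), which is the kind of layout change mentioned above.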
> Another minor aspect is that these sysctl writes are not serialized. Two
> invocations of ipv6_sysctl_rtcache_flush() could in theory occur at the
> same time. It can then happen that they both first execute
> proc_dointvec(). One of them ends up slower and thus its value gets
> stored in net->ipv6.sysctl.flush_delay. Both runs then return to
> ipv6_sysctl_rtcache_flush(), read the stored value and execute
> fib6_run_gc(). It means one of them calls this function with a value
> different from the one it was actually given on input. By having a purely local
> variable, each write is independent and fib6_run_gc() is executed with
> the right input delay.
>
> The cost of making a copy of ctl_table is a few instructions and this
> isn't on any hot path. The same pattern is used, for example, in
> net/ipv6/addrconf.c, function addrconf_sysctl_forward().
>
> So overall, the proposed version looked marginally better to me than
> just moving the read of net->ipv6.sysctl.flush_delay later in
> ipv6_sysctl_rtcache_flush().

All in all, the increased complexity compared with the simple solution
does not look worth it to me.

Please revert to the initial/simpler implementation for this fix,
thanks!

Paolo
On 6/4/24 10:30, Paolo Abeni wrote:
> All in all, the increased complexity compared with the simple solution
> does not look worth it to me.
>
> Please revert to the initial/simpler implementation for this fix,
> thanks!

Fair enough, I'll post v2 with the initial/simpler version.

Thanks,
Petr
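
For reference, the difference between the original ordering and the
simpler fix discussed above can be modelled in user space as below. This
is only a sketch and not the v2 patch: flush_delay, parse_write(),
flush_stale() and flush_fresh() are hypothetical names, with
parse_write() standing in for proc_dointvec() and the return value
standing in for the delay that would be handed to fib6_run_gc().

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for net->ipv6.sysctl.flush_delay, initially 0. */
static int flush_delay;

/* Hypothetical stand-in for proc_dointvec(): store the parsed input. */
static void parse_write(int *dst, const char *input)
{
	assert(dst && input);
	*dst = atoi(input);
}

/* Original order: the delay is sampled before the write is parsed. */
int flush_stale(const char *input)
{
	int delay = flush_delay;	/* value left by the previous write */

	parse_write(&flush_delay, input);
	return delay;			/* delay this flush would use */
}

/* Simpler fix: parse first, then read the freshly stored value. */
int flush_fresh(const char *input)
{
	parse_write(&flush_delay, input);
	return flush_delay;		/* delay this flush would use */
}
```

Writing "5" and then "7" through flush_stale() yields 0 and then 5, so
each flush runs with the previous write's value; flush_fresh() returns
the value that was just written.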