Hi!
We're investigating some approaches to improve isolation of syzkaller
programs. One of the ideas is to run each program in its own user/net
namespace. However, while I was experimenting with this, I stumbled
upon a problem.
It seems that cleanup_net() might take a very long time to execute.
I've attached the reproducer and kernel .config that I used. Run as
"./a.out 1". The reproducer just forks and does unshare(CLONE_NEWNET)
in a loop. Note that I have a lot of network-related configs enabled,
which causes a few interfaces to be set up by default.
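For reference, the attached reproducer is roughly equivalent to the
sketch below (the attachment is authoritative; argument handling is
omitted here):

/* Simplified sketch of the attached reproducer: fork children that
 * each unshare a new net namespace and exit, which queues the
 * namespace for cleanup_net(). */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	for (;;) {
		pid_t pid = fork();

		if (pid < 0) {
			perror("fork");
			exit(1);
		}
		if (pid == 0) {
			/* Child: create a fresh net namespace. */
			if (unshare(CLONE_NEWNET))
				perror("unshare(CLONE_NEWNET)");
			exit(0);
		}
		waitpid(pid, NULL, 0);
	}
}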
What I see with this reproducer is that at first a huge number
(~200-300) of net namespaces are created without any contention. But then
(probably when one of these namespaces gets destroyed) the program
hangs for a considerable amount of time (~100 seconds in my VM).
Nothing locks up inside the kernel and the CPU is mostly idle.
Adding debug printks showed that the part that takes almost all of
that time is the code between synchronize_rcu() and
mutex_unlock(&net_mutex) in cleanup_net(). Running perf showed that the
cause of this might be a lot of synchronize_net() calls that happen
while executing that code.
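For illustration, the timing can be measured with a throwaway hack
along these lines (a debug-only sketch, not the exact instrumentation
I used):

/* Throwaway debug sketch: time the tail of cleanup_net() between
 * synchronize_rcu() and mutex_unlock(&net_mutex). */
ktime_t t0, t1;

synchronize_rcu();

t0 = ktime_get();

/* ... the ops_exit_list()/ops_free_list() passes over pernet_list
 * run here, each potentially calling synchronize_net() ... */

t1 = ktime_get();
pr_info("cleanup_net: exit/free passes took %lld ms\n",
	ktime_ms_delta(t1, t0));

mutex_unlock(&net_mutex);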
Is there anything that can be done to speed up the creation/destruction
of a large number of net namespaces?
Running the reproducer with unshare(CLONE_NEWUSER) doesn't seem to
cause any delays.
Thanks!
On Fri, Apr 21, 2017 at 10:50 AM, Andrey Konovalov
<[email protected]> wrote:
> Hi!
>
> We're investigating some approaches to improve isolation of syzkaller
> programs. One of the ideas is to run each program in its own user/net
> namespace. However, while I was experimenting with this, I stumbled
> upon a problem.
>
> It seems that cleanup_net() might take a very long time to execute.
>
> I've attached the reproducer and kernel .config that I used. Run as
> "./a.out 1". The reproducer just forks and does unshare(CLONE_NEWNET)
> in a loop. Note that I have a lot of network-related configs enabled,
> which causes a few interfaces to be set up by default.
>
> What I see with this reproducer is that at first a huge number
> (~200-300) net namespaces are created without any contention. But then
> (probably when one of these namespaces gets destroyed) the program
> hangs for a considerable amount of time (~100 seconds in my vm).
> Nothing locks up inside the kernel and the CPU is mostly idle.
>
> Adding debug printfs showed that the part that takes almost all of
> that time is the lines between synchronize_rcu() and
> mutex_unlock(&net_mutex) in cleanup_net. Running perf showed that the
> cause of this might be a lot of calls to synchronize_net that happen
> while executing those lines.
>
> Is there any change that can be done to speed up the
> creation/destruction of a huge number of net namespaces?
>
We have batches, but fundamentally this is a hard problem to solve.
Every time we try, we add bugs :/
RTNL is today's BKL (the Big Kernel Lock of early Linux).
Even the synchronize_rcu_expedited() done from synchronize_net() is a
serious issue on some platforms.
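To be concrete, synchronize_net() switches to the expedited variant
whenever RTNL is held; roughly (paraphrasing net/core/dev.c from
memory):

void synchronize_net(void)
{
	might_sleep();
	if (rtnl_is_locked())
		synchronize_rcu_expedited();
	else
		synchronize_rcu();
}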
Eric Dumazet <[email protected]> wrote:
> On Fri, Apr 21, 2017 at 10:50 AM, Andrey Konovalov
> <[email protected]> wrote:
> > Hi!
> >
> > We're investigating some approaches to improve isolation of syzkaller
> > programs. One of the ideas is to run each program in its own user/net
> > namespace. However, while I was experimenting with this, I stumbled
> > upon a problem.
> >
> > It seems that cleanup_net() might take a very long time to execute.
> >
> > I've attached the reproducer and kernel .config that I used. Run as
> > "./a.out 1". The reproducer just forks and does unshare(CLONE_NEWNET)
> > in a loop. Note that I have a lot of network-related configs enabled,
> > which causes a few interfaces to be set up by default.
> >
> > What I see with this reproducer is that at first a huge number
> > (~200-300) net namespaces are created without any contention. But then
> > (probably when one of these namespaces gets destroyed) the program
> > hangs for a considerable amount of time (~100 seconds in my vm).
> > Nothing locks up inside the kernel and the CPU is mostly idle.
> >
> > Adding debug printfs showed that the part that takes almost all of
> > that time is the lines between synchronize_rcu() and
> > mutex_unlock(&net_mutex) in cleanup_net. Running perf showed that the
> > cause of this might be a lot of calls to synchronize_net that happen
> > while executing those lines.
> >
> > Is there any change that can be done to speed up the
> > creation/destruction of a huge number of net namespaces?
> >
>
> We have batches, but fundamentally this is a hard problem to solve.
>
> Every time we try, we add bugs :/
>
> RTNL is today's BKL (the Big Kernel Lock of early Linux).
>
> Even the synchronize_rcu_expedited() done from synchronize_net() is a
> serious issue on some platforms.
Indeed. Setting net.netfilter.nf_conntrack_default_on=0 cuts cleanup
time by 2/3 ...
nf unregister is way too happy to issue synchronize_net(), I'll work on
a fix.
On Fri, Apr 21, 2017 at 7:57 PM, Eric Dumazet <[email protected]> wrote:
> On Fri, Apr 21, 2017 at 10:50 AM, Andrey Konovalov
> <[email protected]> wrote:
>> Hi!
>>
>> We're investigating some approaches to improve isolation of syzkaller
>> programs. One of the ideas is to run each program in its own user/net
>> namespace. However, while I was experimenting with this, I stumbled
>> upon a problem.
>>
>> It seems that cleanup_net() might take a very long time to execute.
>>
>> I've attached the reproducer and kernel .config that I used. Run as
>> "./a.out 1". The reproducer just forks and does unshare(CLONE_NEWNET)
>> in a loop. Note that I have a lot of network-related configs enabled,
>> which causes a few interfaces to be set up by default.
>>
>> What I see with this reproducer is that at first a huge number
>> (~200-300) net namespaces are created without any contention. But then
>> (probably when one of these namespaces gets destroyed) the program
>> hangs for a considerable amount of time (~100 seconds in my vm).
>> Nothing locks up inside the kernel and the CPU is mostly idle.
>>
>> Adding debug printfs showed that the part that takes almost all of
>> that time is the lines between synchronize_rcu() and
>> mutex_unlock(&net_mutex) in cleanup_net. Running perf showed that the
>> cause of this might be a lot of calls to synchronize_net that happen
>> while executing those lines.
>>
>> Is there any change that can be done to speed up the
>> creation/destruction of a huge number of net namespaces?
>>
>
> We have batches, but fundamentally this is a hard problem to solve.
>
> Every time we try, we add bugs :/
>
> RTNL is today's BKL (the Big Kernel Lock of early Linux).
>
> Even the synchronize_rcu_expedited() done from synchronize_net() is a
> serious issue on some platforms.
Not sure how hard it is (I suppose hard), but if the whole cleanup_net is
structured as a series of call_rcu callbacks, it should give perfect
batching across all possible dimensions (e.g. across different
ops_exit_list calls and even across cleanup of different namespaces).
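Purely to illustrate the shape I have in mind (a hypothetical sketch,
all names invented, not a concrete proposal): instead of blocking in
synchronize_rcu()/synchronize_net(), each teardown stage would schedule
the next one with call_rcu(), so all namespace batches in flight share
the same grace periods:

/* Hypothetical sketch only: chain cleanup stages via call_rcu() instead
 * of blocking in synchronize_rcu(). All names below are invented.
 * (Real pernet exit handlers can sleep, so each stage would have to be
 * bounced to process context, e.g. a workqueue, rather than run
 * directly from the RCU callback.) */
struct net_cleanup_batch {
	struct rcu_head rcu;
	struct list_head net_exit_list;	/* namespaces being destroyed */
	int stage;
};

static void net_cleanup_batch_rcu(struct rcu_head *head)
{
	struct net_cleanup_batch *b =
		container_of(head, struct net_cleanup_batch, rcu);

	/* Run the next ops_exit_list()-style pass for the whole batch. */
	run_cleanup_stage(b->stage, &b->net_exit_list);	/* invented */

	if (++b->stage < NR_CLEANUP_STAGES)	/* invented */
		call_rcu(&b->rcu, net_cleanup_batch_rcu); /* next stage after a GP */
	else
		kfree(b);
}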
Florian Westphal <[email protected]> wrote:
> Indeed. Setting net.netfilter.nf_conntrack_default_on=0 cuts cleanup
> time by 2/3 ...
>
> nf unregister is way too happy to issue synchronize_net(), I'll work on
> a fix.
I'll test this patch as a start. Maybe we can also leverage exit_batch
more on the netfilter side (see the sketch after the patch below).
diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index a87a6f8a74d8..08fe1f526265 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -126,14 +126,15 @@ int nf_register_net_hook(struct net *net, const struct nf_hook_ops *reg)
 }
 EXPORT_SYMBOL(nf_register_net_hook);
 
-void nf_unregister_net_hook(struct net *net, const struct nf_hook_ops *reg)
+static struct nf_hook_entry *
+__nf_unregister_net_hook(struct net *net, const struct nf_hook_ops *reg)
 {
 	struct nf_hook_entry __rcu **pp;
 	struct nf_hook_entry *p;
 
 	pp = nf_hook_entry_head(net, reg);
 	if (WARN_ON_ONCE(!pp))
-		return;
+		return NULL;
 
 	mutex_lock(&nf_hook_mutex);
 	for (; (p = nf_entry_dereference(*pp)) != NULL; pp = &p->next) {
@@ -145,7 +146,7 @@ void nf_unregister_net_hook(struct net *net, const struct nf_hook_ops *reg)
 	mutex_unlock(&nf_hook_mutex);
 	if (!p) {
 		WARN(1, "nf_unregister_net_hook: hook not found!\n");
-		return;
+		return NULL;
 	}
 #ifdef CONFIG_NETFILTER_INGRESS
 	if (reg->pf == NFPROTO_NETDEV && reg->hooknum == NF_NETDEV_INGRESS)
@@ -154,6 +155,17 @@ void nf_unregister_net_hook(struct net *net, const struct nf_hook_ops *reg)
 #ifdef HAVE_JUMP_LABEL
 	static_key_slow_dec(&nf_hooks_needed[reg->pf][reg->hooknum]);
 #endif
+
+	return p;
+}
+
+void nf_unregister_net_hook(struct net *net, const struct nf_hook_ops *reg)
+{
+	struct nf_hook_entry *p = __nf_unregister_net_hook(net, reg);
+
+	if (!p)
+		return;
+
 	synchronize_net();
 	nf_queue_nf_hook_drop(net, p);
 	/* other cpu might still process nfqueue verdict that used reg */
@@ -183,10 +195,36 @@ int nf_register_net_hooks(struct net *net, const struct nf_hook_ops *reg,
 EXPORT_SYMBOL(nf_register_net_hooks);
 
 void nf_unregister_net_hooks(struct net *net, const struct nf_hook_ops *reg,
-			     unsigned int n)
+			     unsigned int hookcount)
 {
-	while (n-- > 0)
-		nf_unregister_net_hook(net, &reg[n]);
+	struct nf_hook_entry *to_free[16];
+	unsigned int i, n;
+
+	WARN_ON_ONCE(hookcount > ARRAY_SIZE(to_free));
+
+ next_round:
+	n = min_t(unsigned int, hookcount, ARRAY_SIZE(to_free));
+
+	for (i = 0; i < n; i++)
+		to_free[i] = __nf_unregister_net_hook(net, &reg[i]);
+
+	synchronize_net();
+
+	for (i = 0; i < n; i++) {
+		if (to_free[i])
+			nf_queue_nf_hook_drop(net, to_free[i]);
+	}
+
+	synchronize_net();
+
+	for (i = 0; i < n; i++)
+		kfree(to_free[i]);
+
+	if (n < hookcount) {
+		hookcount -= n;
+		reg += n;
+		goto next_round;
+	}
 }
 EXPORT_SYMBOL(nf_unregister_net_hooks);
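To sketch the exit_batch idea mentioned above: cleanup_net() hands
->exit_batch the whole list of dying namespaces in one call, so a
subsystem can pay for synchronize_net() once per batch instead of once
per namespace. Hypothetical example, the per-net helpers are invented:

/* Hypothetical example of an exit_batch-style teardown: one
 * synchronize_net() covers every namespace in the dying batch. */
static void example_exit_batch(struct list_head *net_exit_list)
{
	struct net *net;

	/* Unhook everything in every dying namespace first... */
	list_for_each_entry(net, net_exit_list, exit_list)
		example_unregister_hooks(net);	/* invented helper */

	/* ...then wait for a single grace period for the whole batch... */
	synchronize_net();

	/* ...and only then free the per-namespace state. */
	list_for_each_entry(net, net_exit_list, exit_list)
		example_free_state(net);	/* invented helper */
}

static struct pernet_operations example_net_ops = {
	.exit_batch = example_exit_batch,
};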
On Fri, Apr 21, 2017 at 9:45 PM, Florian Westphal <[email protected]> wrote:
> Florian Westphal <[email protected]> wrote:
>> Indeed. Setting net.netfilter.nf_conntrack_default_on=0 cuts cleanup
>> time by 2/3 ...
>>
>> nf unregister is way too happy to issue synchronize_net(), I'll work on
>> a fix.
>
> I'll test this patch as a start. Maybe we can also leverage exit_batch
> more on netfilter side.
Hi Florian,
Your patch improves fuzzing speed by at least 2x, which is a great start!
Thanks!
>
> diff --git a/net/netfilter/core.c b/net/netfilter/core.c
> index a87a6f8a74d8..08fe1f526265 100644
> --- a/net/netfilter/core.c
> +++ b/net/netfilter/core.c
> @@ -126,14 +126,15 @@ int nf_register_net_hook(struct net *net, const struct nf_hook_ops *reg)
> }
> EXPORT_SYMBOL(nf_register_net_hook);
>
> -void nf_unregister_net_hook(struct net *net, const struct nf_hook_ops *reg)
> +static struct nf_hook_entry *
> +__nf_unregister_net_hook(struct net *net, const struct nf_hook_ops *reg)
> {
> struct nf_hook_entry __rcu **pp;
> struct nf_hook_entry *p;
>
> pp = nf_hook_entry_head(net, reg);
> if (WARN_ON_ONCE(!pp))
> - return;
> + return NULL;
>
> mutex_lock(&nf_hook_mutex);
> for (; (p = nf_entry_dereference(*pp)) != NULL; pp = &p->next) {
> @@ -145,7 +146,7 @@ void nf_unregister_net_hook(struct net *net, const struct nf_hook_ops *reg)
> mutex_unlock(&nf_hook_mutex);
> if (!p) {
> WARN(1, "nf_unregister_net_hook: hook not found!\n");
> - return;
> + return NULL;
> }
> #ifdef CONFIG_NETFILTER_INGRESS
> if (reg->pf == NFPROTO_NETDEV && reg->hooknum == NF_NETDEV_INGRESS)
> @@ -154,6 +155,17 @@ void nf_unregister_net_hook(struct net *net, const struct nf_hook_ops *reg)
> #ifdef HAVE_JUMP_LABEL
> static_key_slow_dec(&nf_hooks_needed[reg->pf][reg->hooknum]);
> #endif
> +
> + return p;
> +}
> +
> +void nf_unregister_net_hook(struct net *net, const struct nf_hook_ops *reg)
> +{
> + struct nf_hook_entry *p = __nf_unregister_net_hook(net, reg);
> +
> + if (!p)
> + return;
> +
> synchronize_net();
> nf_queue_nf_hook_drop(net, p);
> /* other cpu might still process nfqueue verdict that used reg */
> @@ -183,10 +195,36 @@ int nf_register_net_hooks(struct net *net, const struct nf_hook_ops *reg,
> EXPORT_SYMBOL(nf_register_net_hooks);
>
> void nf_unregister_net_hooks(struct net *net, const struct nf_hook_ops *reg,
> - unsigned int n)
> + unsigned int hookcount)
> {
> - while (n-- > 0)
> -		nf_unregister_net_hook(net, &reg[n]);
> + struct nf_hook_entry *to_free[16];
> + unsigned int i, n;
> +
> + WARN_ON_ONCE(hookcount > ARRAY_SIZE(to_free));
> +
> + next_round:
> + n = min_t(unsigned int, hookcount, ARRAY_SIZE(to_free));
> +
> + for (i = 0; i < n; i++)
> +		to_free[i] = __nf_unregister_net_hook(net, &reg[i]);
> +
> + synchronize_net();
> +
> + for (i = 0; i < n; i++) {
> + if (to_free[i])
> + nf_queue_nf_hook_drop(net, to_free[i]);
> + }
> +
> + synchronize_net();
> +
> + for (i = 0; i < n; i++)
> + kfree(to_free[i]);
> +
> + if (n < hookcount) {
> + hookcount -= n;
> + reg += n;
> + goto next_round;
> + }
> }
> EXPORT_SYMBOL(nf_unregister_net_hooks);
>
Andrey Konovalov <[email protected]> wrote:
> On Fri, Apr 21, 2017 at 9:45 PM, Florian Westphal <[email protected]> wrote:
> > Florian Westphal <[email protected]> wrote:
> >> Indeed. Setting net.netfilter.nf_conntrack_default_on=0 cuts cleanup
> >> time by 2/3 ...
> >>
> >> nf unregister is way too happy to issue synchronize_net(), I'll work on
> >> a fix.
> >
> > I'll test this patch as a start. Maybe we can also leverage exit_batch
> > more on netfilter side.
>
> Hi Florian,
>
> Your patch improves fuzzing speed by at least 2x, which is a great start!
Great. I'll try to push the patches I have to nf-next ASAP.