2016-11-09 15:42:21

by Rolf Neugebauer

Subject: Long delays creating a netns after deleting one (possibly RCU related)

Hi

We noticed some long delays starting docker containers on some newer
kernels (starting with 4.5.x and still present in 4.9-rc4, 4.4.x is
fine). We narrowed this down to the creation of a network namespace
being delayed directly after removing another one (details and
reproduction below). We have seen delays of up to 60s on some systems.

- The delay is proportional to the number of CPUs (online or offline).
We first discovered it with a Hyper-V Linux VM. Hyper-V advertises up
to 240 offline vCPUs even if one configures the VM with only, say, 2
vCPUs. We see a linear increase in delay when we change NR_CPUS in the
kernel config.

- The delay also depends on certain tunnel network interfaces being
present (which we had compiled into the kernel in one of our configs).

- We can reproduce this issue with stock kernels from
http://kernel.ubuntu.com/~kernel-ppa/mainline/ running in Hyper-V VMs
as well as other hypervisors like qemu and hyperkit where we have good
control over the number of CPUs.

A simple test is:
modprobe ipip
modprobe ip_gre
modprobe ip_vti
echo -n "add netns foo ===> "; /usr/bin/time -f "%E" ip netns add foo
echo -n "del netns foo ===> "; /usr/bin/time -f "%E" ip netns delete foo
echo -n "add netns bar ===> "; /usr/bin/time -f "%E" ip netns add bar
echo -n "del netns bar ===> "; /usr/bin/time -f "%E" ip netns delete bar

with an output like:
add netns foo ===> 0:00.00
del netns foo ===> 0:00.01
add netns bar ===> 0:08.53
del netns bar ===> 0:00.01

This is on a 4.9-rc4 kernel from the above URL configured with
NR_CPUS=256 running in a Hyper-V VM (kernel config attached).
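
As a sanity check when reproducing, the number of possible CPUs the kernel
sees (which is what the delay scales with) and the configured NR_CPUS limit
can be read back with something like this (the config path assumes a
Debian/Ubuntu-style /boot layout):

# CPUs the kernel considers possible (online + offline)
cat /sys/devices/system/cpu/possible
# NR_CPUS limit the running kernel was built with
grep CONFIG_NR_CPUS= /boot/config-$(uname -r)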

Below is a dump of the blocked tasks (SysRq-W) while the second 'ip netns
add' is hanging. The state does not seem to change while the command is
delayed, and the pattern shown is consistent across different kernel
versions.

Is this a known issue and/or is someone working on a fix?

Thanks
Rolf
[I'm not subscribed to linux-kernel@ nor netdev@, please cc me on replies]


[ 610.356272] sysrq: SysRq : Show Blocked State
[ 610.356742] task PC stack pid father
[ 610.357252] kworker/u480:1 D 0 1994 2 0x00000000
[ 610.357752] Workqueue: netns cleanup_net
[ 610.358239] ffff9892f1065800 0000000000000000 ffff9892ee1e1e00
ffff9892f8e59340
[ 610.358705] ffff9892f4526900 ffffbf0104b5ba88 ffffffffbe486df3
ffffbf0104b5ba60
[ 610.359168] 00ffffffbdcbe663 ffff9892f8e59340 0000000100012e70
ffff9892ee1e1e00
[ 610.359677] Call Trace:
[ 610.360169] [<ffffffffbe486df3>] ? __schedule+0x233/0x6e0
[ 610.360723] [<ffffffffbe4872d6>] schedule+0x36/0x80
[ 610.361194] [<ffffffffbe48a9ca>] schedule_timeout+0x22a/0x3f0
[ 610.361789] [<ffffffffbe486dfb>] ? __schedule+0x23b/0x6e0
[ 610.362260] [<ffffffffbe487d24>] wait_for_completion+0xb4/0x140
[ 610.362736] [<ffffffffbdcb05a0>] ? wake_up_q+0x80/0x80
[ 610.363306] [<ffffffffbdceb528>] __wait_rcu_gp+0xc8/0xf0
[ 610.363782] [<ffffffffbdceea5c>] synchronize_sched+0x5c/0x80
[ 610.364137] [<ffffffffbdcf0010>] ? call_rcu_bh+0x20/0x20
[ 610.364742] [<ffffffffbdceb440>] ?
trace_raw_output_rcu_utilization+0x60/0x60
[ 610.365337] [<ffffffffbe3696bc>] synchronize_net+0x1c/0x30
[ 610.365846] [<ffffffffbe369803>] netif_napi_del+0x23/0x80
[ 610.367494] [<ffffffffc057f6f8>] ip_tunnel_dev_free+0x68/0xf0 [ip_tunnel]
[ 610.368007] [<ffffffffbe372c10>] netdev_run_todo+0x230/0x330
[ 610.368454] [<ffffffffbe37eb4e>] rtnl_unlock+0xe/0x10
[ 610.369001] [<ffffffffc057f4df>] ip_tunnel_delete_net+0xdf/0x120 [ip_tunnel]
[ 610.369500] [<ffffffffc058b92c>] ipip_exit_net+0x2c/0x30 [ipip]
[ 610.369997] [<ffffffffbe362688>] ops_exit_list.isra.4+0x38/0x60
[ 610.370636] [<ffffffffbe363674>] cleanup_net+0x1c4/0x2b0
[ 610.371130] [<ffffffffbdc9e4ac>] process_one_work+0x1fc/0x4b0
[ 610.371812] [<ffffffffbdc9e7ab>] worker_thread+0x4b/0x500
[ 610.373074] [<ffffffffbdc9e760>] ? process_one_work+0x4b0/0x4b0
[ 610.373622] [<ffffffffbdc9e760>] ? process_one_work+0x4b0/0x4b0
[ 610.374100] [<ffffffffbdca4b09>] kthread+0xd9/0xf0
[ 610.374574] [<ffffffffbdca4a30>] ? kthread_park+0x60/0x60
[ 610.375198] [<ffffffffbe48c2b5>] ret_from_fork+0x25/0x30
[ 610.375678] ip D 0 2149 2148 0x00000000
[ 610.376185] ffff9892f0a99000 0000000000000000 ffff9892f0a66900
[ 610.376185] ffff9892f8e59340
[ 610.376185] ffff9892f4526900 ffffbf0101173db8 ffffffffbe486df3
[ 610.376753] 00000005fecffd76
[ 610.376762] 00ff9892f11d9820 ffff9892f8e59340 ffff989200000000
ffff9892f0a66900
[ 610.377274] Call Trace:
[ 610.377789] [<ffffffffbe486df3>] ? __schedule+0x233/0x6e0
[ 610.378306] [<ffffffffbe4872d6>] schedule+0x36/0x80
[ 610.378992] [<ffffffffbe48756e>] schedule_preempt_disabled+0xe/0x10
[ 610.379514] [<ffffffffbe489199>] __mutex_lock_slowpath+0xb9/0x130
[ 610.380031] [<ffffffffbde0fce2>] ? __kmalloc+0x162/0x1e0
[ 610.380556] [<ffffffffbe48922f>] mutex_lock+0x1f/0x30
[ 610.381135] [<ffffffffbe3637ff>] copy_net_ns+0x9f/0x170
[ 610.381647] [<ffffffffbdca5e6b>] create_new_namespaces+0x11b/0x200
[ 610.382249] [<ffffffffbdca60fa>] unshare_nsproxy_namespaces+0x5a/0xb0
[ 610.382818] [<ffffffffbdc82dcd>] SyS_unshare+0x1cd/0x360
[ 610.383319] [<ffffffffbe48c03b>] entry_SYSCALL_64_fastpath+0x1e/0xad


Attachments:
config-4.9.0-040900rc4-generic (196.33 kB)

2016-11-10 17:38:13

by Cong Wang

Subject: Re: Long delays creating a netns after deleting one (possibly RCU related)

(Cc'ing Paul)

On Wed, Nov 9, 2016 at 7:42 AM, Rolf Neugebauer
<[email protected]> wrote:
> Hi
>
> We noticed some long delays starting docker containers on some newer
> kernels (starting with 4.5.x and still present in 4.9-rc4, 4.4.x is
> fine). We narrowed this down to the creation of a network namespace
> being delayed directly after removing another one (details and
> reproduction below). We have seen delays of up to 60s on some systems.
>
> - The delay is proportional to the number of CPUs (online or offline).
> We first discovered it with a Hyper-V Linux VM. Hyper-V advertises up
> to 240 offline vCPUs even if one configures the VM with only, say 2
> vCPUs. We see linear increase in delay when we change NR_CPUS in the
> kernel config.
>
> - The delay is also dependent on some tunnel network interfaces being
> present (which we had compiled in in one of our kernel configs).
>
> - We can reproduce this issue with stock kernels from
> http://kernel.ubuntu.com/~kernel-ppa/mainline/running in Hyper-V VMs
> as well as other hypervisors like qemu and hyperkit where we have good
> control over the number of CPUs.
>
> A simple test is:
> modprobe ipip
> moprobe ip_gre
> modprobe ip_vti
> echo -n "add netns foo ===> "; /usr/bin/time -f "%E" ip netns add foo
> echo -n "del netns foo ===> "; /usr/bin/time -f "%E" ip netns delete foo
> echo -n "add netns bar ===> "; /usr/bin/time -f "%E" ip netns add bar
> echo -n "del netns bar ===> "; /usr/bin/time -f "%E" ip netns delete bar
>
> with an output like:
> add netns foo ===> 0:00.00
> del netns foo ===> 0:00.01
> add netns bar ===> 0:08.53
> del netns bar ===> 0:00.01
>
> This is on a 4.9-rc4 kernel from the above URL configured with
> NR_CPUS=256 running in a Hyper-V VM (kernel config attached).
>
> Below is a dump of the work queues while the second 'ip add netns' is
> hanging. The state of the work queues does not seem to change while
> the command is delayed and the pattern shown is consistent across
> different kernel versions.
>
> Is this a known issue and/or is someone working on a fix?

Not to me.


>
> [ 610.356272] sysrq: SysRq : Show Blocked State
> [ 610.356742] task PC stack pid father
> [ 610.357252] kworker/u480:1 D 0 1994 2 0x00000000
> [ 610.357752] Workqueue: netns cleanup_net
> [ 610.358239] ffff9892f1065800 0000000000000000 ffff9892ee1e1e00
> ffff9892f8e59340
> [ 610.358705] ffff9892f4526900 ffffbf0104b5ba88 ffffffffbe486df3
> ffffbf0104b5ba60
> [ 610.359168] 00ffffffbdcbe663 ffff9892f8e59340 0000000100012e70
> ffff9892ee1e1e00
> [ 610.359677] Call Trace:
> [ 610.360169] [<ffffffffbe486df3>] ? __schedule+0x233/0x6e0
> [ 610.360723] [<ffffffffbe4872d6>] schedule+0x36/0x80
> [ 610.361194] [<ffffffffbe48a9ca>] schedule_timeout+0x22a/0x3f0
> [ 610.361789] [<ffffffffbe486dfb>] ? __schedule+0x23b/0x6e0
> [ 610.362260] [<ffffffffbe487d24>] wait_for_completion+0xb4/0x140
> [ 610.362736] [<ffffffffbdcb05a0>] ? wake_up_q+0x80/0x80
> [ 610.363306] [<ffffffffbdceb528>] __wait_rcu_gp+0xc8/0xf0
> [ 610.363782] [<ffffffffbdceea5c>] synchronize_sched+0x5c/0x80
> [ 610.364137] [<ffffffffbdcf0010>] ? call_rcu_bh+0x20/0x20
> [ 610.364742] [<ffffffffbdceb440>] ?
> trace_raw_output_rcu_utilization+0x60/0x60
> [ 610.365337] [<ffffffffbe3696bc>] synchronize_net+0x1c/0x30

This is a worker which holds the net_mutex and is waiting for
an RCU grace period to elapse.


> [ 610.365846] [<ffffffffbe369803>] netif_napi_del+0x23/0x80
> [ 610.367494] [<ffffffffc057f6f8>] ip_tunnel_dev_free+0x68/0xf0 [ip_tunnel]
> [ 610.368007] [<ffffffffbe372c10>] netdev_run_todo+0x230/0x330
> [ 610.368454] [<ffffffffbe37eb4e>] rtnl_unlock+0xe/0x10
> [ 610.369001] [<ffffffffc057f4df>] ip_tunnel_delete_net+0xdf/0x120 [ip_tunnel]
> [ 610.369500] [<ffffffffc058b92c>] ipip_exit_net+0x2c/0x30 [ipip]
> [ 610.369997] [<ffffffffbe362688>] ops_exit_list.isra.4+0x38/0x60
> [ 610.370636] [<ffffffffbe363674>] cleanup_net+0x1c4/0x2b0
> [ 610.371130] [<ffffffffbdc9e4ac>] process_one_work+0x1fc/0x4b0
> [ 610.371812] [<ffffffffbdc9e7ab>] worker_thread+0x4b/0x500
> [ 610.373074] [<ffffffffbdc9e760>] ? process_one_work+0x4b0/0x4b0
> [ 610.373622] [<ffffffffbdc9e760>] ? process_one_work+0x4b0/0x4b0
> [ 610.374100] [<ffffffffbdca4b09>] kthread+0xd9/0xf0
> [ 610.374574] [<ffffffffbdca4a30>] ? kthread_park+0x60/0x60
> [ 610.375198] [<ffffffffbe48c2b5>] ret_from_fork+0x25/0x30
> [ 610.375678] ip D 0 2149 2148 0x00000000
> [ 610.376185] ffff9892f0a99000 0000000000000000 ffff9892f0a66900
> [ 610.376185] ffff9892f8e59340
> [ 610.376185] ffff9892f4526900 ffffbf0101173db8 ffffffffbe486df3
> [ 610.376753] 00000005fecffd76
> [ 610.376762] 00ff9892f11d9820 ffff9892f8e59340 ffff989200000000
> ffff9892f0a66900
> [ 610.377274] Call Trace:
> [ 610.377789] [<ffffffffbe486df3>] ? __schedule+0x233/0x6e0
> [ 610.378306] [<ffffffffbe4872d6>] schedule+0x36/0x80
> [ 610.378992] [<ffffffffbe48756e>] schedule_preempt_disabled+0xe/0x10
> [ 610.379514] [<ffffffffbe489199>] __mutex_lock_slowpath+0xb9/0x130
> [ 610.380031] [<ffffffffbde0fce2>] ? __kmalloc+0x162/0x1e0
> [ 610.380556] [<ffffffffbe48922f>] mutex_lock+0x1f/0x30
> [ 610.381135] [<ffffffffbe3637ff>] copy_net_ns+0x9f/0x170
> [ 610.381647] [<ffffffffbdca5e6b>] create_new_namespaces+0x11b/0x200
> [ 610.382249] [<ffffffffbdca60fa>] unshare_nsproxy_namespaces+0x5a/0xb0
> [ 610.382818] [<ffffffffbdc82dcd>] SyS_unshare+0x1cd/0x360
> [ 610.383319] [<ffffffffbe48c03b>] entry_SYSCALL_64_fastpath+0x1e/0xad

This process is apparently waiting for the net_mutex held by the previous one.

Either the RCU implementation is broken or something else is missing.
Do you have more stack traces of related processes, for example
rcu_tasks_kthread? Anything you can do to help narrow down the problem
would be great.
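
Something like the following, run from another shell while the second
'ip netns add' is stalled, should capture the relevant state (assuming
sysrq is enabled; adjust to taste):

echo w > /proc/sysrq-trigger          # blocked tasks, as in your dump
echo t > /proc/sysrq-trigger          # all tasks, including the RCU kthreads
dmesg | tail -n 300
ps -eo pid,stat,comm | grep -i rcu    # rcu_sched, rcu_bh, rcu_tasks_kthread, ...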

Thanks.

2016-11-10 21:24:31

by Paul E. McKenney

Subject: Re: Long delays creating a netns after deleting one (possibly RCU related)

On Thu, Nov 10, 2016 at 09:37:47AM -0800, Cong Wang wrote:
> (Cc'ing Paul)
>
> On Wed, Nov 9, 2016 at 7:42 AM, Rolf Neugebauer
> <[email protected]> wrote:
> > Hi
> >
> > We noticed some long delays starting docker containers on some newer
> > kernels (starting with 4.5.x and still present in 4.9-rc4, 4.4.x is
> > fine). We narrowed this down to the creation of a network namespace
> > being delayed directly after removing another one (details and
> > reproduction below). We have seen delays of up to 60s on some systems.
> >
> > - The delay is proportional to the number of CPUs (online or offline).
> > We first discovered it with a Hyper-V Linux VM. Hyper-V advertises up
> > to 240 offline vCPUs even if one configures the VM with only, say 2
> > vCPUs. We see linear increase in delay when we change NR_CPUS in the
> > kernel config.
> >
> > - The delay is also dependent on some tunnel network interfaces being
> > present (which we had compiled in in one of our kernel configs).
> >
> > - We can reproduce this issue with stock kernels from
> > http://kernel.ubuntu.com/~kernel-ppa/mainline/running in Hyper-V VMs
> > as well as other hypervisors like qemu and hyperkit where we have good
> > control over the number of CPUs.
> >
> > A simple test is:
> > modprobe ipip
> > moprobe ip_gre
> > modprobe ip_vti
> > echo -n "add netns foo ===> "; /usr/bin/time -f "%E" ip netns add foo
> > echo -n "del netns foo ===> "; /usr/bin/time -f "%E" ip netns delete foo
> > echo -n "add netns bar ===> "; /usr/bin/time -f "%E" ip netns add bar
> > echo -n "del netns bar ===> "; /usr/bin/time -f "%E" ip netns delete bar
> >
> > with an output like:
> > add netns foo ===> 0:00.00
> > del netns foo ===> 0:00.01
> > add netns bar ===> 0:08.53
> > del netns bar ===> 0:00.01
> >
> > This is on a 4.9-rc4 kernel from the above URL configured with
> > NR_CPUS=256 running in a Hyper-V VM (kernel config attached).
> >
> > Below is a dump of the work queues while the second 'ip add netns' is
> > hanging. The state of the work queues does not seem to change while
> > the command is delayed and the pattern shown is consistent across
> > different kernel versions.
> >
> > Is this a known issue and/or is someone working on a fix?
>
> Not to me.
>
>
> >
> > [ 610.356272] sysrq: SysRq : Show Blocked State
> > [ 610.356742] task PC stack pid father
> > [ 610.357252] kworker/u480:1 D 0 1994 2 0x00000000
> > [ 610.357752] Workqueue: netns cleanup_net
> > [ 610.358239] ffff9892f1065800 0000000000000000 ffff9892ee1e1e00
> > ffff9892f8e59340
> > [ 610.358705] ffff9892f4526900 ffffbf0104b5ba88 ffffffffbe486df3
> > ffffbf0104b5ba60
> > [ 610.359168] 00ffffffbdcbe663 ffff9892f8e59340 0000000100012e70
> > ffff9892ee1e1e00
> > [ 610.359677] Call Trace:
> > [ 610.360169] [<ffffffffbe486df3>] ? __schedule+0x233/0x6e0
> > [ 610.360723] [<ffffffffbe4872d6>] schedule+0x36/0x80
> > [ 610.361194] [<ffffffffbe48a9ca>] schedule_timeout+0x22a/0x3f0
> > [ 610.361789] [<ffffffffbe486dfb>] ? __schedule+0x23b/0x6e0
> > [ 610.362260] [<ffffffffbe487d24>] wait_for_completion+0xb4/0x140
> > [ 610.362736] [<ffffffffbdcb05a0>] ? wake_up_q+0x80/0x80
> > [ 610.363306] [<ffffffffbdceb528>] __wait_rcu_gp+0xc8/0xf0
> > [ 610.363782] [<ffffffffbdceea5c>] synchronize_sched+0x5c/0x80
> > [ 610.364137] [<ffffffffbdcf0010>] ? call_rcu_bh+0x20/0x20
> > [ 610.364742] [<ffffffffbdceb440>] ?
> > trace_raw_output_rcu_utilization+0x60/0x60
> > [ 610.365337] [<ffffffffbe3696bc>] synchronize_net+0x1c/0x30
>
> This is a worker which holds the net_mutex and is waiting for
> a RCU grace period to elapse.
>
>
> > [ 610.365846] [<ffffffffbe369803>] netif_napi_del+0x23/0x80
> > [ 610.367494] [<ffffffffc057f6f8>] ip_tunnel_dev_free+0x68/0xf0 [ip_tunnel]
> > [ 610.368007] [<ffffffffbe372c10>] netdev_run_todo+0x230/0x330
> > [ 610.368454] [<ffffffffbe37eb4e>] rtnl_unlock+0xe/0x10
> > [ 610.369001] [<ffffffffc057f4df>] ip_tunnel_delete_net+0xdf/0x120 [ip_tunnel]
> > [ 610.369500] [<ffffffffc058b92c>] ipip_exit_net+0x2c/0x30 [ipip]
> > [ 610.369997] [<ffffffffbe362688>] ops_exit_list.isra.4+0x38/0x60
> > [ 610.370636] [<ffffffffbe363674>] cleanup_net+0x1c4/0x2b0
> > [ 610.371130] [<ffffffffbdc9e4ac>] process_one_work+0x1fc/0x4b0
> > [ 610.371812] [<ffffffffbdc9e7ab>] worker_thread+0x4b/0x500
> > [ 610.373074] [<ffffffffbdc9e760>] ? process_one_work+0x4b0/0x4b0
> > [ 610.373622] [<ffffffffbdc9e760>] ? process_one_work+0x4b0/0x4b0
> > [ 610.374100] [<ffffffffbdca4b09>] kthread+0xd9/0xf0
> > [ 610.374574] [<ffffffffbdca4a30>] ? kthread_park+0x60/0x60
> > [ 610.375198] [<ffffffffbe48c2b5>] ret_from_fork+0x25/0x30
> > [ 610.375678] ip D 0 2149 2148 0x00000000
> > [ 610.376185] ffff9892f0a99000 0000000000000000 ffff9892f0a66900
> > [ 610.376185] ffff9892f8e59340
> > [ 610.376185] ffff9892f4526900 ffffbf0101173db8 ffffffffbe486df3
> > [ 610.376753] 00000005fecffd76
> > [ 610.376762] 00ff9892f11d9820 ffff9892f8e59340 ffff989200000000
> > ffff9892f0a66900
> > [ 610.377274] Call Trace:
> > [ 610.377789] [<ffffffffbe486df3>] ? __schedule+0x233/0x6e0
> > [ 610.378306] [<ffffffffbe4872d6>] schedule+0x36/0x80
> > [ 610.378992] [<ffffffffbe48756e>] schedule_preempt_disabled+0xe/0x10
> > [ 610.379514] [<ffffffffbe489199>] __mutex_lock_slowpath+0xb9/0x130
> > [ 610.380031] [<ffffffffbde0fce2>] ? __kmalloc+0x162/0x1e0
> > [ 610.380556] [<ffffffffbe48922f>] mutex_lock+0x1f/0x30
> > [ 610.381135] [<ffffffffbe3637ff>] copy_net_ns+0x9f/0x170
> > [ 610.381647] [<ffffffffbdca5e6b>] create_new_namespaces+0x11b/0x200
> > [ 610.382249] [<ffffffffbdca60fa>] unshare_nsproxy_namespaces+0x5a/0xb0
> > [ 610.382818] [<ffffffffbdc82dcd>] SyS_unshare+0x1cd/0x360
> > [ 610.383319] [<ffffffffbe48c03b>] entry_SYSCALL_64_fastpath+0x1e/0xad
>
> This process is apparently waiting for the net_mutex held by the previous one.
>
> Either RCU implementation is broken or something else is missing.
> Do you have more stack traces of related processes? For example,
> rcu_tasks_kthread. And if anything you can help to narrow down the problem,
> it would be great.

Did you set the rcu_normal boot parameter? Doing so would have this effect.

(It is intended for real-time users who don't like expedited grace periods.)
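
A quick way to check what the running kernel is actually doing:

tr ' ' '\n' < /proc/cmdline | grep -i rcu
cat /sys/kernel/rcu_normal /sys/kernel/rcu_expedited

Both sysfs files read 0 when neither behavior is being forced.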

Thanx, Paul

2016-11-11 13:11:07

by Rolf Neugebauer

Subject: Re: Long delays creating a netns after deleting one (possibly RCU related)

On Thu, Nov 10, 2016 at 9:24 PM, Paul E. McKenney
<[email protected]> wrote:
> On Thu, Nov 10, 2016 at 09:37:47AM -0800, Cong Wang wrote:
>> (Cc'ing Paul)
>>
>> On Wed, Nov 9, 2016 at 7:42 AM, Rolf Neugebauer
>> <[email protected]> wrote:
>> > Hi
>> >
>> > We noticed some long delays starting docker containers on some newer
>> > kernels (starting with 4.5.x and still present in 4.9-rc4, 4.4.x is
>> > fine). We narrowed this down to the creation of a network namespace
>> > being delayed directly after removing another one (details and
>> > reproduction below). We have seen delays of up to 60s on some systems.
>> >
>> > - The delay is proportional to the number of CPUs (online or offline).
>> > We first discovered it with a Hyper-V Linux VM. Hyper-V advertises up
>> > to 240 offline vCPUs even if one configures the VM with only, say 2
>> > vCPUs. We see linear increase in delay when we change NR_CPUS in the
>> > kernel config.
>> >
>> > - The delay is also dependent on some tunnel network interfaces being
>> > present (which we had compiled in in one of our kernel configs).
>> >
>> > - We can reproduce this issue with stock kernels from
>> > http://kernel.ubuntu.com/~kernel-ppa/mainline/running in Hyper-V VMs
>> > as well as other hypervisors like qemu and hyperkit where we have good
>> > control over the number of CPUs.
>> >
>> > A simple test is:
>> > modprobe ipip
>> > moprobe ip_gre
>> > modprobe ip_vti
>> > echo -n "add netns foo ===> "; /usr/bin/time -f "%E" ip netns add foo
>> > echo -n "del netns foo ===> "; /usr/bin/time -f "%E" ip netns delete foo
>> > echo -n "add netns bar ===> "; /usr/bin/time -f "%E" ip netns add bar
>> > echo -n "del netns bar ===> "; /usr/bin/time -f "%E" ip netns delete bar
>> >
>> > with an output like:
>> > add netns foo ===> 0:00.00
>> > del netns foo ===> 0:00.01
>> > add netns bar ===> 0:08.53
>> > del netns bar ===> 0:00.01
>> >
>> > This is on a 4.9-rc4 kernel from the above URL configured with
>> > NR_CPUS=256 running in a Hyper-V VM (kernel config attached).
>> >
>> > Below is a dump of the work queues while the second 'ip add netns' is
>> > hanging. The state of the work queues does not seem to change while
>> > the command is delayed and the pattern shown is consistent across
>> > different kernel versions.
>> >
>> > Is this a known issue and/or is someone working on a fix?
>>
>> Not to me.
>>
>>
>> >
>> > [ 610.356272] sysrq: SysRq : Show Blocked State
>> > [ 610.356742] task PC stack pid father
>> > [ 610.357252] kworker/u480:1 D 0 1994 2 0x00000000
>> > [ 610.357752] Workqueue: netns cleanup_net
>> > [ 610.358239] ffff9892f1065800 0000000000000000 ffff9892ee1e1e00
>> > ffff9892f8e59340
>> > [ 610.358705] ffff9892f4526900 ffffbf0104b5ba88 ffffffffbe486df3
>> > ffffbf0104b5ba60
>> > [ 610.359168] 00ffffffbdcbe663 ffff9892f8e59340 0000000100012e70
>> > ffff9892ee1e1e00
>> > [ 610.359677] Call Trace:
>> > [ 610.360169] [<ffffffffbe486df3>] ? __schedule+0x233/0x6e0
>> > [ 610.360723] [<ffffffffbe4872d6>] schedule+0x36/0x80
>> > [ 610.361194] [<ffffffffbe48a9ca>] schedule_timeout+0x22a/0x3f0
>> > [ 610.361789] [<ffffffffbe486dfb>] ? __schedule+0x23b/0x6e0
>> > [ 610.362260] [<ffffffffbe487d24>] wait_for_completion+0xb4/0x140
>> > [ 610.362736] [<ffffffffbdcb05a0>] ? wake_up_q+0x80/0x80
>> > [ 610.363306] [<ffffffffbdceb528>] __wait_rcu_gp+0xc8/0xf0
>> > [ 610.363782] [<ffffffffbdceea5c>] synchronize_sched+0x5c/0x80
>> > [ 610.364137] [<ffffffffbdcf0010>] ? call_rcu_bh+0x20/0x20
>> > [ 610.364742] [<ffffffffbdceb440>] ?
>> > trace_raw_output_rcu_utilization+0x60/0x60
>> > [ 610.365337] [<ffffffffbe3696bc>] synchronize_net+0x1c/0x30
>>
>> This is a worker which holds the net_mutex and is waiting for
>> a RCU grace period to elapse.
>>
>>
>> > [ 610.365846] [<ffffffffbe369803>] netif_napi_del+0x23/0x80
>> > [ 610.367494] [<ffffffffc057f6f8>] ip_tunnel_dev_free+0x68/0xf0 [ip_tunnel]
>> > [ 610.368007] [<ffffffffbe372c10>] netdev_run_todo+0x230/0x330
>> > [ 610.368454] [<ffffffffbe37eb4e>] rtnl_unlock+0xe/0x10
>> > [ 610.369001] [<ffffffffc057f4df>] ip_tunnel_delete_net+0xdf/0x120 [ip_tunnel]
>> > [ 610.369500] [<ffffffffc058b92c>] ipip_exit_net+0x2c/0x30 [ipip]
>> > [ 610.369997] [<ffffffffbe362688>] ops_exit_list.isra.4+0x38/0x60
>> > [ 610.370636] [<ffffffffbe363674>] cleanup_net+0x1c4/0x2b0
>> > [ 610.371130] [<ffffffffbdc9e4ac>] process_one_work+0x1fc/0x4b0
>> > [ 610.371812] [<ffffffffbdc9e7ab>] worker_thread+0x4b/0x500
>> > [ 610.373074] [<ffffffffbdc9e760>] ? process_one_work+0x4b0/0x4b0
>> > [ 610.373622] [<ffffffffbdc9e760>] ? process_one_work+0x4b0/0x4b0
>> > [ 610.374100] [<ffffffffbdca4b09>] kthread+0xd9/0xf0
>> > [ 610.374574] [<ffffffffbdca4a30>] ? kthread_park+0x60/0x60
>> > [ 610.375198] [<ffffffffbe48c2b5>] ret_from_fork+0x25/0x30
>> > [ 610.375678] ip D 0 2149 2148 0x00000000
>> > [ 610.376185] ffff9892f0a99000 0000000000000000 ffff9892f0a66900
>> > [ 610.376185] ffff9892f8e59340
>> > [ 610.376185] ffff9892f4526900 ffffbf0101173db8 ffffffffbe486df3
>> > [ 610.376753] 00000005fecffd76
>> > [ 610.376762] 00ff9892f11d9820 ffff9892f8e59340 ffff989200000000
>> > ffff9892f0a66900
>> > [ 610.377274] Call Trace:
>> > [ 610.377789] [<ffffffffbe486df3>] ? __schedule+0x233/0x6e0
>> > [ 610.378306] [<ffffffffbe4872d6>] schedule+0x36/0x80
>> > [ 610.378992] [<ffffffffbe48756e>] schedule_preempt_disabled+0xe/0x10
>> > [ 610.379514] [<ffffffffbe489199>] __mutex_lock_slowpath+0xb9/0x130
>> > [ 610.380031] [<ffffffffbde0fce2>] ? __kmalloc+0x162/0x1e0
>> > [ 610.380556] [<ffffffffbe48922f>] mutex_lock+0x1f/0x30
>> > [ 610.381135] [<ffffffffbe3637ff>] copy_net_ns+0x9f/0x170
>> > [ 610.381647] [<ffffffffbdca5e6b>] create_new_namespaces+0x11b/0x200
>> > [ 610.382249] [<ffffffffbdca60fa>] unshare_nsproxy_namespaces+0x5a/0xb0
>> > [ 610.382818] [<ffffffffbdc82dcd>] SyS_unshare+0x1cd/0x360
>> > [ 610.383319] [<ffffffffbe48c03b>] entry_SYSCALL_64_fastpath+0x1e/0xad
>>
>> This process is apparently waiting for the net_mutex held by the previous one.
>>
>> Either RCU implementation is broken or something else is missing.
>> Do you have more stack traces of related processes? For example,
>> rcu_tasks_kthread. And if anything you can help to narrow down the problem,
>> it would be great.
>
> Did you set the rcu_normal boot parameter? Doing so would have this effect.
>
> (It is intended for real-time users who don't like expedited grace periods.)

rcu_normal is not set on the kernel command line and
/sys/kernel/rcu_normal and /sys/kernel/rcu_expedited both show 0.


>
> Thanx, Paul
>

2016-11-12 00:24:08

by Paul E. McKenney

Subject: Re: Long delays creating a netns after deleting one (possibly RCU related)

On Fri, Nov 11, 2016 at 01:11:01PM +0000, Rolf Neugebauer wrote:
> On Thu, Nov 10, 2016 at 9:24 PM, Paul E. McKenney
> <[email protected]> wrote:
> > On Thu, Nov 10, 2016 at 09:37:47AM -0800, Cong Wang wrote:
> >> (Cc'ing Paul)
> >>
> >> On Wed, Nov 9, 2016 at 7:42 AM, Rolf Neugebauer
> >> <[email protected]> wrote:
> >> > Hi
> >> >
> >> > We noticed some long delays starting docker containers on some newer
> >> > kernels (starting with 4.5.x and still present in 4.9-rc4, 4.4.x is
> >> > fine). We narrowed this down to the creation of a network namespace
> >> > being delayed directly after removing another one (details and
> >> > reproduction below). We have seen delays of up to 60s on some systems.
> >> >
> >> > - The delay is proportional to the number of CPUs (online or offline).
> >> > We first discovered it with a Hyper-V Linux VM. Hyper-V advertises up
> >> > to 240 offline vCPUs even if one configures the VM with only, say 2
> >> > vCPUs. We see linear increase in delay when we change NR_CPUS in the
> >> > kernel config.
> >> >
> >> > - The delay is also dependent on some tunnel network interfaces being
> >> > present (which we had compiled in in one of our kernel configs).
> >> >
> >> > - We can reproduce this issue with stock kernels from
> >> > http://kernel.ubuntu.com/~kernel-ppa/mainline/running in Hyper-V VMs
> >> > as well as other hypervisors like qemu and hyperkit where we have good
> >> > control over the number of CPUs.
> >> >
> >> > A simple test is:
> >> > modprobe ipip
> >> > moprobe ip_gre
> >> > modprobe ip_vti
> >> > echo -n "add netns foo ===> "; /usr/bin/time -f "%E" ip netns add foo
> >> > echo -n "del netns foo ===> "; /usr/bin/time -f "%E" ip netns delete foo
> >> > echo -n "add netns bar ===> "; /usr/bin/time -f "%E" ip netns add bar
> >> > echo -n "del netns bar ===> "; /usr/bin/time -f "%E" ip netns delete bar
> >> >
> >> > with an output like:
> >> > add netns foo ===> 0:00.00
> >> > del netns foo ===> 0:00.01
> >> > add netns bar ===> 0:08.53
> >> > del netns bar ===> 0:00.01
> >> >
> >> > This is on a 4.9-rc4 kernel from the above URL configured with
> >> > NR_CPUS=256 running in a Hyper-V VM (kernel config attached).
> >> >
> >> > Below is a dump of the work queues while the second 'ip add netns' is
> >> > hanging. The state of the work queues does not seem to change while
> >> > the command is delayed and the pattern shown is consistent across
> >> > different kernel versions.
> >> >
> >> > Is this a known issue and/or is someone working on a fix?
> >>
> >> Not to me.
> >>
> >>
> >> >
> >> > [ 610.356272] sysrq: SysRq : Show Blocked State
> >> > [ 610.356742] task PC stack pid father
> >> > [ 610.357252] kworker/u480:1 D 0 1994 2 0x00000000
> >> > [ 610.357752] Workqueue: netns cleanup_net
> >> > [ 610.358239] ffff9892f1065800 0000000000000000 ffff9892ee1e1e00
> >> > ffff9892f8e59340
> >> > [ 610.358705] ffff9892f4526900 ffffbf0104b5ba88 ffffffffbe486df3
> >> > ffffbf0104b5ba60
> >> > [ 610.359168] 00ffffffbdcbe663 ffff9892f8e59340 0000000100012e70
> >> > ffff9892ee1e1e00
> >> > [ 610.359677] Call Trace:
> >> > [ 610.360169] [<ffffffffbe486df3>] ? __schedule+0x233/0x6e0
> >> > [ 610.360723] [<ffffffffbe4872d6>] schedule+0x36/0x80
> >> > [ 610.361194] [<ffffffffbe48a9ca>] schedule_timeout+0x22a/0x3f0
> >> > [ 610.361789] [<ffffffffbe486dfb>] ? __schedule+0x23b/0x6e0
> >> > [ 610.362260] [<ffffffffbe487d24>] wait_for_completion+0xb4/0x140
> >> > [ 610.362736] [<ffffffffbdcb05a0>] ? wake_up_q+0x80/0x80
> >> > [ 610.363306] [<ffffffffbdceb528>] __wait_rcu_gp+0xc8/0xf0
> >> > [ 610.363782] [<ffffffffbdceea5c>] synchronize_sched+0x5c/0x80
> >> > [ 610.364137] [<ffffffffbdcf0010>] ? call_rcu_bh+0x20/0x20
> >> > [ 610.364742] [<ffffffffbdceb440>] ?
> >> > trace_raw_output_rcu_utilization+0x60/0x60
> >> > [ 610.365337] [<ffffffffbe3696bc>] synchronize_net+0x1c/0x30
> >>
> >> This is a worker which holds the net_mutex and is waiting for
> >> a RCU grace period to elapse.

Ah! This net_mutex is different than RTNL. Should synchronize_net() be
modified to check for net_mutex being held in addition to the current
checks for RTNL being held?

Thanx, Paul

> >> > [ 610.365846] [<ffffffffbe369803>] netif_napi_del+0x23/0x80
> >> > [ 610.367494] [<ffffffffc057f6f8>] ip_tunnel_dev_free+0x68/0xf0 [ip_tunnel]
> >> > [ 610.368007] [<ffffffffbe372c10>] netdev_run_todo+0x230/0x330
> >> > [ 610.368454] [<ffffffffbe37eb4e>] rtnl_unlock+0xe/0x10
> >> > [ 610.369001] [<ffffffffc057f4df>] ip_tunnel_delete_net+0xdf/0x120 [ip_tunnel]
> >> > [ 610.369500] [<ffffffffc058b92c>] ipip_exit_net+0x2c/0x30 [ipip]
> >> > [ 610.369997] [<ffffffffbe362688>] ops_exit_list.isra.4+0x38/0x60
> >> > [ 610.370636] [<ffffffffbe363674>] cleanup_net+0x1c4/0x2b0
> >> > [ 610.371130] [<ffffffffbdc9e4ac>] process_one_work+0x1fc/0x4b0
> >> > [ 610.371812] [<ffffffffbdc9e7ab>] worker_thread+0x4b/0x500
> >> > [ 610.373074] [<ffffffffbdc9e760>] ? process_one_work+0x4b0/0x4b0
> >> > [ 610.373622] [<ffffffffbdc9e760>] ? process_one_work+0x4b0/0x4b0
> >> > [ 610.374100] [<ffffffffbdca4b09>] kthread+0xd9/0xf0
> >> > [ 610.374574] [<ffffffffbdca4a30>] ? kthread_park+0x60/0x60
> >> > [ 610.375198] [<ffffffffbe48c2b5>] ret_from_fork+0x25/0x30
> >> > [ 610.375678] ip D 0 2149 2148 0x00000000
> >> > [ 610.376185] ffff9892f0a99000 0000000000000000 ffff9892f0a66900
> >> > [ 610.376185] ffff9892f8e59340
> >> > [ 610.376185] ffff9892f4526900 ffffbf0101173db8 ffffffffbe486df3
> >> > [ 610.376753] 00000005fecffd76
> >> > [ 610.376762] 00ff9892f11d9820 ffff9892f8e59340 ffff989200000000
> >> > ffff9892f0a66900
> >> > [ 610.377274] Call Trace:
> >> > [ 610.377789] [<ffffffffbe486df3>] ? __schedule+0x233/0x6e0
> >> > [ 610.378306] [<ffffffffbe4872d6>] schedule+0x36/0x80
> >> > [ 610.378992] [<ffffffffbe48756e>] schedule_preempt_disabled+0xe/0x10
> >> > [ 610.379514] [<ffffffffbe489199>] __mutex_lock_slowpath+0xb9/0x130
> >> > [ 610.380031] [<ffffffffbde0fce2>] ? __kmalloc+0x162/0x1e0
> >> > [ 610.380556] [<ffffffffbe48922f>] mutex_lock+0x1f/0x30
> >> > [ 610.381135] [<ffffffffbe3637ff>] copy_net_ns+0x9f/0x170
> >> > [ 610.381647] [<ffffffffbdca5e6b>] create_new_namespaces+0x11b/0x200
> >> > [ 610.382249] [<ffffffffbdca60fa>] unshare_nsproxy_namespaces+0x5a/0xb0
> >> > [ 610.382818] [<ffffffffbdc82dcd>] SyS_unshare+0x1cd/0x360
> >> > [ 610.383319] [<ffffffffbe48c03b>] entry_SYSCALL_64_fastpath+0x1e/0xad
> >>
> >> This process is apparently waiting for the net_mutex held by the previous one.
> >>
> >> Either RCU implementation is broken or something else is missing.
> >> Do you have more stack traces of related processes? For example,
> >> rcu_tasks_kthread. And if anything you can help to narrow down the problem,
> >> it would be great.
> >
> > Did you set the rcu_normal boot parameter? Doing so would have this effect.
> >
> > (It is intended for real-time users who don't like expedited grace periods.)
>
> rcu_normal is not set on the kernel command line and
> /sys/kernel/rcu_normal and /sys/kernel/rcu_expedited both show 0.
>
>
> >
> > Thanx, Paul
> >
>

2016-11-12 00:55:34

by Cong Wang

Subject: Re: Long delays creating a netns after deleting one (possibly RCU related)

On Fri, Nov 11, 2016 at 4:23 PM, Paul E. McKenney
<[email protected]> wrote:
>
> Ah! This net_mutex is different than RTNL. Should synchronize_net() be
> modified to check for net_mutex being held in addition to the current
> checks for RTNL being held?
>

Good point!

As in commit be3fc413da9eb17cce0991f214ab0, checking
for net_mutex in this case seems to be just an optimization; I assume
synchronize_rcu_expedited() and synchronize_rcu() have the same
behavior...

diff --git a/net/core/dev.c b/net/core/dev.c
index eaad4c2..3415b6b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7762,7 +7762,7 @@ EXPORT_SYMBOL(free_netdev);
 void synchronize_net(void)
 {
 	might_sleep();
-	if (rtnl_is_locked())
+	if (rtnl_is_locked() || lockdep_is_held(&net_mutex))
 		synchronize_rcu_expedited();
 	else
 		synchronize_rcu();

2016-11-14 06:47:26

by Cong Wang

Subject: Re: Long delays creating a netns after deleting one (possibly RCU related)

On Fri, Nov 11, 2016 at 4:55 PM, Cong Wang <[email protected]> wrote:
> On Fri, Nov 11, 2016 at 4:23 PM, Paul E. McKenney
> <[email protected]> wrote:
>>
>> Ah! This net_mutex is different than RTNL. Should synchronize_net() be
>> modified to check for net_mutex being held in addition to the current
>> checks for RTNL being held?
>>
>
> Good point!
>
> Like commit be3fc413da9eb17cce0991f214ab0, checking
> for net_mutex for this case seems to be an optimization, I assume
> synchronize_rcu_expedited() and synchronize_rcu() have the same
> behavior...

Thinking a bit more, I think commit be3fc413da9eb17cce0991f
gets rtnl_is_locked() wrong: the lock could be held by another
process, not by the current one, so it should really be
lockdep_rtnl_is_held() which, however, is defined only when LOCKDEP
is enabled... Sigh.

I don't see any better way than letting callers decide whether they want the
expedited version or not, but that requires changing all callers of
synchronize_net(). Hm.
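
Just to illustrate (purely hypothetical, not something I'm proposing as-is),
"letting callers decide" would mean something along these lines:

/* hypothetical helper; callers that know they hold RTNL or net_mutex
 * would pass true, everyone else false
 */
void __synchronize_net(bool expedite)
{
	might_sleep();
	if (expedite)
		synchronize_rcu_expedited();
	else
		synchronize_rcu();
}

and then every existing caller of synchronize_net() would need to be
audited and converted, which is the unattractive part.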

2016-11-14 16:24:28

by Paul E. McKenney

Subject: Re: Long delays creating a netns after deleting one (possibly RCU related)

On Sun, Nov 13, 2016 at 10:47:01PM -0800, Cong Wang wrote:
> On Fri, Nov 11, 2016 at 4:55 PM, Cong Wang <[email protected]> wrote:
> > On Fri, Nov 11, 2016 at 4:23 PM, Paul E. McKenney
> > <[email protected]> wrote:
> >>
> >> Ah! This net_mutex is different than RTNL. Should synchronize_net() be
> >> modified to check for net_mutex being held in addition to the current
> >> checks for RTNL being held?
> >>
> >
> > Good point!
> >
> > Like commit be3fc413da9eb17cce0991f214ab0, checking
> > for net_mutex for this case seems to be an optimization, I assume
> > synchronize_rcu_expedited() and synchronize_rcu() have the same
> > behavior...
>
> Thinking a bit more, I think commit be3fc413da9eb17cce0991f
> gets wrong on rtnl_is_locked(), the lock could be locked by other
> process not by the current one, therefore it should be
> lockdep_rtnl_is_held() which, however, is defined only when LOCKDEP
> is enabled... Sigh.
>
> I don't see any better way than letting callers decide if they want the
> expedited version or not, but this requires changes of all callers of
> synchronize_net(). Hm.

I must confess that I don't understand how it would help to use an
expedited grace period when some other process is holding RTNL.
In contrast, I do well understand how it helps when the current process
is holding RTNL.

So what am I missing here?

Thanx, Paul

2016-11-14 17:29:51

by Hannes Frederic Sowa

Subject: Re: Long delays creating a netns after deleting one (possibly RCU related)

Hi Cong,

On Sat, Nov 12, 2016, at 01:55, Cong Wang wrote:
> On Fri, Nov 11, 2016 at 4:23 PM, Paul E. McKenney
> <[email protected]> wrote:
> >
> > Ah! This net_mutex is different than RTNL. Should synchronize_net() be
> > modified to check for net_mutex being held in addition to the current
> > checks for RTNL being held?
> >
>
> Good point!
>
> Like commit be3fc413da9eb17cce0991f214ab0, checking
> for net_mutex for this case seems to be an optimization, I assume
> synchronize_rcu_expedited() and synchronize_rcu() have the same
> behavior...
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index eaad4c2..3415b6b 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -7762,7 +7762,7 @@ EXPORT_SYMBOL(free_netdev);
> void synchronize_net(void)
> {
> might_sleep();
> - if (rtnl_is_locked())
> + if (rtnl_is_locked() || lockdep_is_held(&net_mutex))
> synchronize_rcu_expedited();

I don't think we should depend on lockdep for this check but rather use
mutex_is_locked() here (I think it would fail to build as written without
CONFIG_LOCKDEP).
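
I.e. something like this instead (sketch only, assuming net_mutex is
visible from net/core/dev.c; note it still only tells us that *somebody*
holds net_mutex, not necessarily the current task):

void synchronize_net(void)
{
	might_sleep();
	if (rtnl_is_locked() || mutex_is_locked(&net_mutex))
		synchronize_rcu_expedited();
	else
		synchronize_rcu();
}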

Bye,
Hannes

2016-11-14 17:45:00

by Cong Wang

Subject: Re: Long delays creating a netns after deleting one (possibly RCU related)

On Mon, Nov 14, 2016 at 8:24 AM, Paul E. McKenney
<[email protected]> wrote:
> On Sun, Nov 13, 2016 at 10:47:01PM -0800, Cong Wang wrote:
>> On Fri, Nov 11, 2016 at 4:55 PM, Cong Wang <[email protected]> wrote:
>> > On Fri, Nov 11, 2016 at 4:23 PM, Paul E. McKenney
>> > <[email protected]> wrote:
>> >>
>> >> Ah! This net_mutex is different than RTNL. Should synchronize_net() be
>> >> modified to check for net_mutex being held in addition to the current
>> >> checks for RTNL being held?
>> >>
>> >
>> > Good point!
>> >
>> > Like commit be3fc413da9eb17cce0991f214ab0, checking
>> > for net_mutex for this case seems to be an optimization, I assume
>> > synchronize_rcu_expedited() and synchronize_rcu() have the same
>> > behavior...
>>
>> Thinking a bit more, I think commit be3fc413da9eb17cce0991f
>> gets wrong on rtnl_is_locked(), the lock could be locked by other
>> process not by the current one, therefore it should be
>> lockdep_rtnl_is_held() which, however, is defined only when LOCKDEP
>> is enabled... Sigh.
>>
>> I don't see any better way than letting callers decide if they want the
>> expedited version or not, but this requires changes of all callers of
>> synchronize_net(). Hm.
>
> I must confess that I don't understand how it would help to use an
> expedited grace period when some other process is holding RTNL.
> In contrast, I do well understand how it helps when the current process
> is holding RTNL.

Yeah, this is exactly my point. And the same goes for ASSERT_RTNL(), which
checks rtnl_is_locked(): clearly we need to assert "it is held by the current
process" rather than "it is locked by whatever process".

But given that *_is_held() is only defined with LOCKDEP, we probably need
the mutex API to provide such a helper directly; mutex->owner is not always
defined either. :-/

2016-11-14 18:14:36

by Paul E. McKenney

Subject: Re: Long delays creating a netns after deleting one (possibly RCU related)

On Mon, Nov 14, 2016 at 09:44:35AM -0800, Cong Wang wrote:
> On Mon, Nov 14, 2016 at 8:24 AM, Paul E. McKenney
> <[email protected]> wrote:
> > On Sun, Nov 13, 2016 at 10:47:01PM -0800, Cong Wang wrote:
> >> On Fri, Nov 11, 2016 at 4:55 PM, Cong Wang <[email protected]> wrote:
> >> > On Fri, Nov 11, 2016 at 4:23 PM, Paul E. McKenney
> >> > <[email protected]> wrote:
> >> >>
> >> >> Ah! This net_mutex is different than RTNL. Should synchronize_net() be
> >> >> modified to check for net_mutex being held in addition to the current
> >> >> checks for RTNL being held?
> >> >>
> >> >
> >> > Good point!
> >> >
> >> > Like commit be3fc413da9eb17cce0991f214ab0, checking
> >> > for net_mutex for this case seems to be an optimization, I assume
> >> > synchronize_rcu_expedited() and synchronize_rcu() have the same
> >> > behavior...
> >>
> >> Thinking a bit more, I think commit be3fc413da9eb17cce0991f
> >> gets wrong on rtnl_is_locked(), the lock could be locked by other
> >> process not by the current one, therefore it should be
> >> lockdep_rtnl_is_held() which, however, is defined only when LOCKDEP
> >> is enabled... Sigh.
> >>
> >> I don't see any better way than letting callers decide if they want the
> >> expedited version or not, but this requires changes of all callers of
> >> synchronize_net(). Hm.
> >
> > I must confess that I don't understand how it would help to use an
> > expedited grace period when some other process is holding RTNL.
> > In contrast, I do well understand how it helps when the current process
> > is holding RTNL.
>
> Yeah, this is exactly my point. And same for ASSERT_RTNL() which checks
> rtnl_is_locked(), clearly we need to assert "it is held by the current process"
> rather than "it is locked by whatever process".
>
> But given *_is_held() is always defined by LOCKDEP, so we probably need
> mutex to provide such a helper directly, mutex->owner is not always defined
> either. :-/

There is always the option of making acquisition and release set a per-task
variable that can be tested. (Where did I put that asbestos suit, anyway?)
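
Roughly something like this, whether as a per-task flag or by simply
recording the owning task (names invented for illustration, and the real
rtnl_lock()/__rtnl_unlock() of course do more than this):

static struct task_struct *rtnl_owner;	/* hypothetical */

void rtnl_lock(void)
{
	mutex_lock(&rtnl_mutex);
	rtnl_owner = current;
}

void __rtnl_unlock(void)
{
	rtnl_owner = NULL;
	mutex_unlock(&rtnl_mutex);
}

static inline bool rtnl_held_by_current(void)
{
	/* what ASSERT_RTNL() and synchronize_net() really want to know */
	return rtnl_owner == current;
}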

Thanx, Paul

2016-11-14 22:15:34

by Eric W. Biederman

Subject: Re: Long delays creating a netns after deleting one (possibly RCU related)

"Paul E. McKenney" <[email protected]> writes:

> On Mon, Nov 14, 2016 at 09:44:35AM -0800, Cong Wang wrote:
>> On Mon, Nov 14, 2016 at 8:24 AM, Paul E. McKenney
>> <[email protected]> wrote:
>> > On Sun, Nov 13, 2016 at 10:47:01PM -0800, Cong Wang wrote:
>> >> On Fri, Nov 11, 2016 at 4:55 PM, Cong Wang <[email protected]> wrote:
>> >> > On Fri, Nov 11, 2016 at 4:23 PM, Paul E. McKenney
>> >> > <[email protected]> wrote:
>> >> >>
>> >> >> Ah! This net_mutex is different than RTNL. Should synchronize_net() be
>> >> >> modified to check for net_mutex being held in addition to the current
>> >> >> checks for RTNL being held?
>> >> >>
>> >> >
>> >> > Good point!
>> >> >
>> >> > Like commit be3fc413da9eb17cce0991f214ab0, checking
>> >> > for net_mutex for this case seems to be an optimization, I assume
>> >> > synchronize_rcu_expedited() and synchronize_rcu() have the same
>> >> > behavior...
>> >>
>> >> Thinking a bit more, I think commit be3fc413da9eb17cce0991f
>> >> gets wrong on rtnl_is_locked(), the lock could be locked by other
>> >> process not by the current one, therefore it should be
>> >> lockdep_rtnl_is_held() which, however, is defined only when LOCKDEP
>> >> is enabled... Sigh.
>> >>
>> >> I don't see any better way than letting callers decide if they want the
>> >> expedited version or not, but this requires changes of all callers of
>> >> synchronize_net(). Hm.
>> >
>> > I must confess that I don't understand how it would help to use an
>> > expedited grace period when some other process is holding RTNL.
>> > In contrast, I do well understand how it helps when the current process
>> > is holding RTNL.
>>
>> Yeah, this is exactly my point. And same for ASSERT_RTNL() which checks
>> rtnl_is_locked(), clearly we need to assert "it is held by the current process"
>> rather than "it is locked by whatever process".
>>
>> But given *_is_held() is always defined by LOCKDEP, so we probably need
>> mutex to provide such a helper directly, mutex->owner is not always defined
>> either. :-/
>
> There is always the option of making acquisition and release set a per-task
> variable that can be tested. (Where did I put that asbestos suit, anyway?)
>
> Thanx, Paul

synchronize_rcu_expedited is not enough if you have multiple network
devices in play.

Looking at the code it comes down to this commit, and it appears there
is a promise to add rcu grace period combining by Eric Dumazet.

Eric, since people are hitting noticeable stalls because of the rcu grace
period taking a long time, do you think you could look at this code path
a bit more?

commit 93d05d4a320cb16712bb3d57a9658f395d8cecb9
Author: Eric Dumazet <[email protected]>
Date: Wed Nov 18 06:31:03 2015 -0800

net: provide generic busy polling to all NAPI drivers

NAPI drivers no longer need to observe a particular protocol
to benefit from busy polling (CONFIG_NET_RX_BUSY_POLL=y)

napi_hash_add() and napi_hash_del() are automatically called
from core networking stack, respectively from
netif_napi_add() and netif_napi_del()

This patch depends on free_netdev() and netif_napi_del() being
called from process context, which seems to be the norm.

Drivers might still prefer to call napi_hash_del() on their
own, since they might combine all the rcu grace periods into
a single one, knowing their NAPI structures lifetime, while
core networking stack has no idea of a possible combining.

Once this patch proves to not bring serious regressions,
we will cleanup drivers to either remove napi_hash_del()
or provide appropriate rcu grace periods combining.

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
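
For reference, the combining the commit message alludes to would look
roughly like this on the driver side, if I read the 4.9 code right; the
priv->napi array and count are made up:

	/* unhash every NAPI instance first, without waiting, then pay for
	 * a single RCU grace period, then free them; netif_napi_del() does
	 * not need to wait again for an already-unhashed napi
	 */
	for (i = 0; i < priv->num_napi; i++)
		napi_hash_del(&priv->napi[i]);
	synchronize_net();
	for (i = 0; i < priv->num_napi; i++)
		netif_napi_del(&priv->napi[i]);

Nothing in the core does that batching, which is why each tunnel device
pays for its own grace period in the traces above.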

Eric

2016-11-14 22:46:15

by Eric Dumazet

Subject: Re: Long delays creating a netns after deleting one (possibly RCU related)

On Mon, 2016-11-14 at 16:12 -0600, Eric W. Biederman wrote:

> synchronize_rcu_expidited is not enough if you have multiple network
> devices in play.
>
> Looking at the code it comes down to this commit, and it appears there
> is a promise add rcu grace period combining by Eric Dumazet.
>
> Eric since people are hitting noticable stalls because of the rcu grace
> period taking a long time do you think you could look at this code path
> a bit more?
>
> commit 93d05d4a320cb16712bb3d57a9658f395d8cecb9
> Author: Eric Dumazet <[email protected]>
> Date: Wed Nov 18 06:31:03 2015 -0800

Absolutely, I will take a look asap.

Thanks Eric.


2016-11-14 23:09:31

by Eric Dumazet

Subject: Re: Long delays creating a netns after deleting one (possibly RCU related)

On Mon, 2016-11-14 at 14:46 -0800, Eric Dumazet wrote:
> On Mon, 2016-11-14 at 16:12 -0600, Eric W. Biederman wrote:
>
> > synchronize_rcu_expidited is not enough if you have multiple network
> > devices in play.
> >
> > Looking at the code it comes down to this commit, and it appears there
> > is a promise add rcu grace period combining by Eric Dumazet.
> >
> > Eric since people are hitting noticable stalls because of the rcu grace
> > period taking a long time do you think you could look at this code path
> > a bit more?
> >
> > commit 93d05d4a320cb16712bb3d57a9658f395d8cecb9
> > Author: Eric Dumazet <[email protected]>
> > Date: Wed Nov 18 06:31:03 2015 -0800
>
> Absolutely, I will take a loop asap.

The worst offender should be fixed by the following patch.

busy poll needs to poll the physical device, not a virtual one...

diff --git a/include/net/gro_cells.h b/include/net/gro_cells.h
index d15214d673b2e8e08fd6437b572278fb1359f10d..2a1abbf8da74368cd01adc40cef6c0644e059ef2 100644
--- a/include/net/gro_cells.h
+++ b/include/net/gro_cells.h
@@ -68,6 +68,9 @@ static inline int gro_cells_init(struct gro_cells *gcells, struct net_device *de
 		struct gro_cell *cell = per_cpu_ptr(gcells->cells, i);
 
 		__skb_queue_head_init(&cell->napi_skbs);
+
+		set_bit(NAPI_STATE_NO_BUSY_POLL, &cell->napi.state);
+
 		netif_napi_add(dev, &cell->napi, gro_cell_poll, 64);
 		napi_enable(&cell->napi);
 	}





2016-11-18 00:31:25

by Jarno Rajahalme

Subject: Re: Long delays creating a netns after deleting one (possibly RCU related)


> On Nov 14, 2016, at 3:09 PM, Eric Dumazet <[email protected]> wrote:
>
> On Mon, 2016-11-14 at 14:46 -0800, Eric Dumazet wrote:
>> On Mon, 2016-11-14 at 16:12 -0600, Eric W. Biederman wrote:
>>
>>> synchronize_rcu_expidited is not enough if you have multiple network
>>> devices in play.
>>>
>>> Looking at the code it comes down to this commit, and it appears there
>>> is a promise add rcu grace period combining by Eric Dumazet.
>>>
>>> Eric since people are hitting noticable stalls because of the rcu grace
>>> period taking a long time do you think you could look at this code path
>>> a bit more?
>>>
>>> commit 93d05d4a320cb16712bb3d57a9658f395d8cecb9
>>> Author: Eric Dumazet <[email protected]>
>>> Date: Wed Nov 18 06:31:03 2015 -0800
>>
>> Absolutely, I will take a loop asap.
>
> The worst offender should be fixed by the following patch.
>
> busy poll needs to poll the physical device, not a virtual one...
>
> diff --git a/include/net/gro_cells.h b/include/net/gro_cells.h
> index d15214d673b2e8e08fd6437b572278fb1359f10d..2a1abbf8da74368cd01adc40cef6c0644e059ef2 100644
> --- a/include/net/gro_cells.h
> +++ b/include/net/gro_cells.h
> @@ -68,6 +68,9 @@ static inline int gro_cells_init(struct gro_cells *gcells, struct net_device *de
> struct gro_cell *cell = per_cpu_ptr(gcells->cells, i);
>
> __skb_queue_head_init(&cell->napi_skbs);
> +
> + set_bit(NAPI_STATE_NO_BUSY_POLL, &cell->napi.state);
> +
> netif_napi_add(dev, &cell->napi, gro_cell_poll, 64);
> napi_enable(&cell->napi);
> }
>

This solved a ~20 second slowdown between OVS datapath unit tests for me.

Jarno


2016-11-19 00:38:12

by Jarno Rajahalme

Subject: Re: Long delays creating a netns after deleting one (possibly RCU related)


> On Nov 14, 2016, at 3:09 PM, Eric Dumazet <[email protected]> wrote:
>
> On Mon, 2016-11-14 at 14:46 -0800, Eric Dumazet wrote:
>> On Mon, 2016-11-14 at 16:12 -0600, Eric W. Biederman wrote:
>>
>>> synchronize_rcu_expidited is not enough if you have multiple network
>>> devices in play.
>>>
>>> Looking at the code it comes down to this commit, and it appears there
>>> is a promise add rcu grace period combining by Eric Dumazet.
>>>
>>> Eric since people are hitting noticable stalls because of the rcu grace
>>> period taking a long time do you think you could look at this code path
>>> a bit more?
>>>
>>> commit 93d05d4a320cb16712bb3d57a9658f395d8cecb9
>>> Author: Eric Dumazet <[email protected]>
>>> Date: Wed Nov 18 06:31:03 2015 -0800
>>
>> Absolutely, I will take a loop asap.
>
> The worst offender should be fixed by the following patch.
>
> busy poll needs to poll the physical device, not a virtual one...
>
> diff --git a/include/net/gro_cells.h b/include/net/gro_cells.h
> index d15214d673b2e8e08fd6437b572278fb1359f10d..2a1abbf8da74368cd01adc40cef6c0644e059ef2 100644
> --- a/include/net/gro_cells.h
> +++ b/include/net/gro_cells.h
> @@ -68,6 +68,9 @@ static inline int gro_cells_init(struct gro_cells *gcells, struct net_device *de
> struct gro_cell *cell = per_cpu_ptr(gcells->cells, i);
>
> __skb_queue_head_init(&cell->napi_skbs);
> +
> + set_bit(NAPI_STATE_NO_BUSY_POLL, &cell->napi.state);
> +
> netif_napi_add(dev, &cell->napi, gro_cell_poll, 64);
> napi_enable(&cell->napi);
> }
>
>
>
>
>

This fixes the problem for me, so for whatever it’s worth:

Tested-by: Jarno Rajahalme <[email protected]>


2016-11-19 00:41:25

by Eric Dumazet

Subject: Re: Long delays creating a netns after deleting one (possibly RCU related)

On Fri, 2016-11-18 at 16:38 -0800, Jarno Rajahalme wrote:

> This fixes the problem for me, so for whatever it’s worth:
>
> Tested-by: Jarno Rajahalme <[email protected]>
>

Thanks for testing !

https://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=e88a2766143a27bfe6704b4493b214de4094cf29