On Tue, Nov 14, 2017 at 04:53:33PM +0300, Kirill Tkhai wrote:
> Currently a mutex is used to protect the pernet operations list. It makes
> cleanup_net() execute the ->exit methods of the same operations set
> that was used at the time of ->init, even after the net namespace is
> unlinked from net_namespace_list.
>
> But the problem is that synchronize_rcu() is needed after net is removed
> from net_namespace_list():
>
> Destroy net_ns:
> cleanup_net()
>   mutex_lock(&net_mutex)
>   list_del_rcu(&net->list)
>   synchronize_rcu() <--- Sleep there for ages
>   list_for_each_entry_reverse(ops, &pernet_list, list)
>     ops_exit_list(ops, &net_exit_list)
>   list_for_each_entry_reverse(ops, &pernet_list, list)
>     ops_free_list(ops, &net_exit_list)
>   mutex_unlock(&net_mutex)
>
> This primitive is not fast, especially on systems with many processors
> and/or when preemptible RCU is enabled in the config. So, for the whole
> time cleanup_net() is waiting for the RCU grace period, new net namespaces
> cannot be created: the tasks trying to do so sleep on the same mutex:
>
> Create net_ns:
> copy_net_ns()
>   mutex_lock_killable(&net_mutex) <--- Sleep there for ages
>
> The solution is to convert net_mutex to an rw_semaphore. Then the
> pernet_operations::init/::exit methods, which modify net-related data,
> will require only down_read() locking, while down_write() will be used
> for changing pernet_list.
>
> This gives a significant performance increase, as you can see below. What
> is measured is sequential net namespace creation in a loop, in a single
> thread, without other tasks (single-user mode):
>
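> 1)
> #define _GNU_SOURCE /* for unshare() and CLONE_NEWNET */
> #include <sched.h>
> #include <stdio.h>
> #include <stdlib.h>
>
> int main(int argc, char *argv[])
> {
>     unsigned nr;
>
>     if (argc < 2) {
>         fprintf(stderr, "Provide nr iterations arg\n");
>         return 1;
>     }
>     nr = atoi(argv[1]);
>     while (nr-- > 0) {
>         /* Needs CAP_SYS_ADMIN. */
>         if (unshare(CLONE_NEWNET)) {
>             perror("Can't unshare");
>             return 1;
>         }
>     }
>     return 0;
> }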
>
> Origin, 100000 unshare():
> 0.03user 23.14system 1:39.85elapsed 23%CPU
>
> Patched, 100000 unshare():
> 0.03user 67.49system 1:08.34elapsed 98%CPU
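The %CPU column printed by GNU time(1) is (user + system) / elapsed, so the two runs above can be checked by hand: (0.03 + 23.14) / 99.85 is about 23% for the original kernel, and (0.03 + 67.49) / 68.34 is about 98.8% for the patched one (time(1) reports 98%). In other words, the original run spent most of its wall-clock time asleep, while the patched run kept a CPU busy. A small C helper doing the same arithmetic, assuming the elapsed field has the m:ss.ss form shown above; the function names are hypothetical.

```c
/* Hypothetical helpers for checking GNU time's %CPU figure by hand. */
#include <assert.h>
#include <stdio.h>

/* Parse "m:ss.ss" (e.g. "1:39.85") into seconds; returns -1.0 on error.
 * Assumption: this does NOT handle the h:mm:ss form used for runs
 * longer than an hour. */
double elapsed_seconds(const char *s)
{
    unsigned minutes;
    double seconds;

    if (sscanf(s, "%u:%lf", &minutes, &seconds) != 2)
        return -1.0;
    return minutes * 60.0 + seconds;
}

/* %CPU as GNU time documents it: (user + system) / elapsed. */
double cpu_percent(double user, double system, double elapsed)
{
    assert(elapsed > 0.0);
    return 100.0 * (user + system) / elapsed;
}
```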
>
> 2)for i in {1..10000}; do unshare -n bash -c exit; done
Hi Kirill,
This mutex has another role. As you know, net namespaces are destroyed
asynchronously, and the net mutex guarantees that the backlog will not
be big. If we have something in the backlog, we know that it will be
handled before a new net ns is created.
As far as I remember, net namespaces are created much faster than
they are destroyed, so with these changes we can create a really big
backlog, can't we?
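A toy model of this concern, under loudly simplified assumptions: every created namespace becomes pending destruction right away (as in the unshare -n bash -c exit loop), creators can make `create` namespaces per tick, and the asynchronous cleanup can destroy at most `destroy` per tick. With the mutex, a tick spent on cleanup is a tick in which creators wait; without it, both happen in the same tick. The function name and the tick model are hypothetical and do not describe real kernel timing.

```c
/* Hypothetical toy model of the netns destruction backlog. */
#include <assert.h>
#include <stdbool.h>

/* Returns how many namespaces still wait for destruction after
 * `ticks` steps. `serialized` models creation and cleanup sharing
 * one mutex. */
unsigned long simulate_backlog(unsigned create, unsigned destroy,
                               unsigned ticks, bool serialized)
{
    unsigned long backlog = 0;

    assert(destroy > 0);
    for (unsigned t = 0; t < ticks; t++) {
        /* Cleanup runs whenever something is pending. */
        bool cleaning = backlog > 0;

        if (cleaning)
            backlog -= backlog < destroy ? backlog : destroy;
        /* With a shared mutex, creators sleep while cleanup holds it. */
        if (serialized && cleaning)
            continue;
        backlog += create;
    }
    return backlog;
}
```

With create = 10 and destroy = 2, the unserialized backlog grows by 8 per tick (802 after 100 ticks), while the serialized one never exceeds 10. The price of that bound is that creators spend most ticks waiting, which is the slowdown the patch tries to remove.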
There was a discussion a few months ago:
https://lists.onap.org/pipermail/containers/2016-October/037509.html
>
> Origin:
> real 1m24,190s
> user 0m6,225s
> sys 0m15,132s
Here you measure the time of creating and destroying net namespaces.
>
> Patched:
> real 0m18,235s (4.6 times faster)
> user 0m4,544s
> sys 0m13,796s
But here you measure the time of creating namespaces, and you know nothing
about when they will be destroyed.
Thanks,
Andrei
From 1584066129534824912@xxx Tue Nov 14 18:13:03 +0000 2017
On 14.11.2017 20:44, Andrei Vagin wrote:
> Hi Kirill,
>
> This mutex has another role. As you know, net namespaces are destroyed
> asynchronously, and the net mutex guarantees that the backlog will not
> be big. If we have something in the backlog, we know that it will be
> handled before a new net ns is created.
>
> As far as I remember, net namespaces are created much faster than
> they are destroyed, so with these changes we can create a really big
> backlog, can't we?
I don't think limiting is a good goal for the mutex, because it's very
easy to create many net namespaces even while the mutex exists. You may
open /proc/[pid]/ns/net like a file, and the net_ns counter will be
incremented. Then do unshare(), and the mutex has no way to protect
against that. Anyway, a mutex can't limit the number of something in
general; I've never seen a (good) example of that in the kernel.
As I see it, the real limit is enforced in inc_net_namespaces(); the
counter it increments is decremented after the RCU grace period in
cleanup_net(), and that has not changed.
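A minimal sketch of the kind of limit meant here: a counter that is incremented when a namespace is created, refused once it reaches a maximum, and decremented only after cleanup has finished. C11 atomics stand in for the kernel's ucounts machinery, and all names are hypothetical; as far as I know, the maximum behind inc_net_namespaces() is the per-user-namespace user.max_net_namespaces sysctl.

```c
/* Hypothetical counter-based limit; not the kernel's ucounts code. */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct ns_limit {
    atomic_int count;
    int max;
};

/* Creation path: take a slot, or fail once the limit is reached.
 * The compare-exchange loop keeps concurrent creators from
 * overshooting max without any mutex. */
bool ns_limit_inc(struct ns_limit *l)
{
    int old = atomic_load(&l->count);

    do {
        if (old >= l->max)
            return false;
    } while (!atomic_compare_exchange_weak(&l->count, &old, old + 1));
    return true;
}

/* Cleanup path: release the slot only when teardown is complete
 * (after the RCU grace period, per the mail above), so namespaces
 * still waiting in the cleanup backlog keep counting against max. */
void ns_limit_dec(struct ns_limit *l)
{
    int old = atomic_fetch_sub(&l->count, 1);

    assert(old > 0);
    (void)old;
}
```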
> There was a discussion a few months ago:
> https://lists.onap.org/pipermail/containers/2016-October/037509.html
>
>
>>
>> Origin:
>> real 1m24,190s
>> user 0m6,225s
>> sys 0m15,132s
>
> Here you measure the time of creating and destroying net namespaces.
>
>>
>> Patched:
>> real 0m18,235s (4.6 times faster)
>> user 0m4,544s
>> sys 0m13,796s
>
> But here you measure the time of creating namespaces, and you know
> nothing about when they will be destroyed.
You're right, and I predict that the total CPU time will remain the same,
but the point is that creation and destruction can now run in parallel.
From 1584065494278221228@xxx Tue Nov 14 18:02:57 +0000 2017
On Tue, 2017-11-14 at 09:44 -0800, Andrei Vagin wrote:
> Hi Kirill,
>
> This mutex has another role. As you know, net namespaces are destroyed
> asynchronously, and the net mutex guarantees that the backlog will not
> be big. If we have something in the backlog, we know that it will be
> handled before a new net ns is created.
>
> As far as I remember, net namespaces are created much faster than
> they are destroyed, so with these changes we can create a really big
> backlog, can't we?
Please take a look at the recent patches I did:
8ca712c373a462cfa1b62272870b6c2c74aa83f9 Merge branch 'net-speedup-netns-create-delete-time'
64bc17811b72758753e2b64cd8f2a63812c61fe1 ipv4: speedup ipv6 tunnels dismantle
bb401caefe9d2c65e0c0fa23b21deecfbfa473fe ipv6: speedup ipv6 tunnels dismantle
789e6ddb0b2fb5d5024b760b178a47876e4de7a6 tcp: batch tcp_net_metrics_exit
a90c9347e90ed1e9323d71402ed18023bc910cd8 ipv6: addrlabel: per netns list
d464e84eed02993d40ad55fdc19f4523e4deee5b kobject: factorize skb setup in kobject_uevent_net_broadcast()
4a336a23d619e96aef37d4d054cfadcdd1b581ba kobject: copy env blob in one go
16dff336b33d87c15d9cbe933cfd275aae2a8251 kobject: add kobject_uevent_net_broadcast()
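Several of the commits listed above (for example "tcp: batch tcp_net_metrics_exit") appear to cut dismantle time by doing teardown work once per batch of dying namespaces instead of once per namespace; struct pernet_operations has an ->exit_batch() hook that receives the whole list of exiting namespaces for this purpose. The toy below only counts how many times a hypothetical fixed-cost step (for instance one full pass over a global hash table) is paid; all names are hypothetical and it does not reproduce any of those commits.

```c
/* Hypothetical toy comparing per-namespace ->exit with a batched
 * ->exit_batch-style teardown. Not kernel code. */
#include <assert.h>
#include <stddef.h>

struct demo_net {
    int torn_down;
};

/* Counts how often the expensive fixed-cost step runs. */
static unsigned long demo_global_scans;

static void demo_scan_global_table(void)
{
    demo_global_scans++;
}

/* ->exit style: one call per dying namespace, one scan per call. */
static void demo_exit(struct demo_net *net)
{
    demo_scan_global_table();
    net->torn_down = 1;
}

/* ->exit_batch style: one scan handles every namespace in the batch. */
static void demo_exit_batch(struct demo_net *nets, size_t n)
{
    if (n == 0)
        return;
    demo_scan_global_table();
    for (size_t i = 0; i < n; i++)
        nets[i].torn_down = 1;
}

/* Tear down n namespaces and return how many global scans it cost. */
unsigned long demo_dismantle(struct demo_net *nets, size_t n, int batched)
{
    assert(nets != NULL || n == 0);
    demo_global_scans = 0;
    if (batched)
        demo_exit_batch(nets, n);
    else
        for (size_t i = 0; i < n; i++)
            demo_exit(&nets[i]);
    return demo_global_scans;
}
```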
From 1584064180049940864@xxx Tue Nov 14 17:42:04 +0000 2017