2020-12-11 12:25:21

by SeongJae Park

Subject: [PATCH v3 0/1] net: Reduce rcu_barrier() contentions from 'unshare(CLONE_NEWNET)'

From: SeongJae Park <[email protected]>

On a few of our systems, I found that frequent 'unshare(CLONE_NEWNET)'
calls make the number of active slab objects, including those of the
'sock_inode_cache' type, increase rapidly and continuously. As a
result, the systems come under memory pressure.

In more detail, I made an artificial reproducer that resembles the
workload in which we found the problem and reproduces it faster. It
merely repeats 'unshare(CLONE_NEWNET)' 50,000 times in a loop, which
takes about 2 minutes. On a machine with 40 CPU cores and 70GB of
DRAM, the available memory dropped continuously at a fast rate (about
120MB per second, 15GB in total within the 2 minutes). Note that the
issue doesn't reproduce on every machine; on my 6 CPU core machine,
the problem didn't show up.
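
For reference, a minimal sketch of such a reproducer could look like
the code below. This is only an illustration of its shape; the exact
program I used is not included here.

/* repro.c: hypothetical sketch of the reproducer described above.
 * Repeatedly creates and abandons network namespaces; watch the
 * available memory (e.g., 'free -m') while it runs.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	int i;

	for (i = 0; i < 50000; i++) {
		if (unshare(CLONE_NEWNET) != 0) {
			perror("unshare(CLONE_NEWNET)");
			return EXIT_FAILURE;
		}
	}
	return 0;
}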

'cleanup_net()' and 'fqdir_work_fn()' are the functions that
deallocate the relevant memory objects. They are invoked
asynchronously via workqueues and internally use 'rcu_barrier()' to
ensure safe destruction. 'cleanup_net()' works in a batched manner on
a single-threaded worker, while 'fqdir_work_fn()' runs once per
'fqdir_exit()' call on the 'system_wq'.
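
For context, the pre-patch destroy path looks roughly like below
(condensed from 'net/ipv4/inet_fragment.c'; unrelated details
omitted):

static void fqdir_work_fn(struct work_struct *work)
{
	struct fqdir *fqdir = container_of(work, struct fqdir, destroy_work);
	struct inet_frags *f = fqdir->f;

	rhashtable_free_and_destroy(&fqdir->rhashtable, inet_frags_free_cb, NULL);

	/* wait for in-flight call_rcu(..., inet_frag_destroy_rcu) calls */
	rcu_barrier();

	if (refcount_dec_and_test(&f->refcnt))
		complete(&f->completion);

	kfree(fqdir);
}

void fqdir_exit(struct fqdir *fqdir)
{
	INIT_WORK(&fqdir->destroy_work, fqdir_work_fn);
	queue_work(system_wq, &fqdir->destroy_work);
}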

Therefore, 'fqdir_work_fn()' was invoked frequently under the workload
and made the contention on 'rcu_barrier()' high. More specifically,
the global mutex 'rcu_state.barrier_mutex' became the bottleneck.

I tried making the 'rcu_barrier()' and the subsequent lightweight work
in 'fqdir_work_fn()' be processed in batch by a dedicated
single-threaded worker, and confirmed that it works. After the
change, no continuous memory reduction was observed, only some
fluctuation, and the available memory dropped by at most about 400MB.
The following patch implements the change. I think this is the right
point fix for the issue, but someone might blame different parts.

1. User: Frequent 'unshare()' calls
From some point of view, such frequent 'unshare()' calls might simply
seem insane.

2. Global mutex in 'rcu_barrier()'
Because of the global mutex, 'rcu_barrier()' callers could wait a long
time before the call finishes, even after their callbacks have
started. Therefore, similar issues could happen with other
'rcu_barrier()' usages. Maybe we could use some wait-queue-like
mechanism to notify the waiters once the point they are waiting for is
reached, as sketched below.
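
A purely hypothetical sketch of that idea follows; the names are made
up and this is not how 'rcu_barrier()' is actually implemented. Each
caller samples the sequence number of the next barrier to start
(which, by the barrier semantics, covers everything queued so far) and
sleeps on a wait queue until that sequence has completed, so waiters
are batched rather than serialized on a mutex:

#include <linux/spinlock.h>
#include <linux/wait.h>
#include <linux/workqueue.h>

static DECLARE_WAIT_QUEUE_HEAD(barrier_wq);
static DEFINE_SPINLOCK(barrier_lock);
static unsigned long barrier_seq_started;
static unsigned long barrier_seq_completed;

static void barrier_work_fn(struct work_struct *work)
{
	unsigned long seq;

	spin_lock(&barrier_lock);
	seq = ++barrier_seq_started;
	spin_unlock(&barrier_lock);

	/* ... the actual barrier machinery would run here ... */

	spin_lock(&barrier_lock);
	barrier_seq_completed = seq;
	spin_unlock(&barrier_lock);
	wake_up_all(&barrier_wq);
}
static DECLARE_WORK(barrier_work, barrier_work_fn);

void barrier_wait(void)
{
	unsigned long seq;

	spin_lock(&barrier_lock);
	/* the next barrier to start will cover our callbacks */
	seq = barrier_seq_started + 1;
	spin_unlock(&barrier_lock);

	/* runs once more after the current execution if already running */
	queue_work(system_wq, &barrier_work);

	wait_event(barrier_wq, READ_ONCE(barrier_seq_completed) >= seq);
}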

I personally believe that applying the point fix for now and improving
'rcu_barrier()' in the long term makes sense. If I'm missing
something or you have a different opinion, please feel free to let me
know.


Patch History
-------------

Changes from v2
(https://lore.kernel.org/lkml/[email protected]/)
- Add numbers after the patch (Eric Dumazet)
- Make only 'rcu_barrier()' and subsequent lightweight works serialized
(Eric Dumazet)

Changes from v1
(https://lore.kernel.org/netdev/[email protected]/)
- Keep xmas tree variable ordering (Jakub Kicinski)
- Add more numbers (Eric Dumazet)
- Use 'llist_for_each_entry_safe()' (Eric Dumazet)


SeongJae Park (1):
net/ipv4/inet_fragment: Batch fqdir destroy works

include/net/inet_frag.h | 1 +
net/ipv4/inet_fragment.c | 45 +++++++++++++++++++++++++++++++++-------
2 files changed, 39 insertions(+), 7 deletions(-)

--
2.17.1


2020-12-11 16:52:34

by SeongJae Park

Subject: [PATCH v3 1/1] net/ipv4/inet_fragment: Batch fqdir destroy works

From: SeongJae Park <[email protected]>

For each 'fqdir_exit()' call, a work item for destruction of the
'fqdir' is enqueued. The work function, 'fqdir_work_fn()', internally
calls 'rcu_barrier()'. In case of intensive 'fqdir_exit()' (e.g.,
frequent 'unshare()' system calls), the resulting contention can lead
to unacceptably high latency of 'rcu_barrier()'. This commit avoids
such contention by doing the 'rcu_barrier()' and the subsequent
lightweight work in a batched manner using a dedicated single-threaded
worker, similar to 'cleanup_net()'.

Signed-off-by: SeongJae Park <[email protected]>
---
include/net/inet_frag.h | 1 +
net/ipv4/inet_fragment.c | 45 +++++++++++++++++++++++++++++++++-------
2 files changed, 39 insertions(+), 7 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index bac79e817776..48cc5795ceda 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -21,6 +21,7 @@ struct fqdir {
/* Keep atomic mem on separate cachelines in structs that include it */
atomic_long_t mem ____cacheline_aligned_in_smp;
struct work_struct destroy_work;
+ struct llist_node free_list;
};

/**
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 10d31733297d..a6fc4afcc323 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -145,12 +145,17 @@ static void inet_frags_free_cb(void *ptr, void *arg)
inet_frag_destroy(fq);
}

-static void fqdir_work_fn(struct work_struct *work)
+static struct workqueue_struct *fqdir_wq;
+static LLIST_HEAD(free_list);
+
+static void fqdir_free_fn(struct work_struct *work)
{
- struct fqdir *fqdir = container_of(work, struct fqdir, destroy_work);
- struct inet_frags *f = fqdir->f;
+ struct llist_node *kill_list;
+ struct fqdir *fqdir, *tmp;
+ struct inet_frags *f;

- rhashtable_free_and_destroy(&fqdir->rhashtable, inet_frags_free_cb, NULL);
+ /* Atomically snapshot the list of fqdirs to free */
+ kill_list = llist_del_all(&free_list);

/* We need to make sure all ongoing call_rcu(..., inet_frag_destroy_rcu)
* have completed, since they need to dereference fqdir.
@@ -158,12 +163,38 @@ static void fqdir_work_fn(struct work_struct *work)
*/
rcu_barrier();

- if (refcount_dec_and_test(&f->refcnt))
- complete(&f->completion);
+ llist_for_each_entry_safe(fqdir, tmp, kill_list, free_list) {
+ f = fqdir->f;
+ if (refcount_dec_and_test(&f->refcnt))
+ complete(&f->completion);

- kfree(fqdir);
+ kfree(fqdir);
+ }
}

+static DECLARE_WORK(fqdir_free_work, fqdir_free_fn);
+
+static void fqdir_work_fn(struct work_struct *work)
+{
+ struct fqdir *fqdir = container_of(work, struct fqdir, destroy_work);
+
+ rhashtable_free_and_destroy(&fqdir->rhashtable, inet_frags_free_cb, NULL);
+
+ if (llist_add(&fqdir->free_list, &free_list))
+ queue_work(fqdir_wq, &fqdir_free_work);
+}
+
+static int __init fqdir_wq_init(void)
+{
+ fqdir_wq = create_singlethread_workqueue("fqdir");
+ if (!fqdir_wq)
+ panic("Could not create fqdir workq");
+ return 0;
+}
+
+pure_initcall(fqdir_wq_init);
+
+
int fqdir_init(struct fqdir **fqdirp, struct inet_frags *f, struct net *net)
{
struct fqdir *fqdir = kzalloc(sizeof(*fqdir), GFP_KERNEL);
--
2.17.1

2020-12-11 16:54:30

by Eric Dumazet

Subject: Re: [PATCH v3 1/1] net/ipv4/inet_fragment: Batch fqdir destroy works

On Fri, Dec 11, 2020 at 9:21 AM SeongJae Park <[email protected]> wrote:
>
> From: SeongJae Park <[email protected]>
>
> For each 'fqdir_exit()' call, a work item for destruction of the
> 'fqdir' is enqueued. The work function, 'fqdir_work_fn()', internally
> calls 'rcu_barrier()'. In case of intensive 'fqdir_exit()' (e.g.,
> frequent 'unshare()' system calls), the resulting contention can lead
> to unacceptably high latency of 'rcu_barrier()'. This commit avoids
> such contention by doing the 'rcu_barrier()' and the subsequent
> lightweight work in a batched manner using a dedicated single-threaded
> worker, similar to 'cleanup_net()'.


Not sure why you submit a patch series with a single patch.

Your cover letter contains interesting info that would be better
captured in this changelog, IMO.

>
> Signed-off-by: SeongJae Park <[email protected]>
> ---
> include/net/inet_frag.h | 1 +
> net/ipv4/inet_fragment.c | 45 +++++++++++++++++++++++++++++++++-------
> 2 files changed, 39 insertions(+), 7 deletions(-)
>
> diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
> index bac79e817776..48cc5795ceda 100644
> --- a/include/net/inet_frag.h
> +++ b/include/net/inet_frag.h
> @@ -21,6 +21,7 @@ struct fqdir {
> /* Keep atomic mem on separate cachelines in structs that include it */
> atomic_long_t mem ____cacheline_aligned_in_smp;
> struct work_struct destroy_work;
> + struct llist_node free_list;
> };
>
> /**
> diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
> index 10d31733297d..a6fc4afcc323 100644
> --- a/net/ipv4/inet_fragment.c
> +++ b/net/ipv4/inet_fragment.c
> @@ -145,12 +145,17 @@ static void inet_frags_free_cb(void *ptr, void *arg)
> inet_frag_destroy(fq);
> }
>
> -static void fqdir_work_fn(struct work_struct *work)
> +static struct workqueue_struct *fqdir_wq;
> +static LLIST_HEAD(free_list);
> +
> +static void fqdir_free_fn(struct work_struct *work)
> {
> - struct fqdir *fqdir = container_of(work, struct fqdir, destroy_work);
> - struct inet_frags *f = fqdir->f;
> + struct llist_node *kill_list;
> + struct fqdir *fqdir, *tmp;
> + struct inet_frags *f;
>
> - rhashtable_free_and_destroy(&fqdir->rhashtable, inet_frags_free_cb, NULL);
> + /* Atomically snapshot the list of fqdirs to free */
> + kill_list = llist_del_all(&free_list);
>
> /* We need to make sure all ongoing call_rcu(..., inet_frag_destroy_rcu)
> * have completed, since they need to dereference fqdir.
> @@ -158,12 +163,38 @@ static void fqdir_work_fn(struct work_struct *work)
> */
> rcu_barrier();
>
> - if (refcount_dec_and_test(&f->refcnt))
> - complete(&f->completion);
> + llist_for_each_entry_safe(fqdir, tmp, kill_list, free_list) {
> + f = fqdir->f;
> + if (refcount_dec_and_test(&f->refcnt))
> + complete(&f->completion);
>
> - kfree(fqdir);
> + kfree(fqdir);
> + }
> }
>
> +static DECLARE_WORK(fqdir_free_work, fqdir_free_fn);
> +
> +static void fqdir_work_fn(struct work_struct *work)
> +{
> + struct fqdir *fqdir = container_of(work, struct fqdir, destroy_work);
> +
> + rhashtable_free_and_destroy(&fqdir->rhashtable, inet_frags_free_cb, NULL);
> +
> + if (llist_add(&fqdir->free_list, &free_list))
> + queue_work(fqdir_wq, &fqdir_free_work);

I think you misunderstood me.

Since this fqdir_free_work will have at most one instance, you can use
system_wq here; there is no risk of abuse.

My suggestion was to not use system_wq for fqdir_exit(), to better
control the number of threads in rhashtable cleanups.

void fqdir_exit(struct fqdir *fqdir)
{
INIT_WORK(&fqdir->destroy_work, fqdir_work_fn);
- queue_work(system_wq, &fqdir->destroy_work);
+ queue_work(fqdir_wq, &fqdir->destroy_work);
}



> +}
> +
> +static int __init fqdir_wq_init(void)
> +{
> + fqdir_wq = create_singlethread_workqueue("fqdir");


And here, I suggest using a non-ordered workqueue, allowing one thread
per CPU, to allow concurrent rhashtable cleanups.

Also, the "fqdir" name is rather vague; isn't it an implementation
detail?

fqdir_wq = create_workqueue("inet_frag_wq");
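
Putting the two points together, something like the following
(untested, purely for illustration, building on the llist batching in
this patch) is what I have in mind:

static struct workqueue_struct *inet_frag_wq;

static int __init inet_frag_wq_init(void)
{
	inet_frag_wq = create_workqueue("inet_frag_wq");
	if (!inet_frag_wq)
		panic("Could not create inet frag workq");
	return 0;
}
pure_initcall(inet_frag_wq_init);

static void fqdir_work_fn(struct work_struct *work)
{
	struct fqdir *fqdir = container_of(work, struct fqdir, destroy_work);

	rhashtable_free_and_destroy(&fqdir->rhashtable, inet_frags_free_cb, NULL);

	/* at most one instance of fqdir_free_work, so system_wq is fine */
	if (llist_add(&fqdir->free_list, &free_list))
		queue_work(system_wq, &fqdir_free_work);
}

void fqdir_exit(struct fqdir *fqdir)
{
	INIT_WORK(&fqdir->destroy_work, fqdir_work_fn);
	queue_work(inet_frag_wq, &fqdir->destroy_work);
}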

> + if (!fqdir_wq)
> + panic("Could not create fqdir workq");
> + return 0;
> +}
> +
> +pure_initcall(fqdir_wq_init);
> +
> +
> int fqdir_init(struct fqdir **fqdirp, struct inet_frags *f, struct net *net)
> {
> struct fqdir *fqdir = kzalloc(sizeof(*fqdir), GFP_KERNEL);
> --
> 2.17.1
>