From: SeongJae Park <[email protected]>
On a few of our systems, I found frequent 'unshare(CLONE_NEWNET)' calls
make the number of active slab objects including 'sock_inode_cache' type
rapidly and continuously increase. As a result, memory pressure occurs.
In more detail, I made an artificial reproducer that resembles the
workload that we found the problem and reproduce the problem faster. It
merely repeats 'unshare(CLONE_NEWNET)' 50,000 times in a loop. It takes
about 2 minutes. On 40 CPU cores, 70GB DRAM machine, it reduced about
15GB of available memory in total. Note that the issue don't reproduce
on every machine. On my 6 CPU cores machine, the problem didn't
reproduce.
'cleanup_net()' and 'fqdir_work_fn()' are functions that deallocate the
relevant memory objects. They are asynchronously invoked by the work
queues and internally use 'rcu_barrier()' to ensure safe destructions.
'cleanup_net()' works in a batched maneer in a single thread worker,
while 'fqdir_work_fn()' works for each 'fqdir_exit()' call in the
'system_wq'.
Therefore, 'fqdir_work_fn()' called frequently under the workload and
made the contention for 'rcu_barrier()' high. In more detail, the
global mutex, 'rcu_state.barrier_mutex' became the bottleneck.
I tried making 'fqdir_work_fn()' batched and confirmed it works. The
following patch is for the change. I think this is the right solution
for point fix of this issue, but someone might blame different parts.
1. User: Frequent 'unshare()' calls
From some point of view, such frequent 'unshare()' calls might seem only
insane.
2. Global mutex in 'rcu_barrier()'
Because of the global mutex, 'rcu_barrier()' callers could wait long
even after the callbacks started before the call finished. Therefore,
similar issues could happen in another 'rcu_barrier()' usages. Maybe we
can use some wait queue like mechanism to notify the waiters when the
desired time came.
I personally believe applying the point fix for now and making
'rcu_barrier()' improvement in longterm make sense. If I'm missing
something or you have different opinion, please feel free to let me
know.
Patch History
-------------
Changes from v1
(https://lore.kernel.org/netdev/[email protected]/)
- Keep xmas tree variable ordering (Jakub Kicinski)
- Add more numbers (Eric Dumazet)
- Use 'llist_for_each_entry_safe()' (Eric Dumazet)
SeongJae Park (1):
net/ipv4/inet_fragment: Batch fqdir destroy works
include/net/inet_frag.h | 2 +-
net/ipv4/inet_fragment.c | 28 ++++++++++++++++++++--------
2 files changed, 21 insertions(+), 9 deletions(-)
--
2.17.1
From: SeongJae Park <[email protected]>
In 'fqdir_exit()', a work for destroy of the 'fqdir' is enqueued. The
work function, 'fqdir_work_fn()', calls 'rcu_barrier()'. In case of
intensive 'fqdir_exit()' (e.g., frequent 'unshare()' systemcalls), this
increased contention could result in unacceptably high latency of
'rcu_barrier()'. This commit avoids such contention by doing the
destroy work in batched manner, as similar to that of 'cleanup_net()'.
Signed-off-by: SeongJae Park <[email protected]>
---
include/net/inet_frag.h | 2 +-
net/ipv4/inet_fragment.c | 28 ++++++++++++++++++++--------
2 files changed, 21 insertions(+), 9 deletions(-)
diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index bac79e817776..558893d8810c 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -20,7 +20,7 @@ struct fqdir {
/* Keep atomic mem on separate cachelines in structs that include it */
atomic_long_t mem ____cacheline_aligned_in_smp;
- struct work_struct destroy_work;
+ struct llist_node destroy_list;
};
/**
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 10d31733297d..d5c40386a764 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -145,12 +145,19 @@ static void inet_frags_free_cb(void *ptr, void *arg)
inet_frag_destroy(fq);
}
+static LLIST_HEAD(destroy_list);
+
static void fqdir_work_fn(struct work_struct *work)
{
- struct fqdir *fqdir = container_of(work, struct fqdir, destroy_work);
- struct inet_frags *f = fqdir->f;
+ struct llist_node *kill_list;
+ struct fqdir *fqdir, *tmp;
+ struct inet_frags *f;
- rhashtable_free_and_destroy(&fqdir->rhashtable, inet_frags_free_cb, NULL);
+ /* Atomically snapshot the list of fqdirs to destroy */
+ kill_list = llist_del_all(&destroy_list);
+
+ llist_for_each_entry(fqdir, kill_list, destroy_list)
+ rhashtable_free_and_destroy(&fqdir->rhashtable, inet_frags_free_cb, NULL);
/* We need to make sure all ongoing call_rcu(..., inet_frag_destroy_rcu)
* have completed, since they need to dereference fqdir.
@@ -158,10 +165,13 @@ static void fqdir_work_fn(struct work_struct *work)
*/
rcu_barrier();
- if (refcount_dec_and_test(&f->refcnt))
- complete(&f->completion);
+ llist_for_each_entry_safe(fqdir, tmp, kill_list, destroy_list) {
+ f = fqdir->f;
+ if (refcount_dec_and_test(&f->refcnt))
+ complete(&f->completion);
- kfree(fqdir);
+ kfree(fqdir);
+ }
}
int fqdir_init(struct fqdir **fqdirp, struct inet_frags *f, struct net *net)
@@ -184,10 +194,12 @@ int fqdir_init(struct fqdir **fqdirp, struct inet_frags *f, struct net *net)
}
EXPORT_SYMBOL(fqdir_init);
+static DECLARE_WORK(fqdir_destroy_work, fqdir_work_fn);
+
void fqdir_exit(struct fqdir *fqdir)
{
- INIT_WORK(&fqdir->destroy_work, fqdir_work_fn);
- queue_work(system_wq, &fqdir->destroy_work);
+ if (llist_add(&fqdir->destroy_list, &destroy_list))
+ queue_work(system_wq, &fqdir_destroy_work);
}
EXPORT_SYMBOL(fqdir_exit);
--
2.17.1
On Thu, Dec 10, 2020 at 9:09 AM SeongJae Park <[email protected]> wrote:
>
> From: SeongJae Park <[email protected]>
>
> On a few of our systems, I found frequent 'unshare(CLONE_NEWNET)' calls
> make the number of active slab objects including 'sock_inode_cache' type
> rapidly and continuously increase. As a result, memory pressure occurs.
>
> In more detail, I made an artificial reproducer that resembles the
> workload that we found the problem and reproduce the problem faster. It
> merely repeats 'unshare(CLONE_NEWNET)' 50,000 times in a loop. It takes
> about 2 minutes. On 40 CPU cores, 70GB DRAM machine, it reduced about
> 15GB of available memory in total. Note that the issue don't reproduce
> on every machine. On my 6 CPU cores machine, the problem didn't
> reproduce.
OK, that is the number before the patch, but what is the number after
the patch ?
I think the idea is very nice, but this will serialize fqdir hash
tables destruction on one single cpu,
this might become a real issue _if_ these hash tables are populated.
(Obviously in your for (i=1;i<50000;i++) unshare(CLONE_NEWNET); all
these tables are empty...)
As you may now, frags are often used as vectors for DDOS attacks.
I would suggest maybe to not (ab)use system_wq, but a dedicated work queue
with a limit (@max_active argument set to 1 in alloc_workqueue()) , to
make sure that the number of
threads is optimal/bounded.
Only the phase after hash table removal could benefit from your
deferral to a single context,
so that a single rcu_barrier() is active, since the part after
rcu_barrier() is damn cheap and _can_ be serialized
if (refcount_dec_and_test(&f->refcnt))
complete(&f->completion);
Thanks !
>
> 'cleanup_net()' and 'fqdir_work_fn()' are functions that deallocate the
> relevant memory objects. They are asynchronously invoked by the work
> queues and internally use 'rcu_barrier()' to ensure safe destructions.
> 'cleanup_net()' works in a batched maneer in a single thread worker,
> while 'fqdir_work_fn()' works for each 'fqdir_exit()' call in the
> 'system_wq'.
>
> Therefore, 'fqdir_work_fn()' called frequently under the workload and
> made the contention for 'rcu_barrier()' high. In more detail, the
> global mutex, 'rcu_state.barrier_mutex' became the bottleneck.
>
> I tried making 'fqdir_work_fn()' batched and confirmed it works. The
> following patch is for the change. I think this is the right solution
> for point fix of this issue, but someone might blame different parts.
>
> 1. User: Frequent 'unshare()' calls
> From some point of view, such frequent 'unshare()' calls might seem only
> insane.
>
> 2. Global mutex in 'rcu_barrier()'
> Because of the global mutex, 'rcu_barrier()' callers could wait long
> even after the callbacks started before the call finished. Therefore,
> similar issues could happen in another 'rcu_barrier()' usages. Maybe we
> can use some wait queue like mechanism to notify the waiters when the
> desired time came.
>
> I personally believe applying the point fix for now and making
> 'rcu_barrier()' improvement in longterm make sense. If I'm missing
> something or you have different opinion, please feel free to let me
> know.
>
>
> Patch History
> -------------
>
> Changes from v1
> (https://lore.kernel.org/netdev/[email protected]/)
> - Keep xmas tree variable ordering (Jakub Kicinski)
> - Add more numbers (Eric Dumazet)
> - Use 'llist_for_each_entry_safe()' (Eric Dumazet)
>
> SeongJae Park (1):
> net/ipv4/inet_fragment: Batch fqdir destroy works
>
> include/net/inet_frag.h | 2 +-
> net/ipv4/inet_fragment.c | 28 ++++++++++++++++++++--------
> 2 files changed, 21 insertions(+), 9 deletions(-)
>
> --
> 2.17.1
>
On Thu, 10 Dec 2020 15:09:10 +0100 Eric Dumazet <[email protected]> wrote:
> On Thu, Dec 10, 2020 at 9:09 AM SeongJae Park <[email protected]> wrote:
> >
> > From: SeongJae Park <[email protected]>
> >
> > On a few of our systems, I found frequent 'unshare(CLONE_NEWNET)' calls
> > make the number of active slab objects including 'sock_inode_cache' type
> > rapidly and continuously increase. As a result, memory pressure occurs.
> >
> > In more detail, I made an artificial reproducer that resembles the
> > workload that we found the problem and reproduce the problem faster. It
> > merely repeats 'unshare(CLONE_NEWNET)' 50,000 times in a loop. It takes
> > about 2 minutes. On 40 CPU cores, 70GB DRAM machine, it reduced about
> > 15GB of available memory in total. Note that the issue don't reproduce
> > on every machine. On my 6 CPU cores machine, the problem didn't
> > reproduce.
>
> OK, that is the number before the patch, but what is the number after
> the patch ?
No continuous memory reduction but some fluctuation observed. Nevertheless,
the available memory reduction was only up to about 400MB.
>
> I think the idea is very nice, but this will serialize fqdir hash
> tables destruction on one single cpu,
> this might become a real issue _if_ these hash tables are populated.
>
> (Obviously in your for (i=1;i<50000;i++) unshare(CLONE_NEWNET); all
> these tables are empty...)
>
> As you may now, frags are often used as vectors for DDOS attacks.
>
> I would suggest maybe to not (ab)use system_wq, but a dedicated work queue
> with a limit (@max_active argument set to 1 in alloc_workqueue()) , to
> make sure that the number of
> threads is optimal/bounded.
>
> Only the phase after hash table removal could benefit from your
> deferral to a single context,
> so that a single rcu_barrier() is active, since the part after
> rcu_barrier() is damn cheap and _can_ be serialized
>
> if (refcount_dec_and_test(&f->refcnt))
> complete(&f->completion);
Good point, thanks for this kind suggestion. I will do so in next version.
Thanks,
SeongJae Park