References: <20201210080844.23741-1-sjpark@amazon.com>
In-Reply-To: <20201210080844.23741-1-sjpark@amazon.com>
From: Eric Dumazet
Date: Thu, 10 Dec 2020 15:09:10 +0100
Subject: Re: [PATCH v2 0/1] net: Reduce rcu_barrier() contentions from 'unshare(CLONE_NEWNET)'
To: SeongJae Park
Cc: David Miller, SeongJae Park, Jakub Kicinski, Alexey Kuznetsov,
 Florian Westphal, "Paul E. McKenney", netdev, rcu@vger.kernel.org, LKML
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Dec 10, 2020 at 9:09 AM SeongJae Park wrote:
>
> From: SeongJae Park
>
> On a few of our systems, I found frequent 'unshare(CLONE_NEWNET)' calls
> make the number of active slab objects including 'sock_inode_cache' type
> rapidly and continuously increase. As a result, memory pressure occurs.
>
> In more detail, I made an artificial reproducer that resembles the
> workload in which we found the problem and reproduces the problem faster.
> It merely repeats 'unshare(CLONE_NEWNET)' 50,000 times in a loop. It takes
> about 2 minutes. On a machine with 40 CPU cores and 70GB DRAM, it reduced
> available memory by about 15GB in total. Note that the issue doesn't
> reproduce on every machine. On my 6 CPU cores machine, the problem didn't
> reproduce.

OK, that is the number before the patch, but what is the number after
the patch ?

I think the idea is very nice, but this will serialize fqdir hash tables
destruction on one single cpu, which might become a real issue _if_
these hash tables are populated.

(Obviously in your for (i=1;i<50000;i++) unshare(CLONE_NEWNET); all
these tables are empty...)

As you may know, frags are often used as vectors for DDoS attacks.

I would suggest maybe to not (ab)use system_wq, but to use a dedicated
work queue with a limit (@max_active argument set to 1 in
alloc_workqueue()), to make sure that the number of threads is
optimal/bounded.

Only the phase after hash table removal could benefit from your
deferral to a single context, so that a single rcu_barrier() is active,
since the part after rcu_barrier() is damn cheap and _can_ be serialized:

    if (refcount_dec_and_test(&f->refcnt))
        complete(&f->completion);

Thanks !

>
> 'cleanup_net()' and 'fqdir_work_fn()' are functions that deallocate the
> relevant memory objects. They are asynchronously invoked by the work
> queues and internally use 'rcu_barrier()' to ensure safe destructions.
> 'cleanup_net()' works in a batched manner in a single thread worker,
> while 'fqdir_work_fn()' works for each 'fqdir_exit()' call in the
> 'system_wq'.
>
> Therefore, 'fqdir_work_fn()' was called frequently under the workload and
> made the contention for 'rcu_barrier()' high. In more detail, the
> global mutex, 'rcu_state.barrier_mutex', became the bottleneck.
>
> I tried making 'fqdir_work_fn()' batched and confirmed it works. The
> following patch is for the change. I think this is the right solution
> for a point fix of this issue, but someone might blame different parts.
>
> 1. User: Frequent 'unshare()' calls
> From some point of view, such frequent 'unshare()' calls might seem only
> insane.
>
> 2. Global mutex in 'rcu_barrier()'
> Because of the global mutex, 'rcu_barrier()' callers could wait long
> even after the callbacks started before the call finished.
> Therefore, similar issues could happen in other 'rcu_barrier()' usages.
> Maybe we can use some wait-queue-like mechanism to notify the waiters
> when the desired time comes.
>
> I personally believe applying the point fix for now and making the
> 'rcu_barrier()' improvement in the long term makes sense. If I'm missing
> something or you have a different opinion, please feel free to let me
> know.
>
>
> Patch History
> -------------
>
> Changes from v1
> (https://lore.kernel.org/netdev/20201208094529.23266-1-sjpark@amazon.com/)
> - Keep xmas tree variable ordering (Jakub Kicinski)
> - Add more numbers (Eric Dumazet)
> - Use 'llist_for_each_entry_safe()' (Eric Dumazet)
>
> SeongJae Park (1):
>   net/ipv4/inet_fragment: Batch fqdir destroy works
>
>  include/net/inet_frag.h  |  2 +-
>  net/ipv4/inet_fragment.c | 28 ++++++++++++++++++++--------
>  2 files changed, 21 insertions(+), 9 deletions(-)
>
> --
> 2.17.1
>
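
The batching idea discussed in this thread can be illustrated with a
small userspace sketch. It is not the patch and not kernel code: it
assumes a lock-free list in the spirit of the kernel's llist, and every
name in it (struct pending, queue_destroy(), destroy_batch(),
fake_barrier()) is hypothetical. fake_barrier() only counts calls where
the kernel would pay for rcu_barrier(). What it shows is that a worker
which detaches the whole list at once needs one barrier per batch rather
than one per object.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for an object awaiting destruction (e.g. a fqdir). */
struct pending {
    struct pending *next;
    int id;
};

/* Lock-free list head, loosely modeled on the kernel's llist (assumption). */
static _Atomic(struct pending *) pending_head = NULL;

/* Counters so a caller can observe how much work was done. */
static int barrier_calls;
static int destroyed;

/* Stand-in for the expensive rcu_barrier(); here it only counts calls. */
static void fake_barrier(void)
{
    barrier_calls++;
}

/*
 * Queue one object for destruction.
 * Returns 1 if the list was empty before the push (the worker should be
 * kicked), 0 if a batch was already pending, -1 on allocation failure.
 */
static int queue_destroy(int id)
{
    struct pending *p = malloc(sizeof(*p));
    struct pending *old;

    if (!p)
        return -1;
    p->id = id;

    old = atomic_load(&pending_head);
    do {
        p->next = old;  /* on CAS failure 'old' is refreshed, so retry */
    } while (!atomic_compare_exchange_weak(&pending_head, &old, p));

    return old == NULL;
}

/*
 * Worker body: detach the whole list in one step, pay for a single
 * barrier, then free every entry. Returns the number of entries freed.
 */
static int destroy_batch(void)
{
    struct pending *list = atomic_exchange(&pending_head, NULL);
    int n = 0;

    if (!list)
        return 0;       /* nothing queued, so no barrier is needed */

    fake_barrier();     /* one barrier covers the whole batch */

    while (list) {
        struct pending *next = list->next;  /* read before freeing */

        free(list);
        list = next;
        destroyed++;
        n++;
    }
    return n;
}
```

Queueing three objects and then calling destroy_batch() once frees all
three with a single fake_barrier() call. The concern raised in the reply
still applies to this shape: the per-object teardown inside the loop
runs serially in one context, which only stays cheap while that teardown
itself is cheap.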