From: ebiederm@xmission.com (Eric W. Biederman)
To: Cong Wang
Cc: David Ahern, David Miller, Linux Kernel Network Developers,
    nikita.leshchenko@oracle.com, Roopa Prabhu, Stephen Hemminger,
    Ido Schimmel, Jiri Pirko, Saeed Mahameed, Alexander Aring,
    linux-wpan@vger.kernel.org, NetFilter, LKML
Subject: Re: [PATCH RFC/RFT net-next 00/17] net: Convert neighbor tables to per-namespace
Date: Wed, 25 Jul 2018 07:33:35 -0500
Message-ID: <87muufze8w.fsf@xmission.com>
In-Reply-To: (Cong Wang's message of "Tue, 24 Jul 2018 15:09:25 -0700")
References: <1a3f59a9-0ba5-c83f-16a6-f9550a84f693@gmail.com>
    <1a27e301-3275-b349-a2f8-afdfdc02f04f@gmail.com>
    <20180718.125938.2271502580775162784.davem@davemloft.net>
    <28c30574-391c-b4bd-c337-51d3040d901a@gmail.com>
    <5021d874-8e99-6eba-f24b-4257c62d4457@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Cong Wang writes:

> On Tue, Jul 24, 2018 at 8:14 AM David Ahern wrote:
>>
>> On 7/19/18 11:12 AM, Cong Wang wrote:
>> > On Thu, Jul 19, 2018 at 9:16 AM David Ahern wrote:
>> >>
>> >> Chatting with Nikolay about this and he brought up a good corollary - ip
>> >> fragmentation. It really is a similar problem in that memory is consumed
>> >> as a result of packets received from an external entity. The ipfrag
>> >> sysctls are per namespace with a limit that non-init_net namespaces can
>> >> not set high_thresh > the current value of init_net. Potential memory
>> >> consumed by fragments scales with the number of namespaces which is the
>> >> primary concern with making neighbor tables per namespace.
>> >
>> > Nothing new, already discussed:
>> > https://marc.info/?l=linux-netdev&m=140391416215988&w=2
>> >
>> > :)
>> >
>>
>> Neighbor tables, bridge fdbs, vxlan fdbs and ip fragments all consume
>> local memory resources due to received packets. bridge and vxlan fdb's
>> are fairly straightforward analogs to neighbor entries; they are per
>> device with no limits on the number of entries. Fragments have memory
>> limits per namespace. So neighbor tables are the only ones with this
>> strict limitation and concern on memory consumption.
>>
>> I get the impression there is no longer a strong resistance against
>> moving the tables to per namespace, but deciding what is the right
>> approach to handle backwards compatibility. Correct? Changing the
>> accounting is inevitably going to be noticeable to some use case(s), but
>> with sysctl settings it is a simple runtime update once the user knows
>> to make the change.
>
> This question definitely should go to Eric Biederman who was against
> my proposal.
>
> Let's add Eric into CC.

Given that the entries are per device and the devices are
per-namespace, semantically neighbours are already kept in a
per-namespace manner. So this is all about making the code not honor
global resource limits. Making the code not honor gc_thresh3.

Skimming through the code today, the default for gc_thresh3 is 1024,
which means that we limit the neighbour tables to 1024 entries per
protocol type.

There are some pretty compelling reasons, especially with ipv4, to keep
the subnet size down. Arp storms are a real thing.

I don't know off the top of my head what the reasons for limiting the
neighbour table sizes are.

I would be much more comfortable with a patchset like this if we did
some research and figured out the reasons why we have a global limit,
then changed the code to remove those limits.

When the limits are gone. When the code can support large subnets
without tuning.
When we don't have to worry about someone scanning all addresses in an
ipv6 subnet and causing a DoS on working machines, I think it is
completely appropriate to look to see if something per network
namespace needs to happen.

So please let's address the limits, not the fact that some specific
corner case ran into them.

If we are going to neuter gc_thresh3, let's go as far as removing it
entirely.

If we are going to make the neighbour table per something, let's make it
per network device. If we can afford the multiple hash tables then a
hash table per device is better. Perhaps we want to move to rhash
tables while we look at this, instead of an old hand-grown version of a
resizable hash table.

Unless I misread something, all your patchset did is reshuffle code and
data structures so that gc_thresh3 does not apply across namespaces.
That does not feel like it really fixes anything. That just lies to
people.

Further, unless I misread something, you are increasing the number of
timers to 3 per namespace. If I create a thousand network namespaces,
that feels like it will hurt system performance overall.

Eric
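
For a rough sense of scale, the arithmetic behind the two concerns in this message can be sketched in Python. This is a back-of-the-envelope model, not kernel code. The function names, the choice of two protocol tables (assumed here to be IPv4 and IPv6) and the thousand-namespace example are assumptions for illustration; only the gc_thresh3 default of 1024 and the figure of 3 timers per namespace come from the message, and the latter is stated there with an "unless I misread" caveat.

```python
# Illustrative model only: these helpers are hypothetical and do not
# correspond to any kernel function.

GC_THRESH3_DEFAULT = 1024   # default cap per table, as quoted in the message
TIMERS_PER_NAMESPACE = 3    # per-namespace timer count claimed in the message


def max_entries_global(protocols: int,
                       gc_thresh3: int = GC_THRESH3_DEFAULT) -> int:
    """Upper bound on neighbour entries with one system-wide table per protocol."""
    return protocols * gc_thresh3


def max_entries_per_namespace(namespaces: int, protocols: int,
                              gc_thresh3: int = GC_THRESH3_DEFAULT) -> int:
    """Upper bound when every namespace gets its own table per protocol."""
    return namespaces * protocols * gc_thresh3


def timers(namespaces: int) -> int:
    """Total timers if each namespace carries its own fixed set."""
    return namespaces * TIMERS_PER_NAMESPACE


if __name__ == "__main__":
    namespaces, protocols = 1000, 2  # assumed example values
    print("global cap:       ", max_entries_global(protocols))
    print("per-namespace cap:", max_entries_per_namespace(namespaces, protocols))
    print("timers:           ", timers(namespaces))
```

With the assumed inputs, the system-wide bound grows from 2048 entries to 2,048,000, and the timer count reaches 3000. Under this model the per-table limit is unchanged while the total it permits scales linearly with the number of namespaces, which is the point made above about gc_thresh3 no longer acting as a global limit.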