From: ebiederm@xmission.com (Eric W. Biederman)
To: David Ahern
Cc: Cong Wang, David Miller, Linux Kernel Network Developers,
    nikita.leshchenko@oracle.com, Roopa Prabhu, Stephen Hemminger,
    Ido Schimmel, Jiri Pirko, Saeed Mahameed, Alexander Aring,
    linux-wpan@vger.kernel.org, NetFilter, LKML
References: <1a3f59a9-0ba5-c83f-16a6-f9550a84f693@gmail.com>
    <1a27e301-3275-b349-a2f8-afdfdc02f04f@gmail.com>
    <20180718.125938.2271502580775162784.davem@davemloft.net>
    <28c30574-391c-b4bd-c337-51d3040d901a@gmail.com>
    <5021d874-8e99-6eba-f24b-4257c62d4457@gmail.com>
    <87muufze8w.fsf@xmission.com>
    <4b03b5f6-87ce-9ff2-7c14-598beebd8fb8@gmail.com>
    <87zhyfw70m.fsf@xmission.com>
    <87o9evt9a6.fsf@xmission.com>
Date: Tue, 14 Aug 2018 23:36:10 -0500
In-Reply-To: (David Ahern's message of "Mon, 13 Aug 2018 15:48:17 -0600")
Message-ID: <877eksi6v9.fsf@xmission.com>
Subject: Re: [PATCH RFC/RFT net-next 00/17] net: Convert neighbor tables to per-namespace
X-Mailing-List: linux-kernel@vger.kernel.org

David Ahern writes:

> On 7/25/18 1:17 PM, Eric W. Biederman wrote:
>> David Ahern writes:
>>
>>> On 7/25/18 11:38 AM, Eric W. Biederman wrote:
>>>>
>>>> Absolutely NOT. Global thresholds are exactly correct given the fact
>>>> you are running on a single kernel.
>>>>
>>>> Memory is not free (Even though we are swimming in enough of it memory
>>>> rarely matters). One of the few remaining challenges is for containers
>>>> is finding was to limit resources in such a way that one application
>>>> does not mess things up for another container during ordinary usage.
>>>>
>>>> It looks like the neighbour tables absolutely are that kind of problem,
>>>> because the artificial limits are too strict. Completely giving up on
>>>> limits does not seem right approach either. We need to fix the limits
>>>> we have (perhaps making them go away entirely), not just apply a
>>>> band-aid. Let's get to the bottom of this and make the system better.
>>>
>>> Eric: yes, they all share the global resource of memory and there should
>>> be limits on how many entries a remote entity can create.
>>>
>>> Network namespaces can provide a separation such that one namespace does
>>> not disrupt networking in another. It is absolutely appropriate to do
>>> so. Your rigid stance is inconsistent given the basic meaning of a
>>> network namespace and the parallels to this same problem -- bridges,
>>> vxlans, and ip fragments. Only neighbor tables are not per-device or per
>>> namespace; your insistence on global limits is missing the mark and wrong.
>>
>> That is not what I said. Let me rephrase and see if you understand.
>>
>> The problem appears to be of lots of devices. Fundamentally if you use
>> lots of network devices today unless you adjust gc_thresh3 you will run
>> out of neighbour table entries.
>>
>> The problem has a bigger scope than what you are looking at.
>>
>> If you fix the core problem you won't see the problem in the context
>> of network namespaces either.
>>
>> Default limits should be something that will never be hit unless
>> something goes crazy. We are hitting them. Therefore by definition
>> there is a bug in these limits.
>
> I disagree that the problem is a global limit. It is trivial for users
> to increase gc_thresh3. That does not solve the fundamental problem.
>
>>
>>
>> And yes there is absolutely a place for global limits on things like
>> inodes, file descriptors etc, that does not care about which part of the
>> kernel you are in. However hitting those limits in normal operation is
>> a bug.
>>
>> We have ourselves a bug.
>
> I agree we have a bug; we disagree on what that bug is.
>
> I am just back from vacation and re-read your responses. No where do you
> acknowledge the fundamental point of this patch set - that adding a new
> neighbor entry in one namespace can evict an entry in another namespace
> or worse networking in one namespace can fail due to table overflow
> because of entries from another. That is a real problem.
>
> It is not a matter of increasing the default gc_thresh3 to some number
> N; it is ensuring that regardless of the value of gc_thresh3 one
> namespace is not affected by another.

My suggestion is to look at the problem and its requirements and figure
out how to safely remove gc_thresh3 entirely.

We do have to ensure neighbour tables don't grow too large, but I expect
we can do it in a way that can scale from a small machine with few
neighbours to a large machine with many neighbours.

Perhaps the code just needs to limit, on a per-interface basis, the
number of neighbours that have never replied and that the code is still
probing for.

It still may make sense to have a global limit of perhaps a million
entries, just because hitting it would be an indicator that something
has truly gone weird.

> You created network namespaces and it provides isolation -- separate
> tables essentially -- for devices, FIB entries, sockets, etc, but you
> argue against completing the task with separate neighbor tables which is
> very strange given the impact (completely broken networking).

Namespaces provide isolation at the level of names. The objects still
share a kernel and compete for resources.

Not competing for resources would require each namespace to have its own
dedicated pool of resources, which over the whole machine would be much
less efficient.

That is the fundamental design difference between namespaces and VMs,
and it is why namespaces can be much cheaper and much more resource
efficient.
Reserving your worst-case resource usage ahead of time tends to result
in a lot of inefficiencies.

Eric
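
The per-interface idea suggested above can be sketched as a toy model. This is only an illustration under stated assumptions, not kernel code and not a patch from this thread: the class name NeighTable, its method names, and the per-interface cap of 128 are all hypothetical. Only the global figure of about a million entries comes from the message. The point it tries to show is that the strict cap applies only to neighbours that have never replied, counted per interface, while the global limit is a sanity check that should not be reached in normal operation.

```python
from collections import defaultdict

GLOBAL_SANITY_LIMIT = 1_000_000   # "perhaps a million entries", from the message
MAX_UNRESOLVED_PER_IFACE = 128    # hypothetical value, chosen for illustration


class NeighTable:
    """Toy model of a neighbour table (hypothetical, not the kernel's)."""

    def __init__(self, per_iface_unresolved=MAX_UNRESOLVED_PER_IFACE,
                 global_limit=GLOBAL_SANITY_LIMIT):
        self.per_iface_unresolved = per_iface_unresolved
        self.global_limit = global_limit
        self.entries = {}                   # (iface, addr) -> True once replied
        self.unresolved = defaultdict(int)  # iface -> entries still being probed

    def add_probe(self, iface, addr):
        """Start probing addr on iface; return False if a limit refuses it."""
        key = (iface, addr)
        if key in self.entries:
            return True                     # already tracked, nothing to do
        if len(self.entries) >= self.global_limit:
            return False                    # sanity limit: something went weird
        if self.unresolved[iface] >= self.per_iface_unresolved:
            return False                    # too many unanswered probes here
        self.entries[key] = False
        self.unresolved[iface] += 1
        return True

    def mark_replied(self, iface, addr):
        """Record a reply; the entry stops counting against the probe cap."""
        key = (iface, addr)
        if self.entries.get(key) is False:
            self.entries[key] = True
            self.unresolved[iface] -= 1
```

In this model a flood of unanswered probes on one interface is refused once that interface reaches its own cap, so it cannot consume entries needed by other interfaces, and neighbours that have replied are bounded only by the global sanity limit.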