From: Olof Johansson
Subject: Re: [PATCH] net/sock: move memory_allocated over to percpu_counter variables
Date: Thu, 6 Sep 2018 23:20:15 -0700
Message-ID:
References: <20180906192034.8467-1-olof@lixom.net> <20180907033257.2nlgiqm2t4jiwhzc@gondor.apana.org.au>
In-Reply-To: <20180907033257.2nlgiqm2t4jiwhzc@gondor.apana.org.au>
To: Herbert Xu
Cc: Eric Dumazet, David Miller, Neil Horman, Marcelo Ricardo Leitner, Vladislav Yasevich, Alexey Kuznetsov, Hideaki YOSHIFUJI, linux-crypto@vger.kernel.org, LKML, linux-sctp@vger.kernel.org, netdev, linux-decnet-user@lists.sourceforge.net, kernel-team

Hi,

On Thu, Sep 6, 2018 at 8:32 PM, Herbert Xu wrote:
> On Thu, Sep 06, 2018 at 12:33:58PM -0700, Eric Dumazet wrote:
>> On Thu, Sep 6, 2018 at 12:21 PM Olof Johansson wrote:
>> >
>> > Today these are all global shared variables per protocol, and in
>> > particular tcp_memory_allocated can get hot on a system with a
>> > large number of CPUs and a substantial number of connections.
>> >
>> > Moving it over to a per-cpu variable makes it significantly cheaper,
>> > and the added overhead when summing up the percpu copies is still
>> > smaller than the cost of having a hot cacheline bouncing around.
>>
>> I am curious. We never noticed contention on this variable, at least
>> for TCP.
>
> Yes these variables are heavily amortised so I'm surprised that
> they would cause much contention.
>
>> Please share some numbers with us.
>
> Indeed.

Certainly, I just had to collect them again.

This is on a dual Xeon box with ~150-200k TCP connections. I see about
0.7% of CPU spent in __sk_mem_{reduce,raise}_allocated in the inlined
atomic ops, most of it in the reduce path. The call path for reduce is
practically all from tcp_write_timer in softirq context:

  __sk_mem_reduce_allocated
  tcp_write_timer
  call_timer_fn
  run_timer_softirq
  __do_softirq
  irq_exit
  smp_apic_timer_interrupt
  apic_timer_interrupt
  cpuidle_enter_state

With this patch, I see about 0.18 + 0.11 + 0.07 = 0.36% in
percpu-related functions called from the same __sk_mem functions.

So that's a halving of cycle samples on this specific setup. The real
difference, though, is on another platform where atomics are more
expensive; there, this change makes a significant difference.
Unfortunately I can't share specifics, but I think this change stands
on its own on the dual Xeon setup as well, maybe with slightly less
strong wording on just how hot the variable/cacheline happens to be.

-Olof
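
P.S. For anyone skimming the thread, the shape of the change is roughly
the following. This is only a minimal sketch using the generic
percpu_counter API, not the hunks from the posted patch; the function
names here are illustrative:

  #include <linux/percpu_counter.h>

  /* One counter per protocol, replacing the shared atomic_long_t. */
  static struct percpu_counter tcp_memory_allocated_pcpu;

  /* Init once, e.g. from the protocol's init path. */
  static int __init proto_counter_init(void)
  {
          return percpu_counter_init(&tcp_memory_allocated_pcpu, 0,
                                     GFP_KERNEL);
  }

  /*
   * Hot path (__sk_mem_raise/reduce_allocated): updates land in a
   * per-cpu slot, so no shared cacheline bounces between CPUs.
   */
  static inline void sk_memory_allocated_add_pcpu(long amt)
  {
          percpu_counter_add(&tcp_memory_allocated_pcpu, amt);
  }

  /*
   * Read side: sums the per-cpu deltas. More expensive than reading a
   * single atomic, but off the hot path.
   */
  static inline s64 sk_memory_allocated_pcpu(void)
  {
          return percpu_counter_sum_positive(&tcp_memory_allocated_pcpu);
  }

The trade-off the numbers above are measuring is exactly this:
percpu_counter_add batches updates in a per-cpu slot (only folding into
the shared count past a batch threshold), while readers pay for the
cross-CPU sum.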