Date: Wed, 12 Feb 2020 15:26:24 +0000
From: Catalin Marinas
To: "qi.fuli@fujitsu.com"
Cc: Andrea Arcangeli, Will Deacon, Jon Masters, Rafael Aquini,
	Mark Salter, "linux-mm@kvack.org", "linux-kernel@vger.kernel.org",
	"linux-arm-kernel@lists.infradead.org"
Subject: Re: [PATCH 2/2] arm64: tlb: skip tlbi broadcast for single threaded TLB flushes
Message-ID: <20200212152624.GA587247@arrakis.emea.arm.com>
References: <20200203201745.29986-1-aarcange@redhat.com>
	<20200203201745.29986-3-aarcange@redhat.com>
	<6e59905d-3e5b-bbd5-d192-9f18a0a152f5@jp.fujitsu.com>
In-Reply-To: <6e59905d-3e5b-bbd5-d192-9f18a0a152f5@jp.fujitsu.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Feb 12, 2020 at 02:13:56PM +0000, qi.fuli@fujitsu.com wrote:
> On 2/4/20 5:17 AM, Andrea Arcangeli wrote:
> > With multiple NUMA nodes and multiple sockets, the tlbi broadcast
> > must be delivered through the interconnects, in turn increasing the
> > interconnect traffic and the latency of the tlbi broadcast
> > instruction.
> >
> > Even within a single NUMA node, the latency of the tlbi broadcast
> > instruction increases almost linearly with the number of CPUs trying
> > to send tlbi broadcasts at the same time.
> >
> > When the process is single threaded, however, we can achieve full
> > SMP scalability by skipping the tlbi broadcasting. Other arches
> > already deploy this optimization.
> >
> > After the local TLB flush, however, the ASID context goes out of
> > sync on all CPUs except the local one. This can be tracked in
> > mm_cpumask(mm): if the bit is set, the ASID context is stale for
> > that CPU. This results in an extra local ASID TLB flush only if a
> > single threaded process is migrated to a different CPU, and only
> > after a TLB flush. No extra local TLB flush is needed for the common
> > case of single threaded processes context switching on the same
> > CPU, nor for multithreaded processes.
> >
> > Skipping the tlbi instruction broadcasting is already implemented in
> > local_flush_tlb_all(); this patch only extends it to flush_tlb_mm(),
> > flush_tlb_range() and flush_tlb_page() too.
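[The staleness-tracking scheme quoted above can be modeled in a few lines of userspace C. This is a simplified sketch for illustration only, not the arm64 implementation: `struct toy_mm`, the function names, and the flush counters are all invented, and the real kernel operates on per-ASID TLB state rather than a plain bitmask.]

```c
#include <stdint.h>

#define NR_CPUS 32

/* Toy model: a single-threaded process flushes only its local TLB and
 * marks the ASID context stale on every other CPU; the stale bit is
 * paid off with one local flush if the task is later migrated there. */
struct toy_mm {
	uint32_t stale_mask;        /* bit set => ASID stale on that CPU */
	unsigned local_flushes;     /* local (non-broadcast) flushes issued */
	unsigned broadcast_flushes; /* tlbi broadcasts issued (none here) */
};

/* flush_tlb_mm() analogue for a single-threaded mm running on `cpu` */
static void toy_flush_single_threaded(struct toy_mm *mm, int cpu)
{
	mm->local_flushes++;             /* local "tlbi" only, no broadcast */
	mm->stale_mask = ~(1u << cpu);   /* every other CPU is now stale */
}

/* switch_mm() analogue: the task is scheduled onto `cpu` */
static void toy_switch_to_cpu(struct toy_mm *mm, int cpu)
{
	if (mm->stale_mask & (1u << cpu)) {  /* stale ASID: flush locally */
		mm->local_flushes++;
		mm->stale_mask &= ~(1u << cpu);
	}
}
```

[Note how the model reproduces the cost claim in the patch description: rescheduling on the same CPU after a flush costs nothing extra, while a migration after a flush costs exactly one additional local flush.]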
> >
> > Here's the result of 32 CPUs (ARMv8 Ampere) running mprotect at the
> > same time from 32 single threaded processes before the patch:
> >
> >  Performance counter stats for './loop' (3 runs):
> >
> >                  0      dummy
> >
> >           2.121353 +- 0.000387 seconds time elapsed  ( +- 0.02% )
> >
> > and with the patch applied:
> >
> >  Performance counter stats for './loop' (3 runs):
> >
> >                  0      dummy
> >
> >          0.1197750 +- 0.0000827 seconds time elapsed  ( +- 0.07% )
>
> I have tested this patch on ThunderX2 with the Himeno benchmark [1]
> with the LARGE calculation size. Here are the results:
>
>   w/o patch: MFLOPS : 1149.480174
>   w/  patch: MFLOPS : 1110.653003
>
> In order to validate the effectiveness of the patch, I ran a
> single-threaded program, which calls mprotect() in a loop to issue
> the tlbi broadcast instruction on one CPU core. At the same time, I
> ran the Himeno benchmark on another CPU core. The results are:
>
>   w/o patch: MFLOPS :  860.238792
>   w/  patch: MFLOPS : 1110.449666
>
> Though the Himeno benchmark is a microbenchmark, I hope it helps.

It doesn't really help. What if you have a two-thread program calling
mprotect() in a loop? IOW, how is this relevant to real-world
scenarios?

Thanks.

-- 
Catalin