Message-ID: <1e8642d5-0e2d-5747-d0d2-5aa0817ea4af@arm.com>
Date: Fri, 9 Sep 2022 10:54:18 +0530
Subject: Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation
To: Yicong Yang, akpm@linux-foundation.org, linux-mm@kvack.org,
 linux-arm-kernel@lists.infradead.org, x86@kernel.org,
 catalin.marinas@arm.com, will@kernel.org, linux-doc@vger.kernel.org
Cc: corbet@lwn.net, peterz@infradead.org,
 arnd@arndb.de, linux-kernel@vger.kernel.org,
 darren@os.amperecomputing.com, yangyicong@hisilicon.com,
 huzhanyuan@oppo.com, lipeifeng@oppo.com, zhangshiming@oppo.com,
 guojian@oppo.com, realmz6@gmail.com, linux-mips@vger.kernel.org,
 openrisc@lists.librecores.org, linuxppc-dev@lists.ozlabs.org,
 linux-riscv@lists.infradead.org, linux-s390@vger.kernel.org,
 Barry Song <21cnbao@gmail.com>, wangkefeng.wang@huawei.com,
 xhao@linux.alibaba.com, prime.zeng@hisilicon.com, Barry Song,
 Nadav Amit, Mel Gorman
References: <20220822082120.8347-1-yangyicong@huawei.com>
 <20220822082120.8347-5-yangyicong@huawei.com>
From: Anshuman Khandual
In-Reply-To: <20220822082120.8347-5-yangyicong@huawei.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 8/22/22 13:51, Yicong Yang wrote:
> From: Barry Song
>
> on x86, batched and deferred tlb shootdown has led to a 90%
> performance increase on tlb shootdown. on arm64, HW can do
> tlb shootdown without software IPI. But sync tlbi is still
> quite expensive.
>
> Even running the simplest program which requires swapout can
> prove this is true,
>  #include <sys/types.h>
>  #include <stdio.h>
>  #include <sys/mman.h>
>  #include <string.h>
>
>  int main()
>  {
>  #define SIZE (1 * 1024 * 1024)
>  	volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
>  					 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
>
>  	memset(p, 0x88, SIZE);
>
>  	for (int k = 0; k < 10000; k++) {
>  		/* swap in */
>  		for (int i = 0; i < SIZE; i += 4096) {
>  			(void)p[i];
>  		}
>
>  		/* swap out */
>  		madvise(p, SIZE, MADV_PAGEOUT);
>  	}
>  }
>
> Perf result on snapdragon 888 with 8 cores by using zRAM
> as the swap block device.
>
>  ~ # perf record taskset -c 4 ./a.out
>  [ perf record: Woken up 10 times to write data ]
>  [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
>  ~ # perf report
>  # To display the perf.data header info, please use --header/--header-only options.
>  #
>  #
>  # Total Lost Samples: 0
>  #
>  # Samples: 60K of event 'cycles'
>  # Event count (approx.): 35706225414
>  #
>  # Overhead  Command  Shared Object      Symbol
>  # ........  .......  .................  .............................................................................
>  #
>     21.07%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irq
>      8.23%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
>      6.67%  a.out    [kernel.kallsyms]  [k] filemap_map_pages
>      6.16%  a.out    [kernel.kallsyms]  [k] __zram_bvec_write
>      5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush
>      3.71%  a.out    [kernel.kallsyms]  [k] _raw_spin_lock
>      3.49%  a.out    [kernel.kallsyms]  [k] memset64
>      1.63%  a.out    [kernel.kallsyms]  [k] clear_page
>      1.42%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock
>      1.26%  a.out    [kernel.kallsyms]  [k] mod_zone_state.llvm.8525150236079521930
>      1.23%  a.out    [kernel.kallsyms]  [k] xas_load
>      1.15%  a.out    [kernel.kallsyms]  [k] zram_slot_lock
>
> ptep_clear_flush() takes 5.36% CPU in the micro-benchmark
> swapping in/out a page mapped by only one process. If the
> page is mapped by multiple processes, typically, like more
> than 100 on a phone, the overhead would be much higher as
> we have to run tlb flush 100 times for one single page.
> Plus, tlb flush overhead will increase with the number
> of CPU cores due to the bad scalability of tlb shootdown
> in HW, so those ARM64 servers should expect much higher
> overhead.
>
> Further perf annotate shows 95% cpu time of ptep_clear_flush
> is actually used by the final dsb() to wait for the completion
> of tlb flush.
> This provides us a very good chance to leverage
> the existing batched tlb in kernel. The minimum modification
> is that we only send async tlbi in the first stage and we send
> dsb when we have to sync in the second stage.
>
> With the above simplest micro benchmark, elapsed time to
> finish the program decreases around 5%.
>
> Typical elapsed time w/o patch:
>  ~ # time taskset -c 4 ./a.out
>  0.21user 14.34system 0:14.69elapsed
> w/ patch:
>  ~ # time taskset -c 4 ./a.out
>  0.22user 13.45system 0:13.80elapsed
>
> Also, Yicong Yang added the following observation.
>     Tested with benchmark in the commit on Kunpeng920 arm64 server,
>     observed an improvement around 12.5% with command
>     `time ./swap_bench`.
>             w/o             w/
>     real    0m13.460s       0m11.771s
>     user    0m0.248s        0m0.279s
>     sys     0m12.039s       0m11.458s
>
>     Originally, a 16.99% overhead of ptep_clear_flush() was observed,
>     which has been eliminated by this patch:
>
>     [root@localhost yang]# perf record -- ./swap_bench && perf report
>     [...]
>     16.99%  swap_bench  [kernel.kallsyms]  [k] ptep_clear_flush
>
> Cc: Jonathan Corbet
> Cc: Nadav Amit
> Cc: Mel Gorman
> Tested-by: Yicong Yang
> Tested-by: Xin Hao
> Signed-off-by: Barry Song
> Signed-off-by: Yicong Yang
> ---
>  .../features/vm/TLB/arch-support.txt          |  2 +-
>  arch/arm64/Kconfig                            |  1 +
>  arch/arm64/include/asm/tlbbatch.h             | 12 ++++++++
>  arch/arm64/include/asm/tlbflush.h             | 28 +++++++++++++++++--
>  4 files changed, 40 insertions(+), 3 deletions(-)
>  create mode 100644 arch/arm64/include/asm/tlbbatch.h
>
> diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt
> index 1c009312b9c1..2caf815d7c6c 100644
> --- a/Documentation/features/vm/TLB/arch-support.txt
> +++ b/Documentation/features/vm/TLB/arch-support.txt
> @@ -9,7 +9,7 @@
>      |       alpha: | TODO |
>      |         arc: | TODO |
>      |         arm: | TODO |
> -    |       arm64: | TODO |
> +    |       arm64: |  ok  |
>      |        csky: | TODO |
>      |     hexagon: | TODO |
>      |        ia64: | TODO |
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 571cc234d0b3..09d45cd6d665 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -93,6 +93,7 @@ config ARM64
>  	select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
>  	select ARCH_SUPPORTS_NUMA_BALANCING
>  	select ARCH_SUPPORTS_PAGE_TABLE_CHECK
> +	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
>  	select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
>  	select ARCH_WANT_DEFAULT_BPF_JIT
>  	select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
> diff --git a/arch/arm64/include/asm/tlbbatch.h b/arch/arm64/include/asm/tlbbatch.h
> new file mode 100644
> index 000000000000..fedb0b87b8db
> --- /dev/null
> +++ b/arch/arm64/include/asm/tlbbatch.h
> @@ -0,0 +1,12 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ARCH_ARM64_TLBBATCH_H
> +#define _ARCH_ARM64_TLBBATCH_H
> +
> +struct arch_tlbflush_unmap_batch {
> +	/*
> +	 * For arm64, HW can do tlb shootdown, so we don't
> +	 * need to record cpumask for sending IPI
> +	 */
> +};
> +
> +#endif /* _ARCH_ARM64_TLBBATCH_H */
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index 412a3b9a3c25..23cbc987321a 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -254,17 +254,24 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
>  	dsb(ish);
>  }
>
> -static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
> +
> +static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
>  					 unsigned long uaddr)
>  {
>  	unsigned long addr;
>
>  	dsb(ishst);
> -	addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm));
> +	addr = __TLBI_VADDR(uaddr, ASID(mm));
>  	__tlbi(vale1is, addr);
>  	__tlbi_user(vale1is, addr);
>  }
>
> +static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
> +					 unsigned long uaddr)
> +{
> +	return __flush_tlb_page_nosync(vma->vm_mm, uaddr);
> +}
> +
>  static inline void flush_tlb_page(struct vm_area_struct *vma,
>  				  unsigned long uaddr)
>  {
> @@ -272,6 +279,23 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
>  	dsb(ish);
>  }
>
> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> +{
> +	return true;
> +}

Always defer and batch up the TLB flush, unconditionally?

> +
> +static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
> +					struct mm_struct *mm,
> +					unsigned long uaddr)
> +{
> +	__flush_tlb_page_nosync(mm, uaddr);
> +}
> +
> +static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> +{
> +	dsb(ish);
> +}

Will accumulating __flush_tlb_page_nosync() calls without a corresponding
dsb(ish), and then issuing a single dsb(ish) via arch_tlbbatch_flush(), have
the same effect from an architecture perspective?

> +
>  /*
>   * This is meant to avoid soft lock-ups on large TLB flushing ranges and not
>   * necessarily a performance improvement.