Date: Sat, 6 Jan 2024 21:47:00 +0800
From: Jisheng Zhang
To: Alexandre Ghiti
Cc: Paul Walmsley, Palmer Dabbelt, Albert Ou,
    linux-riscv@lists.infradead.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] riscv: Add support for BATCHED_UNMAP_TLB_FLUSH
References: <20240102141851.105144-1-alexghiti@rivosinc.com>

On Fri, Jan 05, 2024 at 02:36:44PM +0100, Alexandre Ghiti wrote:
> On Thu, Jan 4, 2024 at 6:42 PM Alexandre Ghiti wrote:
> >
> > Hi Jisheng,
> >
> > On Wed, Jan 3, 2024 at 12:10 PM Jisheng Zhang wrote:
> > >
> > > On Tue, Jan 02, 2024 at 03:18:51PM +0100, Alexandre Ghiti wrote:
> > > > Allow to defer the flushing of the TLB when unmapping pages, which
> > > > allows to reduce the number of IPIs and the number of sfence.vma.
> > > >
> > > > The ubenchmark used in commit 43b3dfdd0455 ("arm64: support
> > > > batched/deferred tlb shootdown during page reclamation/migration")
> > > > shows a good performance improvement, and perf reports an important
> > > > decrease in time spent flushing the tlb (results come from qemu):
> > >
> > > Hi Alex,
> > >
> > > I tried this micro benchmark with your patch on the T-HEAD TH1520
> > > platform, but I didn't see any performance improvement for the
> > > micro benchmark.
Per > > > myunderstanding, the micro benchmark is special case for arm64 because > > > in a normal tlb flush flow, below sequence is necessary: > > > > > > tlbi > > > dsb > > > > > > > > > while with BATCHED_UNMAP_TLB_FLUSH, the arm64 just does 'tlbi', leaving > > > the dsb to the arch_tlbbatch_flush(). So the final result is > > > > > > several 'tlbi + dsb' sequence VS. several 'tlbi' instructions + only one dsb > > > The performance improvement comes from the unnecessary dsb eliminations. > > > > Some batching should take place, and with this patch, we only send one > > "full" sfence.vma instead of a "local" sfence.vma for each page, it > > seems weird that you don't see any improvement, I would have thought > > that one "full" sfence.vma would be better. > > > > > > > > Do you have suitable benchmark(s) for BATCHED_UNMAP_TLB_FLUSH on riscv? > > > > Can you give the following benchmark a try? I simply created threads > > and dispatched them on all the cpus to force IPI usage, that should be > > way better if the batching of the first ubenchmark is not enough to > > exacerbate performance improvements, let me know and thanks for your > > tests! 
> >
> > #define _GNU_SOURCE
> > #include <errno.h>
> > #include <pthread.h>
> > #include <sched.h>
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <string.h>
> > #include <unistd.h>
> > #include <sys/mman.h>
> >
> > int stick_this_thread_to_core(int core_id) {
> >         int num_cores = sysconf(_SC_NPROCESSORS_ONLN);
> >         if (core_id < 0 || core_id >= num_cores)
> >                 return EINVAL;
> >
> >         cpu_set_t cpuset;
> >         CPU_ZERO(&cpuset);
> >         CPU_SET(core_id, &cpuset);
> >
> >         pthread_t current_thread = pthread_self();
> >         return pthread_setaffinity_np(current_thread,
> >                                       sizeof(cpu_set_t), &cpuset);
> > }
> >
> > static void *fn_thread (void *p_data)
> > {
> >         stick_this_thread_to_core((int)p_data);
> >
> >         while (1) {
> >                 sleep(1);
> >         }
> >
> >         return NULL;
> > }
> >
> > int main()
> > {
> > #define SIZE (1 * 1024 * 1024)
> >         volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
> >                                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
> >         pthread_t threads[4];
> >         int ret;
> >
> >         for (int i = 0; i < 4; ++i) {
> >                 ret = pthread_create(&threads[i], NULL, fn_thread, (void *)i);
> >                 if (ret)
> >                 {
> >                         printf("%s", strerror(ret));
> >                 }
> >         }
> >
> >         memset(p, 0x88, SIZE);
> >
> >         for (int k = 0; k < 500 /* 10000 */; k++) {
> >                 /* swap in */
> >                 for (int i = 0; i < SIZE; i += 4096) {
> >                         (void)p[i];
> >                 }
> >
> >                 /* swap out */
> >                 madvise(p, SIZE, MADV_PAGEOUT);
> >         }
> >
> >         for (int i = 0; i < 4; i++)
> >         {
> >                 pthread_cancel(threads[i]);
> >         }
> >
> >         for (int i = 0; i < 4; i++)
> >         {
> >                 pthread_join(threads[i], NULL);
> >         }
> >
> >         return 0;
> > }
>
> So I removed the dust from my unmatched and ran the benchmarks I proposed:
>
> Without this patch:
> * benchmark from commit 43b3dfdd0455 (4 runs) : ~20.3s
> * same benchmark with threads (4 runs) : ~27.4s
>
> With this patch:
> * benchmark from commit 43b3dfdd0455 (4 runs) : ~17.9s
> * same benchmark with threads (4 runs) : ~18.1s
>
> So a small improvement for the single thread benchmark, but it depends
> on the number of pages that get flushed, so to me
> that's not
> applicable for the general case. For the same benchmark with multiple
> threads, that's ~34% improvement. I'll add those numbers to the v2,
> and Jisheng, if you can provide some too, I'll add them too!

Hi Alex, the threaded version shows ~78% improvement! Impressive! So for
the patch:

Reviewed-by: Jisheng Zhang
Tested-by: Jisheng Zhang

Thanks

>
> Thanks,
> Alex
>
> > >
> > > Thanks
> > >
> > > > Before this patch:
> > > >
> > > > real    2m1.135s
> > > > user    0m0.980s
> > > > sys     2m0.096s
> > > >
> > > > 4.83%  batch_tlb  [kernel.kallsyms]  [k] __flush_tlb_range
> > > >
> > > > After this patch:
> > > >
> > > > real    1m0.543s
> > > > user    0m1.059s
> > > > sys     0m59.489s
> > > >
> > > > 0.14%  batch_tlb  [kernel.kallsyms]  [k] __flush_tlb_range
> > > >
> > > > Signed-off-by: Alexandre Ghiti
> > > > ---
> > > >  arch/riscv/Kconfig                |  1 +
> > > >  arch/riscv/include/asm/tlbbatch.h | 15 +++++++
> > > >  arch/riscv/include/asm/tlbflush.h | 10 +++++
> > > >  arch/riscv/mm/tlbflush.c          | 71 ++++++++++++++++++++++---------
> > > >  4 files changed, 77 insertions(+), 20 deletions(-)
> > > >  create mode 100644 arch/riscv/include/asm/tlbbatch.h
> > > >
> > > > diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> > > > index 7603bd8ab333..aa07bd43b138 100644
> > > > --- a/arch/riscv/Kconfig
> > > > +++ b/arch/riscv/Kconfig
> > > > @@ -53,6 +53,7 @@ config RISCV
> > > >         select ARCH_USE_MEMTEST
> > > >         select ARCH_USE_QUEUED_RWLOCKS
> > > >         select ARCH_USES_CFI_TRAPS if CFI_CLANG
> > > > +       select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH if SMP && MMU
> > > >         select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT if MMU
> > > >         select ARCH_WANT_FRAME_POINTERS
> > > >         select ARCH_WANT_GENERAL_HUGETLB if !RISCV_ISA_SVNAPOT
> > > > diff --git a/arch/riscv/include/asm/tlbbatch.h b/arch/riscv/include/asm/tlbbatch.h
> > > > new file mode 100644
> > > > index 000000000000..46014f70b9da
> > > > --- /dev/null
> > > > +++ b/arch/riscv/include/asm/tlbbatch.h
> > > > @@ -0,0 +1,15 @@
> > > > +/* SPDX-License-Identifier: GPL-2.0-only */
> > > > +/*
> > > > + * Copyright (C) 2023 Rivos Inc.
> > > > + */
> > > > +
> > > > +#ifndef _ASM_RISCV_TLBBATCH_H
> > > > +#define _ASM_RISCV_TLBBATCH_H
> > > > +
> > > > +#include <linux/cpumask.h>
> > > > +
> > > > +struct arch_tlbflush_unmap_batch {
> > > > +       struct cpumask cpumask;
> > > > +};
> > > > +
> > > > +#endif /* _ASM_RISCV_TLBBATCH_H */
> > > > diff --git a/arch/riscv/include/asm/tlbflush.h b/arch/riscv/include/asm/tlbflush.h
> > > > index 8f3418c5f172..f0b731ccc0c2 100644
> > > > --- a/arch/riscv/include/asm/tlbflush.h
> > > > +++ b/arch/riscv/include/asm/tlbflush.h
> > > > @@ -46,6 +46,16 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end);
> > > >  void flush_pmd_tlb_range(struct vm_area_struct *vma, unsigned long start,
> > > >                         unsigned long end);
> > > >  #endif
> > > > +
> > > > +#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
> > > > +bool arch_tlbbatch_should_defer(struct mm_struct *mm);
> > > > +void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
> > > > +                              struct mm_struct *mm,
> > > > +                              unsigned long uaddr);
> > > > +void arch_flush_tlb_batched_pending(struct mm_struct *mm);
> > > > +void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
> > > > +#endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
> > > > +
> > > >  #else /* CONFIG_SMP && CONFIG_MMU */
> > > > diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c
> > > > index e6659d7368b3..bb623bca0a7d 100644
> > > > --- a/arch/riscv/mm/tlbflush.c
> > > > +++ b/arch/riscv/mm/tlbflush.c
> > > > @@ -93,29 +93,23 @@ static void __ipi_flush_tlb_range_asid(void *info)
> > > >         local_flush_tlb_range_asid(d->start, d->size, d->stride, d->asid);
> > > >  }
> > > >
> > > > -static void __flush_tlb_range(struct mm_struct *mm, unsigned long start,
> > > > -                             unsigned long size, unsigned long stride)
> > > > +static void __flush_tlb_range(struct cpumask *cmask, unsigned long asid,
> > > > +                             unsigned long start, unsigned long size,
> > > > +                             unsigned long stride)
> > > >  {
> > > >         struct flush_tlb_range_data ftd;
> > > > -       const struct cpumask *cmask;
> > > > -       unsigned long asid = FLUSH_TLB_NO_ASID;
> > > >         bool broadcast;
> > > >
> > > > -       if (mm) {
> > > > -               unsigned int cpuid;
> > > > +       if (cpumask_empty(cmask))
> > > > +               return;
> > > >
> > > > -               cmask = mm_cpumask(mm);
> > > > -               if (cpumask_empty(cmask))
> > > > -                       return;
> > > > +       if (cmask != cpu_online_mask) {
> > > > +               unsigned int cpuid;
> > > >
> > > >                 cpuid = get_cpu();
> > > >                 /* check if the tlbflush needs to be sent to other CPUs */
> > > >                 broadcast = cpumask_any_but(cmask, cpuid) < nr_cpu_ids;
> > > > -
> > > > -               if (static_branch_unlikely(&use_asid_allocator))
> > > > -                       asid = atomic_long_read(&mm->context.id) & asid_mask;
> > > >         } else {
> > > > -               cmask = cpu_online_mask;
> > > >                 broadcast = true;
> > > >         }
> > > >
> > > > @@ -135,25 +129,34 @@ static void __flush_tlb_range(struct mm_struct *mm, unsigned long start,
> > > >                 local_flush_tlb_range_asid(start, size, stride, asid);
> > > >         }
> > > >
> > > > -       if (mm)
> > > > +       if (cmask != cpu_online_mask)
> > > >                 put_cpu();
> > > >  }
> > > >
> > > > +static inline unsigned long get_mm_asid(struct mm_struct *mm)
> > > > +{
> > > > +       return static_branch_unlikely(&use_asid_allocator) ?
> > > > +               atomic_long_read(&mm->context.id) & asid_mask : FLUSH_TLB_NO_ASID;
> > > > +}
> > > > +
> > > >  void flush_tlb_mm(struct mm_struct *mm)
> > > >  {
> > > > -       __flush_tlb_range(mm, 0, FLUSH_TLB_MAX_SIZE, PAGE_SIZE);
> > > > +       __flush_tlb_range(mm_cpumask(mm), get_mm_asid(mm),
> > > > +                         0, FLUSH_TLB_MAX_SIZE, PAGE_SIZE);
> > > >  }
> > > >
> > > >  void flush_tlb_mm_range(struct mm_struct *mm,
> > > >                         unsigned long start, unsigned long end,
> > > >                         unsigned int page_size)
> > > >  {
> > > > -       __flush_tlb_range(mm, start, end - start, page_size);
> > > > +       __flush_tlb_range(mm_cpumask(mm), get_mm_asid(mm),
> > > > +                         start, end - start, page_size);
> > > >  }
> > > >
> > > >  void flush_tlb_page(struct vm_area_struct *vma, unsigned long addr)
> > > >  {
> > > > -       __flush_tlb_range(vma->vm_mm, addr, PAGE_SIZE, PAGE_SIZE);
> > > > +       __flush_tlb_range(mm_cpumask(vma->vm_mm), get_mm_asid(vma->vm_mm),
> > > > +                         addr, PAGE_SIZE, PAGE_SIZE);
> > > >  }
> > > >
> > > >  void flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
> > > > @@ -185,18 +188,46 @@ void flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
> > > >                 }
> > > >         }
> > > >
> > > > -       __flush_tlb_range(vma->vm_mm, start, end - start, stride_size);
> > > > +       __flush_tlb_range(mm_cpumask(vma->vm_mm), get_mm_asid(vma->vm_mm),
> > > > +                         start, end - start, stride_size);
> > > >  }
> > > >
> > > >  void flush_tlb_kernel_range(unsigned long start, unsigned long end)
> > > >  {
> > > > -       __flush_tlb_range(NULL, start, end - start, PAGE_SIZE);
> > > > +       __flush_tlb_range((struct cpumask *)cpu_online_mask, FLUSH_TLB_NO_ASID,
> > > > +                         start, end - start, PAGE_SIZE);
> > > >  }
> > > >
> > > >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > >  void flush_pmd_tlb_range(struct vm_area_struct *vma, unsigned long start,
> > > >                         unsigned long end)
> > > >  {
> > > > -       __flush_tlb_range(vma->vm_mm, start, end - start, PMD_SIZE);
> > > > +       __flush_tlb_range(mm_cpumask(vma->vm_mm), get_mm_asid(vma->vm_mm),
> > > > +                         start, end - start, PMD_SIZE);
> > > >  }
> > > >  #endif
> > > > +
> > > > +#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
> > > > +bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> > > > +{
> > > > +       return true;
> > > > +}
> > > > +
> > > > +void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
> > > > +                              struct mm_struct *mm,
> > > > +                              unsigned long uaddr)
> > > > +{
> > > > +       cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
> > > > +}
> > > > +
> > > > +void arch_flush_tlb_batched_pending(struct mm_struct *mm)
> > > > +{
> > > > +       flush_tlb_mm(mm);
> > > > +}
> > > > +
> > > > +void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> > > > +{
> > > > +       __flush_tlb_range(&batch->cpumask, FLUSH_TLB_NO_ASID, 0,
> > > > +                         FLUSH_TLB_MAX_SIZE, PAGE_SIZE);
> > > > +}
> > > > +#endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
> > > > --
> > > > 2.39.2
> > > >
> > > >
> > > > _______________________________________________
> > > > linux-riscv mailing list
> > > > linux-riscv@lists.infradead.org
> > > > http://lists.infradead.org/mailman/listinfo/linux-riscv