Subject: Re: [PATCH] riscv: Add support for BATCHED_UNMAP_TLB_FLUSH
From: Alexandre Ghiti
Date: Mon, 8 Jan 2024 15:27:38 +0100
To: Jisheng Zhang, Alexandre Ghiti
Cc: Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-riscv@lists.infradead.org, linux-kernel@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org
References: <20240102141851.105144-1-alexghiti@rivosinc.com>

Hi Jisheng,

On 06/01/2024 15:05, Jisheng Zhang wrote:
> On Sat, Jan 06, 2024 at 09:47:04PM +0800, Jisheng Zhang wrote:
>> On Fri, Jan 05, 2024 at 02:36:44PM +0100, Alexandre Ghiti wrote:
>>> On Thu, Jan 4, 2024 at 6:42 PM Alexandre Ghiti wrote:
>>>> Hi Jisheng,
>>>>
>>>> On Wed, Jan 3, 2024 at 12:10 PM Jisheng Zhang wrote:
>>>>> On Tue, Jan 02, 2024 at 03:18:51PM +0100, Alexandre Ghiti wrote:
>>>>>> Allow deferring the TLB flush when unmapping pages, which reduces
>>>>>> the number of IPIs and the number of sfence.vma instructions.
>>>>>>
>>>>>> The ubenchmark used in commit 43b3dfdd0455 ("arm64: support
>>>>>> batched/deferred tlb shootdown during page reclamation/migration")
>>>>>> shows a good performance improvement, and perf reports a large
>>>>>> decrease in the time spent flushing the TLB (results come from qemu):
>>>>> Hi Alex,
>>>>>
>>>>> I tried this micro benchmark with your patch on a T-HEAD TH1520
>>>>> platform, but I didn't see any performance improvement for the
>>>>> micro benchmark.
>>>>> Per my understanding, the micro benchmark is a special case for
>>>>> arm64, because in a normal TLB flush flow the following sequence is
>>>>> necessary:
>>>>>
>>>>> tlbi
>>>>> dsb
>>>>>
>>>>> while with BATCHED_UNMAP_TLB_FLUSH, arm64 just does the 'tlbi',
>>>>> leaving the dsb to arch_tlbbatch_flush(). So the final result is
>>>>> several 'tlbi + dsb' sequences vs. several 'tlbi' instructions plus
>>>>> only one dsb: the performance improvement comes from eliminating
>>>>> the unnecessary dsbs.
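To make the contrast concrete on riscv, the two flush patterns can be
sketched as follows. This is illustrative pseudocode only (no such loops
exist verbatim in the kernel); the arch_tlbbatch_*() calls are the hooks
added by this patch:

	/* Unbatched unmap: one (possibly broadcast) flush per page. */
	for (i = 0; i < npages; i++)
		flush_tlb_page(vma, uaddr[i]);	/* IPIs + sfence.vma each time */

	/* Batched unmap: per page, only accumulate the target CPUs... */
	for (i = 0; i < npages; i++)
		arch_tlbbatch_add_pending(&batch, mm, uaddr[i]);	/* cpumask_or() */

	/* ...then pay for a single IPI round and one full sfence.vma. */
	arch_tlbbatch_flush(&batch);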
>>>> Some batching should take place even so: with this patch, we only
>>>> send one "full" sfence.vma instead of a "local" sfence.vma for each
>>>> page, so it seems weird that you don't see any improvement; I would
>>>> have thought that one "full" sfence.vma would be better.
>>>>
>>>>> Do you have suitable benchmark(s) for BATCHED_UNMAP_TLB_FLUSH on riscv?
>>>>
>>>> Can you give the following benchmark a try? I simply created threads
>>>> and dispatched them on all the cpus to force IPI usage, which should
>>>> help if the batching of the first ubenchmark is not enough to make
>>>> the performance improvements visible. Let me know, and thanks for
>>>> your tests!
>>>>
>>>> #define _GNU_SOURCE
>>>> #include <errno.h>
>>>> #include <pthread.h>
>>>> #include <sched.h>
>>>> #include <stdio.h>
>>>> #include <stdlib.h>
>>>> #include <string.h>
>>>> #include <sys/mman.h>
>>>> #include <unistd.h>
>>>>
>>>> int stick_this_thread_to_core(int core_id) {
>>>> 	int num_cores = sysconf(_SC_NPROCESSORS_ONLN);
>>>> 	if (core_id < 0 || core_id >= num_cores)
>>>> 		return EINVAL;
>>>>
>>>> 	cpu_set_t cpuset;
>>>> 	CPU_ZERO(&cpuset);
>>>> 	CPU_SET(core_id, &cpuset);
>>>>
>>>> 	pthread_t current_thread = pthread_self();
>>>> 	return pthread_setaffinity_np(current_thread,
>>>> 				      sizeof(cpu_set_t), &cpuset);
>>>> }
>>>>
>>>> static void *fn_thread(void *p_data)
>>>> {
>>>> 	stick_this_thread_to_core((int)(long)p_data);
>>>>
>>>> 	while (1) {
>>>> 		sleep(1);
>>>> 	}
>>>>
>>>> 	return NULL;
>>>> }
>>>>
>>>> int main()
>>>> {
>>>> #define SIZE (1 * 1024 * 1024)
>>>> 	volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
>>>> 					 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
>>>> 	pthread_t threads[4];
>>>> 	int ret;
>>>>
>>>> 	for (int i = 0; i < 4; ++i) {
>>>> 		ret = pthread_create(&threads[i], NULL, fn_thread,
>>>> 				     (void *)(long)i);
>>>> 		if (ret)
>>>> 			printf("%s", strerror(ret));
>>>> 	}
>>>>
>>>> 	memset((void *)p, 0x88, SIZE);
>>>>
>>>> 	for (int k = 0; k < 500 /* 10000 */; k++) {
>>>> 		/* swap in */
>>>> 		for (int i = 0; i < SIZE; i += 4096) {
>>>> 			(void)p[i];
>>>> 		}
>>>>
>>>> 		/* swap out */
>>>> 		madvise((void *)p, SIZE, MADV_PAGEOUT);
>>>> 	}
>>>>
>>>> 	for (int i = 0; i < 4; i++)
>>>> 		pthread_cancel(threads[i]);
>>>>
>>>> 	for (int i = 0; i < 4; i++)
>>>> 		pthread_join(threads[i], NULL);
>>>>
>>>> 	return 0;
>>>> }
>>>>
>>> So I removed the dust from my unmatched and ran the benchmarks I proposed:
>>>
>>> Without this patch:
>>> * benchmark from commit 43b3dfdd0455 (4 runs): ~20.3s
>>> * same benchmark with threads (4 runs): ~27.4s
>>>
>>> With this patch:
>>> * benchmark from commit 43b3dfdd0455 (4 runs): ~17.9s
>>> * same benchmark with threads (4 runs): ~18.1s
>>>
>>> So a small improvement for the single-threaded benchmark, but it
>>> depends on the number of pages that get flushed, so to me that's not
>>> representative of the general case. For the same benchmark with
>>> multiple threads, that's a ~34% improvement. I'll add those numbers
>>> to the v2, and Jisheng, if you can provide some too, I'll add them
>>> as well!
>>
>> Hi Alex,
>>
>> the threaded version shows a ~78% improvement! Impressive!
>
> One more thing for when you cook the v2: it's better to also patch the
> riscv entry in Documentation/features/vm/TLB/arch-support.txt.
>
>> So for the patch:
>>
>> Reviewed-by: Jisheng Zhang
>> Tested-by: Jisheng Zhang
>>
>> Thanks

Thanks for your tests and review, and for pointing out the missing
documentation update!

Alex
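For reference, the arch-support.txt update mentioned above amounts to
flipping the riscv line of that feature table from "TODO" to "ok". A rough
sketch, with the surrounding entries and exact column widths elided:

	--- a/Documentation/features/vm/TLB/arch-support.txt
	+++ b/Documentation/features/vm/TLB/arch-support.txt
	-    |       riscv: | TODO |
	+    |       riscv: |  ok  |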
>>> Thanks,
>>>
>>> Alex
>>>
>>>>> Thanks
>>>>>
>>>>>> Before this patch:
>>>>>>
>>>>>> real	2m1.135s
>>>>>> user	0m0.980s
>>>>>> sys	2m0.096s
>>>>>>
>>>>>> 4.83%  batch_tlb  [kernel.kallsyms]  [k] __flush_tlb_range
>>>>>>
>>>>>> After this patch:
>>>>>>
>>>>>> real	1m0.543s
>>>>>> user	0m1.059s
>>>>>> sys	0m59.489s
>>>>>>
>>>>>> 0.14%  batch_tlb  [kernel.kallsyms]  [k] __flush_tlb_range
>>>>>>
>>>>>> Signed-off-by: Alexandre Ghiti
>>>>>> ---
>>>>>>  arch/riscv/Kconfig                |  1 +
>>>>>>  arch/riscv/include/asm/tlbbatch.h | 15 +++++++
>>>>>>  arch/riscv/include/asm/tlbflush.h | 10 +++++
>>>>>>  arch/riscv/mm/tlbflush.c          | 71 ++++++++++++++++++++++---------
>>>>>>  4 files changed, 77 insertions(+), 20 deletions(-)
>>>>>>  create mode 100644 arch/riscv/include/asm/tlbbatch.h
>>>>>>
>>>>>> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
>>>>>> index 7603bd8ab333..aa07bd43b138 100644
>>>>>> --- a/arch/riscv/Kconfig
>>>>>> +++ b/arch/riscv/Kconfig
>>>>>> @@ -53,6 +53,7 @@ config RISCV
>>>>>>  	select ARCH_USE_MEMTEST
>>>>>>  	select ARCH_USE_QUEUED_RWLOCKS
>>>>>>  	select ARCH_USES_CFI_TRAPS if CFI_CLANG
>>>>>> +	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH if SMP && MMU
>>>>>>  	select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT if MMU
>>>>>>  	select ARCH_WANT_FRAME_POINTERS
>>>>>>  	select ARCH_WANT_GENERAL_HUGETLB if !RISCV_ISA_SVNAPOT
>>>>>> diff --git a/arch/riscv/include/asm/tlbbatch.h b/arch/riscv/include/asm/tlbbatch.h
>>>>>> new file mode 100644
>>>>>> index 000000000000..46014f70b9da
>>>>>> --- /dev/null
>>>>>> +++ b/arch/riscv/include/asm/tlbbatch.h
>>>>>> @@ -0,0 +1,15 @@
>>>>>> +/* SPDX-License-Identifier: GPL-2.0-only */
>>>>>> +/*
>>>>>> + * Copyright (C) 2023 Rivos Inc.
>>>>>> + */
>>>>>> +
>>>>>> +#ifndef _ASM_RISCV_TLBBATCH_H
>>>>>> +#define _ASM_RISCV_TLBBATCH_H
>>>>>> +
>>>>>> +#include <linux/cpumask.h>
>>>>>> +
>>>>>> +struct arch_tlbflush_unmap_batch {
>>>>>> +	struct cpumask cpumask;
>>>>>> +};
>>>>>> +
>>>>>> +#endif /* _ASM_RISCV_TLBBATCH_H */
>>>>>> diff --git a/arch/riscv/include/asm/tlbflush.h b/arch/riscv/include/asm/tlbflush.h
>>>>>> index 8f3418c5f172..f0b731ccc0c2 100644
>>>>>> --- a/arch/riscv/include/asm/tlbflush.h
>>>>>> +++ b/arch/riscv/include/asm/tlbflush.h
>>>>>> @@ -46,6 +46,16 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end);
>>>>>>  void flush_pmd_tlb_range(struct vm_area_struct *vma, unsigned long start,
>>>>>>  			unsigned long end);
>>>>>>  #endif
>>>>>> +
>>>>>> +#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
>>>>>> +bool arch_tlbbatch_should_defer(struct mm_struct *mm);
>>>>>> +void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
>>>>>> +			       struct mm_struct *mm,
>>>>>> +			       unsigned long uaddr);
>>>>>> +void arch_flush_tlb_batched_pending(struct mm_struct *mm);
>>>>>> +void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
>>>>>> +#endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
>>>>>> +
>>>>>>  #else /* CONFIG_SMP && CONFIG_MMU */
>>>>>>
>>>>>>  #define flush_tlb_all() local_flush_tlb_all()
>>>>>> diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c
>>>>>> index e6659d7368b3..bb623bca0a7d 100644
>>>>>> --- a/arch/riscv/mm/tlbflush.c
>>>>>> +++ b/arch/riscv/mm/tlbflush.c
>>>>>> @@ -93,29 +93,23 @@ static void __ipi_flush_tlb_range_asid(void *info)
>>>>>>  	local_flush_tlb_range_asid(d->start, d->size, d->stride, d->asid);
>>>>>>  }
>>>>>>
>>>>>> -static void __flush_tlb_range(struct mm_struct *mm, unsigned long start,
>>>>>> -			      unsigned long size, unsigned long stride)
>>>>>> +static void __flush_tlb_range(struct cpumask *cmask, unsigned long asid,
>>>>>> +			      unsigned long start, unsigned long size,
>>>>>> +			      unsigned long stride)
>>>>>>  {
>>>>>>  	struct flush_tlb_range_data ftd;
>>>>>> -	const struct cpumask *cmask;
>>>>>> -	unsigned long asid = FLUSH_TLB_NO_ASID;
>>>>>>  	bool broadcast;
>>>>>>
>>>>>> -	if (mm) {
>>>>>> -		unsigned int cpuid;
>>>>>> +	if (cpumask_empty(cmask))
>>>>>> +		return;
>>>>>>
>>>>>> -		cmask = mm_cpumask(mm);
>>>>>> -		if (cpumask_empty(cmask))
>>>>>> -			return;
>>>>>> +	if (cmask != cpu_online_mask) {
>>>>>> +		unsigned int cpuid;
>>>>>>
>>>>>>  		cpuid = get_cpu();
>>>>>>  		/* check if the tlbflush needs to be sent to other CPUs */
>>>>>>  		broadcast = cpumask_any_but(cmask, cpuid) < nr_cpu_ids;
>>>>>> -
>>>>>> -		if (static_branch_unlikely(&use_asid_allocator))
>>>>>> -			asid = atomic_long_read(&mm->context.id) & asid_mask;
>>>>>>  	} else {
>>>>>> -		cmask = cpu_online_mask;
>>>>>>  		broadcast = true;
>>>>>>  	}
>>>>>>
>>>>>> @@ -135,25 +129,34 @@ static void __flush_tlb_range(struct mm_struct *mm, unsigned long start,
>>>>>>  		local_flush_tlb_range_asid(start, size, stride, asid);
>>>>>>  	}
>>>>>>
>>>>>> -	if (mm)
>>>>>> +	if (cmask != cpu_online_mask)
>>>>>>  		put_cpu();
>>>>>>  }
>>>>>>
>>>>>> +static inline unsigned long get_mm_asid(struct mm_struct *mm)
>>>>>> +{
>>>>>> +	return static_branch_unlikely(&use_asid_allocator) ?
>>>>>> +		atomic_long_read(&mm->context.id) & asid_mask : FLUSH_TLB_NO_ASID;
>>>>>> +}
>>>>>> +
>>>>>>  void flush_tlb_mm(struct mm_struct *mm)
>>>>>>  {
>>>>>> -	__flush_tlb_range(mm, 0, FLUSH_TLB_MAX_SIZE, PAGE_SIZE);
>>>>>> +	__flush_tlb_range(mm_cpumask(mm), get_mm_asid(mm),
>>>>>> +			  0, FLUSH_TLB_MAX_SIZE, PAGE_SIZE);
>>>>>>  }
>>>>>>
>>>>>>  void flush_tlb_mm_range(struct mm_struct *mm,
>>>>>>  			unsigned long start, unsigned long end,
>>>>>>  			unsigned int page_size)
>>>>>>  {
>>>>>> -	__flush_tlb_range(mm, start, end - start, page_size);
>>>>>> +	__flush_tlb_range(mm_cpumask(mm), get_mm_asid(mm),
>>>>>> +			  start, end - start, page_size);
>>>>>>  }
>>>>>>
>>>>>>  void flush_tlb_page(struct vm_area_struct *vma, unsigned long addr)
>>>>>>  {
>>>>>> -	__flush_tlb_range(vma->vm_mm, addr, PAGE_SIZE, PAGE_SIZE);
>>>>>> +	__flush_tlb_range(mm_cpumask(vma->vm_mm), get_mm_asid(vma->vm_mm),
>>>>>> +			  addr, PAGE_SIZE, PAGE_SIZE);
>>>>>>  }
>>>>>>
>>>>>>  void flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
>>>>>> @@ -185,18 +188,46 @@ void flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
>>>>>>  	}
>>>>>>  }
>>>>>>
>>>>>> -	__flush_tlb_range(vma->vm_mm, start, end - start, stride_size);
>>>>>> +	__flush_tlb_range(mm_cpumask(vma->vm_mm), get_mm_asid(vma->vm_mm),
>>>>>> +			  start, end - start, stride_size);
>>>>>>  }
>>>>>>
>>>>>>  void flush_tlb_kernel_range(unsigned long start, unsigned long end)
>>>>>>  {
>>>>>> -	__flush_tlb_range(NULL, start, end - start, PAGE_SIZE);
>>>>>> +	__flush_tlb_range((struct cpumask *)cpu_online_mask, FLUSH_TLB_NO_ASID,
>>>>>> +			  start, end - start, PAGE_SIZE);
>>>>>>  }
>>>>>>
>>>>>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>>>  void flush_pmd_tlb_range(struct vm_area_struct *vma, unsigned long start,
>>>>>>  			unsigned long end)
>>>>>>  {
>>>>>> -	__flush_tlb_range(vma->vm_mm, start, end - start, PMD_SIZE);
>>>>>> +	__flush_tlb_range(mm_cpumask(vma->vm_mm), get_mm_asid(vma->vm_mm),
>>>>>> +			  start, end - start, PMD_SIZE);
>>>>>>  }
>>>>>>  #endif
>>>>>> +
>>>>>> +#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
>>>>>> +bool arch_tlbbatch_should_defer(struct mm_struct *mm)
>>>>>> +{
>>>>>> +	return true;
>>>>>> +}
>>>>>> +
>>>>>> +void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
>>>>>> +			       struct mm_struct *mm,
>>>>>> +			       unsigned long uaddr)
>>>>>> +{
>>>>>> +	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
>>>>>> +}
>>>>>> +
>>>>>> +void arch_flush_tlb_batched_pending(struct mm_struct *mm)
>>>>>> +{
>>>>>> +	flush_tlb_mm(mm);
>>>>>> +}
>>>>>> +
>>>>>> +void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
>>>>>> +{
>>>>>> +	__flush_tlb_range(&batch->cpumask, FLUSH_TLB_NO_ASID, 0,
>>>>>> +			  FLUSH_TLB_MAX_SIZE, PAGE_SIZE);
>>>>>> +}
>>>>>> +#endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
>>>>>> --
>>>>>> 2.39.2
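For readers wondering where these new hooks actually fire: they are driven
by the generic reclaim code in mm/rmap.c. The sketch below is heavily
condensed from the real set_tlb_ubc_flush_pending() and
try_to_unmap_flush(); the should-defer policy check, dirty/writable
tracking, and the real function signatures are elided or simplified, and
arch_flush_tlb_batched_pending(), the hook used when page tables must be
modified while a batch is still pending, is not shown:

	/* Called for each page being unmapped by reclaim/migration. */
	static void set_tlb_ubc_flush_pending(struct mm_struct *mm,
					      unsigned long uaddr)
	{
		struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;

		/* Record which CPUs will need flushing; no fence, no IPI yet. */
		arch_tlbbatch_add_pending(&tlb_ubc->arch, mm, uaddr);
		tlb_ubc->flush_required = true;
	}

	/* Called once a whole batch of unmaps is done. */
	void try_to_unmap_flush(void)
	{
		struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;

		if (!tlb_ubc->flush_required)
			return;

		/* One IPI round + one full sfence.vma for the whole batch. */
		arch_tlbbatch_flush(&tlb_ubc->arch);
		tlb_ubc->flush_required = false;
	}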