From: Alexandre Ghiti
Date: Fri, 8 Dec 2023 15:28:06 +0100
Subject: Re: [PATCH RFC/RFT 1/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings
To: Christophe Leroy
Cc: Catalin Marinas, Will Deacon, Thomas Bogendoerfer, Michael Ellerman,
    Nicholas Piggin, Paul Walmsley, Palmer Dabbelt, Albert Ou,
    Andrew Morton, Ved Shanbhogue, Matt Evans, Dylan Jhong,
    linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org,
    linux-mips@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
    linux-riscv@lists.infradead.org, linux-mm@kvack.org

Hi Christophe,

On Thu, Dec 7, 2023 at 4:52 PM Christophe Leroy wrote:
>
>
> On 07/12/2023 at 16:03, Alexandre Ghiti wrote:
> > In 6.5, we removed the vmalloc fault path because that can't work (see
> > [1] [2]). Then, in order to make sure that new page table entries were
> > seen by the page table walker, we had to preventively emit a sfence.vma
> > on all harts [3], but this solution is very costly since it relies on IPIs.
> >
> > And even then, we could end up in a loop of vmalloc faults if a vmalloc
> > allocation is done in the IPI path (for example if it is traced, see
> > [4]), which could result in a kernel stack overflow.
> >
> > Those preventive sfence.vma needed to be emitted because:
> >
> > - if the uarch caches invalid entries, the new mapping may not be
> >   observed by the page table walker and an invalidation may be needed.
> > - if the uarch does not cache invalid entries, a reordered access
> >   could "miss" the new mapping and trap: in that case, we would actually
> >   only need to retry the access, no sfence.vma is required.
> >
> > So this patch removes those preventive sfence.vma and actually handles
> > the possible (and unlikely) exceptions. And since the kernel stack
> > mappings lie in the vmalloc area, this handling must be done very early,
> > when the trap is taken, at the very beginning of handle_exception: this
> > also rules out the vmalloc allocations in the fault path.
> >
> > Note that for now, we emit a sfence.vma even for uarchs that do not
> > cache invalid entries, as we have no means to know that: this will be
> > fixed in the next patch.
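
(To summarise the scheme above in rough C -- for illustration only, since
the real check must run in assembly at trap entry, before the kernel stack
is known to be mapped; cause_is_page_fault() and new_vmalloc_fault() are
made-up names:)

/* One bit per cpu, set whenever a new vmalloc mapping is created. */
extern u64 new_vmalloc[];
/* Set if the uarch caches invalid entries, see the next patch. */
extern bool tlb_caching_invalid_entries;

static bool cause_is_page_fault(unsigned long cause)
{
	return cause == EXC_INST_PAGE_FAULT ||
	       cause == EXC_LOAD_PAGE_FAULT ||
	       cause == EXC_STORE_PAGE_FAULT;
}

/*
 * Returns true if the trap may have been caused by a new vmalloc
 * mapping, in which case the faulting access is simply retried.
 */
static bool new_vmalloc_fault(int cpu, unsigned long cause, unsigned long tval)
{
	/* IRQs, other exceptions and user addresses take the normal path */
	if (!cause_is_page_fault(cause) || (long)tval >= 0)
		return false;

	/* No new vmalloc mapping since this cpu's last sfence.vma */
	if (!(new_vmalloc[cpu / 64] & (1ULL << (cpu % 64))))
		return false;

	new_vmalloc[cpu / 64] &= ~(1ULL << (cpu % 64));

	if (tlb_caching_invalid_entries)
		local_flush_tlb_all();	/* sfence.vma */

	return true;	/* sret: retry the faulting access */
}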
> >
> > Link: https://lore.kernel.org/linux-riscv/20230531093817.665799-1-bjorn@kernel.org/ [1]
> > Link: https://lore.kernel.org/linux-riscv/20230801090927.2018653-1-dylan@andestech.com [2]
> > Link: https://lore.kernel.org/linux-riscv/20230725132246.817726-1-alexghiti@rivosinc.com/ [3]
> > Link: https://lore.kernel.org/lkml/20200508144043.13893-1-joro@8bytes.org/ [4]
> > Signed-off-by: Alexandre Ghiti
> > ---
> >  arch/riscv/include/asm/cacheflush.h  | 19 +++++-
> >  arch/riscv/include/asm/thread_info.h |  5 ++
> >  arch/riscv/kernel/asm-offsets.c      |  5 ++
> >  arch/riscv/kernel/entry.S            | 94 ++++++++++++++++++++++++++++
> >  arch/riscv/mm/init.c                 |  2 +
> >  5 files changed, 124 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/riscv/include/asm/cacheflush.h b/arch/riscv/include/asm/cacheflush.h
> > index 3cb53c4df27c..a916cbc69d47 100644
> > --- a/arch/riscv/include/asm/cacheflush.h
> > +++ b/arch/riscv/include/asm/cacheflush.h
> > @@ -37,7 +37,24 @@ static inline void flush_dcache_page(struct page *page)
> >       flush_icache_mm(vma->vm_mm, 0)
> >
> >  #ifdef CONFIG_64BIT
> > -#define flush_cache_vmap(start, end)  flush_tlb_kernel_range(start, end)
> > +extern u64 new_vmalloc[];
>
> Can you have the table size here? Would help GCC static analysis for
> boundary checking.

Yes, I'll do

> > +extern char _end[];
> > +#define flush_cache_vmap flush_cache_vmap
> > +static inline void flush_cache_vmap(unsigned long start, unsigned long end)
> > +{
> > +     if ((start < VMALLOC_END && end > VMALLOC_START) ||
> > +         (start < MODULES_END && end > MODULES_VADDR)) {
>
> Can you use is_vmalloc_or_module_addr() instead?

Yes, I'll do

> > +             int i;
> > +
> > +             /*
> > +              * We don't care if concurrently a cpu resets this value since
> > +              * the only place this can happen is in handle_exception() where
> > +              * an sfence.vma is emitted.
> > +              */
> > +             for (i = 0; i < NR_CPUS / sizeof(u64) + 1; ++i)
>
> Use ARRAY_SIZE() ?

And that too :)

Thanks for the review,

Alex

> > +                     new_vmalloc[i] = -1ULL;
> > +     }
> > +}
> >  #endif
> >
> >  #ifndef CONFIG_SMP
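
(For what it's worth, with the suggestions above applied -- a sized extern
declaration, is_vmalloc_or_module_addr() and ARRAY_SIZE() -- the helper
could look like the untested sketch below. Note that ARRAY_SIZE() only
works because the extern now carries the size, which is also what enables
GCC's bounds checking, and that passing only the start address assumes a
mapping never straddles the vmalloc/module region boundaries:)

extern u64 new_vmalloc[NR_CPUS / sizeof(u64) + 1];

#define flush_cache_vmap flush_cache_vmap
static inline void flush_cache_vmap(unsigned long start, unsigned long end)
{
	if (is_vmalloc_or_module_addr((void *)start)) {
		int i;

		/*
		 * We don't care if a cpu concurrently resets this value
		 * since the only place that can happen is in
		 * handle_exception(), where an sfence.vma is emitted.
		 */
		for (i = 0; i < ARRAY_SIZE(new_vmalloc); ++i)
			new_vmalloc[i] = -1ULL;
	}
}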
> > diff --git a/arch/riscv/include/asm/thread_info.h b/arch/riscv/include/asm/thread_info.h
> > index 1833beb00489..8fe12fa6c329 100644
> > --- a/arch/riscv/include/asm/thread_info.h
> > +++ b/arch/riscv/include/asm/thread_info.h
> > @@ -60,6 +60,11 @@ struct thread_info {
> >       long            user_sp;        /* User stack pointer */
> >       int             cpu;
> >       unsigned long   syscall_work;   /* SYSCALL_WORK_ flags */
> > +     /*
> > +      * Used in handle_exception() to save a0, a1 and a2 before knowing if we
> > +      * can access the kernel stack.
> > +      */
> > +     unsigned long a0, a1, a2;
> >  };
> >
> >  /*
> > diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
> > index d6a75aac1d27..340c1c84560d 100644
> > --- a/arch/riscv/kernel/asm-offsets.c
> > +++ b/arch/riscv/kernel/asm-offsets.c
> > @@ -34,10 +34,15 @@ void asm_offsets(void)
> >       OFFSET(TASK_THREAD_S9, task_struct, thread.s[9]);
> >       OFFSET(TASK_THREAD_S10, task_struct, thread.s[10]);
> >       OFFSET(TASK_THREAD_S11, task_struct, thread.s[11]);
> > +
> > +     OFFSET(TASK_TI_CPU, task_struct, thread_info.cpu);
> >       OFFSET(TASK_TI_FLAGS, task_struct, thread_info.flags);
> >       OFFSET(TASK_TI_PREEMPT_COUNT, task_struct, thread_info.preempt_count);
> >       OFFSET(TASK_TI_KERNEL_SP, task_struct, thread_info.kernel_sp);
> >       OFFSET(TASK_TI_USER_SP, task_struct, thread_info.user_sp);
> > +     OFFSET(TASK_TI_A0, task_struct, thread_info.a0);
> > +     OFFSET(TASK_TI_A1, task_struct, thread_info.a1);
> > +     OFFSET(TASK_TI_A2, task_struct, thread_info.a2);
> >
> >       OFFSET(TASK_THREAD_F0, task_struct, thread.fstate.f[0]);
> >       OFFSET(TASK_THREAD_F1, task_struct, thread.fstate.f[1]);
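
(The new_vmalloc_check macro in the entry.S hunk below open-codes the
BIT_WORD()/BIT_MASK() arithmetic with shifts, since no C helper can be
called this early. Decoded into C, the address and mask it computes are
roughly the following -- hypothetical helpers, shown only to explain the
assembly arithmetic:)

static inline u64 *new_vmalloc_word(unsigned int cpu)
{
	/* srli a1, a2, 6; slli a1, a1, 3: index cpu / 64, scaled to bytes */
	return &new_vmalloc[cpu >> 6];
}

static inline u64 new_vmalloc_mask(unsigned int cpu)
{
	/* bit_pos = cpu % 64 = cpu - ((cpu >> 6) << 6) */
	return 1ULL << (cpu & 63);
}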
> > diff --git a/arch/riscv/kernel/entry.S b/arch/riscv/kernel/entry.S
> > index 143a2bb3e697..3a3c7b563816 100644
> > --- a/arch/riscv/kernel/entry.S
> > +++ b/arch/riscv/kernel/entry.S
> > @@ -14,6 +14,88 @@
> >  #include
> >  #include
> >
> > +.macro new_vmalloc_check
> > +     REG_S   a0, TASK_TI_A0(tp)
> > +     REG_S   a1, TASK_TI_A1(tp)
> > +     REG_S   a2, TASK_TI_A2(tp)
> > +
> > +     csrr    a0, CSR_CAUSE
> > +     /* Exclude IRQs */
> > +     blt     a0, zero, _new_vmalloc_restore_context
> > +     /* Only check new_vmalloc if we are in page/protection fault */
> > +     li      a1, EXC_LOAD_PAGE_FAULT
> > +     beq     a0, a1, _new_vmalloc_kernel_address
> > +     li      a1, EXC_STORE_PAGE_FAULT
> > +     beq     a0, a1, _new_vmalloc_kernel_address
> > +     li      a1, EXC_INST_PAGE_FAULT
> > +     bne     a0, a1, _new_vmalloc_restore_context
> > +
> > +_new_vmalloc_kernel_address:
> > +     /* Is it a kernel address? */
> > +     csrr    a0, CSR_TVAL
> > +     bge     a0, zero, _new_vmalloc_restore_context
> > +
> > +     /* Check if a new vmalloc mapping appeared that could explain the trap */
> > +
> > +     /*
> > +      * Computes:
> > +      * a0 = &new_vmalloc[BIT_WORD(cpu)]
> > +      * a1 = BIT_MASK(cpu)
> > +      */
> > +     REG_L   a2, TASK_TI_CPU(tp)
> > +     /*
> > +      * Compute the new_vmalloc element position:
> > +      * (cpu / 64) * 8 = (cpu >> 6) << 3
> > +      */
> > +     srli    a1, a2, 6
> > +     slli    a1, a1, 3
> > +     la      a0, new_vmalloc
> > +     add     a0, a0, a1
> > +     /*
> > +      * Compute the bit position in the new_vmalloc element:
> > +      * bit_pos = cpu % 64 = cpu - (cpu / 64) * 64 = cpu - ((cpu >> 6) << 6)
> > +      *         = cpu - (((cpu >> 6) << 3) << 3)
> > +      */
> > +     slli    a1, a1, 3
> > +     sub     a1, a2, a1
> > +     /* Compute the "get mask": 1 << bit_pos */
> > +     li      a2, 1
> > +     sll     a1, a2, a1
> > +
> > +     /* Check the value of new_vmalloc for this cpu */
> > +     ld      a2, 0(a0)
> > +     and     a2, a2, a1
> > +     beq     a2, zero, _new_vmalloc_restore_context
> > +
> > +     ld      a2, 0(a0)
> > +     not     a1, a1
> > +     and     a1, a2, a1
> > +     sd      a1, 0(a0)
> > +
> > +     /* Only emit a sfence.vma if the uarch caches invalid entries */
> > +     la      a0, tlb_caching_invalid_entries
> > +     lb      a0, 0(a0)
> > +     beqz    a0, _new_vmalloc_no_caching_invalid_entries
> > +     sfence.vma
> > +_new_vmalloc_no_caching_invalid_entries:
> > +     // debug
> > +     la      a0, nr_sfence_vma_handle_exception
> > +     li      a1, 1
> > +     amoadd.w a0, a1, (a0)
> > +     // end debug
> > +     REG_L   a0, TASK_TI_A0(tp)
> > +     REG_L   a1, TASK_TI_A1(tp)
> > +     REG_L   a2, TASK_TI_A2(tp)
> > +     csrw    CSR_SCRATCH, x0
> > +     sret
> > +
> > +_new_vmalloc_restore_context:
> > +     REG_L   a0, TASK_TI_A0(tp)
> > +     REG_L   a1, TASK_TI_A1(tp)
> > +     REG_L   a2, TASK_TI_A2(tp)
> > +.endm
> > +
> > +
> >  SYM_CODE_START(handle_exception)
> >       /*
> >        * If coming from userspace, preserve the user thread pointer and load
> > @@ -25,6 +107,18 @@ SYM_CODE_START(handle_exception)
> >
> >  _restore_kernel_tpsp:
> >       csrr    tp, CSR_SCRATCH
> > +
> > +     /*
> > +      * The RISC-V kernel does not eagerly emit a sfence.vma after each
> > +      * new vmalloc mapping, which may result in exceptions:
> > +      * - if the uarch caches invalid entries, the new mapping would not be
> > +      *   observed by the page table walker and an invalidation is needed.
> > +      * - if the uarch does not cache invalid entries, a reordered access
> > +      *   could "miss" the new mapping and trap: in that case, we only need
> > +      *   to retry the access, no sfence.vma is required.
> > +      */
> > +     new_vmalloc_check
> > +
> >       REG_S   sp, TASK_TI_KERNEL_SP(tp)
> >
> >  #ifdef CONFIG_VMAP_STACK
> > diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> > index 0798bd861dcb..379403de6c6f 100644
> > --- a/arch/riscv/mm/init.c
> > +++ b/arch/riscv/mm/init.c
> > @@ -36,6 +36,8 @@
> >
> >  #include "../kernel/head.h"
> >
> > +u64 new_vmalloc[NR_CPUS / sizeof(u64) + 1];
> > +
> >  struct kernel_mapping kernel_map __ro_after_init;
> >  EXPORT_SYMBOL(kernel_map);
> >  #ifdef CONFIG_XIP_KERNEL