From: Raghavendra Rao Ananta
Date: Tue, 4 Apr 2023 14:07:06 -0700
Subject: Re: [PATCH v2 7/7] KVM: arm64: Create a fast stage-2 unmap path
References: <20230206172340.2639971-1-rananta@google.com> <20230206172340.2639971-8-rananta@google.com>
To: Oliver Upton
Cc: h@linux.dev, Oliver Upton, Marc Zyngier, Ricardo Koller, Reiji Watanabe,
 James Morse, Alexandru Elisei, Suzuki K Poulose, Will Deacon, Paolo Bonzini,
 Catalin Marinas, Jing Zhang, Colton Lewis, linux-arm-kernel@lists.infradead.org,
 kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, kvm@vger.kernel.org

On Tue, Apr 4, 2023 at 12:19 PM Oliver Upton wrote:
>
> On Tue, Apr 04, 2023 at 10:52:01AM -0700, Raghavendra Rao Ananta wrote:
> > On Wed, Mar 29, 2023 at 5:42 PM Oliver Upton wrote:
> > >
> > > On Mon, Feb 06, 2023 at 05:23:40PM +0000, Raghavendra Rao Ananta wrote:
> > > > The current implementation of the stage-2 unmap walker
> > > > traverses the entire page-table to clear and flush the TLBs
> > > > for each entry. This could be very expensive, especially if
> > > > the VM is not backed by hugepages. The unmap operation could be
> > > > made efficient by disconnecting the table at the very
> > > > top (level at which the largest block mapping can be hosted)
> > > > and do the rest of the unmapping using free_removed_table().
> > > > If the system supports FEAT_TLBIRANGE, flush the entire range
> > > > that has been disconnected from the rest of the page-table.
> > > >
> > > > Suggested-by: Ricardo Koller
> > > > Signed-off-by: Raghavendra Rao Ananta
> > > > ---
> > > >  arch/arm64/kvm/hyp/pgtable.c | 44 ++++++++++++++++++++++++++++++++++++
> > > >  1 file changed, 44 insertions(+)
> > > >
> > > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > > index 0858d1fa85d6b..af3729d0971f2 100644
> > > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > > @@ -1017,6 +1017,49 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > > >  	return 0;
> > > >  }
> > > >
> > > > +/*
> > > > + * The fast walker executes only if the unmap size is exactly equal to the
> > > > + * largest block mapping supported (i.e. at KVM_PGTABLE_MIN_BLOCK_LEVEL),
> > > > + * such that the underneath hierarchy at KVM_PGTABLE_MIN_BLOCK_LEVEL can
> > > > + * be disconnected from the rest of the page-table without the need to
> > > > + * traverse all the PTEs, at all the levels, and unmap each and every one
> > > > + * of them. The disconnected table is freed using free_removed_table().
> > > > + */
> > > > +static int fast_stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > > > +				    enum kvm_pgtable_walk_flags visit)
> > > > +{
> > > > +	struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
> > > > +	kvm_pte_t *childp = kvm_pte_follow(ctx->old, mm_ops);
> > > > +	struct kvm_s2_mmu *mmu = ctx->arg;
> > > > +
> > > > +	if (!kvm_pte_valid(ctx->old) || ctx->level != KVM_PGTABLE_MIN_BLOCK_LEVEL)
> > > > +		return 0;
> > > > +
> > > > +	if (!stage2_try_break_pte(ctx, mmu))
> > > > +		return -EAGAIN;
> > > > +
> > > > +	/*
> > > > +	 * Gain back a reference for stage2_unmap_walker() to free
> > > > +	 * this table entry from KVM_PGTABLE_MIN_BLOCK_LEVEL - 1.
> > > > +	 */
> > > > +	mm_ops->get_page(ctx->ptep);
> > >
> > > Doesn't this run the risk of a potential UAF if the refcount was 1 before
> > > calling stage2_try_break_pte()? IOW, stage2_try_break_pte() will drop
> > > the refcount to 0 on the page before this ever gets called.
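
(Sketch for illustration of the ordering hazard raised above, using only the
helpers visible in the quoted hunk; this is not the fix that was ultimately
adopted, just one way to keep the refcount from reaching zero between the
break and the re-take:)

	/*
	 * Hypothetical reordering: take the extra reference *before*
	 * stage2_try_break_pte() drops its reference, so the table page
	 * can never be freed in between. Drop it again if the break fails.
	 */
	mm_ops->get_page(ctx->ptep);
	if (!stage2_try_break_pte(ctx, mmu)) {
		mm_ops->put_page(ctx->ptep);
		return -EAGAIN;
	}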
> > >
> > > Also, AFAICT this misses the CMOs that are required on systems w/o
> > > FEAT_FWB. Without them it is possible that the host will read something
> > > other than what was most recently written by the guest if it is using
> > > noncacheable memory attributes at stage-1.
> > >
> > > I imagine the actual bottleneck is the DSB required after every
> > > CMO/TLBI. Theoretically, the unmap path could be updated to:
> > >
> > >  - Perform the appropriate CMOs for every valid leaf entry *without*
> > >    issuing a DSB.
> > >
> > >  - Elide TLBIs entirely that take place in the middle of the walk
> > >
> > >  - After the walk completes, dsb(ish) to guarantee that the CMOs have
> > >    completed and the invalid PTEs are made visible to the hardware
> > >    walkers. This should be done implicitly by the TLBI implementation
> > >
> > >  - Invalidate the [addr, addr + size) range of IPAs
> > >
> > > This would also avoid over-invalidating stage-1 since we blast the
> > > entire stage-1 context for every stage-2 invalidation. Thoughts?
> > >
> > Correct me if I'm wrong, but if we invalidate the TLB after the walk
> > is complete, don't you think there's a risk of race if the guest can
> > hit in the TLB even though the page was unmapped?
>
> Yeah, we'd need to do the CMOs _after_ making the translation invalid in
> the page tables and completing the TLB invalidation. Apologies.
>
> Otherwise, the only requirement we need to uphold w/ either the MMU
> notifiers or userspace is that the translation has been invalidated at
> the time of return.
>
Actually, my concern about the race was against the hardware. If we
follow the above approach, let's say we invalidated a certain set of
PTEs, but the TLBs aren't yet invalidated. During this window, if
another vCPU accesses the range governed by the invalidated PTEs,
wouldn't it still hit in the TLB? Have I misunderstood you, or am I
missing something?

Thank you.
Raghavendra

> --
> Thanks,
> Oliver
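
(To make the deferred-invalidation scheme discussed above concrete, here is a
rough, non-authoritative outline in kernel style. It folds in Oliver's
correction that the CMOs would follow the TLB invalidation. The helpers marked
"hypothetical" do not exist in the kvm_pgtable API; they only stand in for the
steps being discussed, and the stale-TLB window raised at the end of the
thread is left open here.)

/*
 * Sketch only: deferred TLBI/CMO ordering for a stage-2 unmap of
 * [addr, addr + size). Hypothetical helper names throughout.
 */
static int stage2_unmap_deferred_tlbi(struct kvm_pgtable *pgt, u64 addr, u64 size)
{
	int ret;

	/* Clear the PTEs for the range without a per-entry DSB/TLBI. */
	ret = stage2_clear_ptes_no_tlbi(pgt, addr, size);	/* hypothetical */
	if (ret)
		return ret;

	/* Make the now-invalid PTEs visible to the hardware walkers. */
	dsb(ish);

	/* One range-based stage-2 invalidation instead of a TLBI per entry. */
	stage2_tlbi_ipa_range(pgt->mmu, addr, size);		/* hypothetical */

	/* Only now, on !FEAT_FWB systems, clean+invalidate the unmapped pages. */
	if (!stage2_has_fwb(pgt))
		stage2_cmo_for_range(pgt, addr, size);		/* hypothetical */

	return 0;
}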