References: <20230206172340.2639971-1-rananta@google.com> <20230206172340.2639971-8-rananta@google.com>
From: Raghavendra Rao Ananta
Date: Tue, 4 Apr 2023 14:45:52 -0700
Subject: Re: [PATCH v2 7/7] KVM: arm64: Create a fast stage-2 unmap path
To: Oliver Upton
Cc: h@linux.dev, Oliver Upton, Marc Zyngier, Ricardo Koller, Reiji Watanabe,
    James Morse, Alexandru Elisei, Suzuki K Poulose, Will Deacon,
    Paolo Bonzini, Catalin Marinas, Jing Zhang, Colton Lewis,
    linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
    linux-kernel@vger.kernel.org, kvm@vger.kernel.org
On Tue, Apr 4, 2023 at 2:31 PM Oliver Upton wrote:
>
> On Tue, Apr 04, 2023 at 02:07:06PM -0700, Raghavendra Rao Ananta wrote:
> > On Tue, Apr 4, 2023 at 12:19 PM Oliver Upton wrote:
> > >
> > > On Tue, Apr 04, 2023 at 10:52:01AM -0700, Raghavendra Rao Ananta wrote:
> > > > On Wed, Mar 29, 2023 at 5:42 PM Oliver Upton wrote:
> > > > >
> > > > > On Mon, Feb 06, 2023 at 05:23:40PM +0000, Raghavendra Rao Ananta wrote:
> > > > > > The current implementation of the stage-2 unmap walker
> > > > > > traverses the entire page-table to clear and flush the TLBs
> > > > > > for each entry. This could be very expensive, especially if
> > > > > > the VM is not backed by hugepages. The unmap operation could be
> > > > > > made efficient by disconnecting the table at the very
> > > > > > top (level at which the largest block mapping can be hosted)
> > > > > > and do the rest of the unmapping using free_removed_table().
> > > > > > If the system supports FEAT_TLBIRANGE, flush the entire range
> > > > > > that has been disconnected from the rest of the page-table.
> > > > > >
> > > > > > Suggested-by: Ricardo Koller
> > > > > > Signed-off-by: Raghavendra Rao Ananta
> > > > > > ---
> > > > > >  arch/arm64/kvm/hyp/pgtable.c | 44 ++++++++++++++++++++++++++++++++++++
> > > > > >  1 file changed, 44 insertions(+)
> > > > > >
> > > > > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > > > > index 0858d1fa85d6b..af3729d0971f2 100644
> > > > > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > > > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > > > > @@ -1017,6 +1017,49 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > > > > >  	return 0;
> > > > > >  }
> > > > > >
> > > > > > +/*
> > > > > > + * The fast walker executes only if the unmap size is exactly equal to the
> > > > > > + * largest block mapping supported (i.e. at KVM_PGTABLE_MIN_BLOCK_LEVEL),
> > > > > > + * such that the underneath hierarchy at KVM_PGTABLE_MIN_BLOCK_LEVEL can
> > > > > > + * be disconnected from the rest of the page-table without the need to
> > > > > > + * traverse all the PTEs, at all the levels, and unmap each and every one
> > > > > > + * of them. The disconnected table is freed using free_removed_table().
> > > > > > + */
> > > > > > +static int fast_stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > > > > > +				    enum kvm_pgtable_walk_flags visit)
> > > > > > +{
> > > > > > +	struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
> > > > > > +	kvm_pte_t *childp = kvm_pte_follow(ctx->old, mm_ops);
> > > > > > +	struct kvm_s2_mmu *mmu = ctx->arg;
> > > > > > +
> > > > > > +	if (!kvm_pte_valid(ctx->old) || ctx->level != KVM_PGTABLE_MIN_BLOCK_LEVEL)
> > > > > > +		return 0;
> > > > > > +
> > > > > > +	if (!stage2_try_break_pte(ctx, mmu))
> > > > > > +		return -EAGAIN;
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * Gain back a reference for stage2_unmap_walker() to free
> > > > > > +	 * this table entry from KVM_PGTABLE_MIN_BLOCK_LEVEL - 1.
> > > > > > +	 */
> > > > > > +	mm_ops->get_page(ctx->ptep);
> > > > > Doesn't this run the risk of a potential UAF if the refcount was 1 before
> > > > > calling stage2_try_break_pte()? IOW, stage2_try_break_pte() will drop
> > > > > the refcount to 0 on the page before this ever gets called.
> > > > >
> > > > > Also, AFAICT this misses the CMOs that are required on systems w/o
> > > > > FEAT_FWB. Without them it is possible that the host will read something
> > > > > other than what was most recently written by the guest if it is using
> > > > > noncacheable memory attributes at stage-1.
> > > > >
> > > > > I imagine the actual bottleneck is the DSB required after every
> > > > > CMO/TLBI. Theoretically, the unmap path could be updated to:
> > > > >
> > > > >  - Perform the appropriate CMOs for every valid leaf entry *without*
> > > > >    issuing a DSB.
> > > > >
> > > > >  - Elide TLBIs entirely that take place in the middle of the walk
> > > > >
> > > > >  - After the walk completes, dsb(ish) to guarantee that the CMOs have
> > > > >    completed and the invalid PTEs are made visible to the hardware
> > > > >    walkers. This should be done implicitly by the TLBI implementation
> > > > >
> > > > >  - Invalidate the [addr, addr + size) range of IPAs
> > > > >
> > > > > This would also avoid over-invalidating stage-1 since we blast the
> > > > > entire stage-1 context for every stage-2 invalidation. Thoughts?
> > > > >
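[Illustration only, not part of the series under review: a rough sketch of the deferred-maintenance unmap Oliver outlines above. It borrows the shape and helper names of the existing stage2_unmap_walker() in pgtable.c where possible; the wrapper, the walker wiring, the no-DSB behaviour assumed for the CMO callback, and the ranged-flush call are assumptions rather than existing API.]

/*
 * Sketch: visit each valid leaf, do the CMO on !FEAT_FWB systems,
 * invalidate the PTE, but issue no per-entry TLBI and no per-entry DSB.
 * Table-level teardown is elided for brevity.
 */
static int stage2_unmap_walker_deferred(const struct kvm_pgtable_visit_ctx *ctx,
					enum kvm_pgtable_walk_flags visit)
{
	struct kvm_pgtable *pgt = ctx->arg;
	struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;

	if (!kvm_pte_valid(ctx->old))
		return 0;

	/*
	 * Clean+invalidate to PoC when FEAT_FWB is absent. The existing
	 * dcache_clean_inval_poc callback may itself imply a DSB; a no-DSB
	 * variant is assumed here.
	 */
	if (stage2_pte_cacheable(pgt, ctx->old) && !stage2_has_fwb(pgt))
		mm_ops->dcache_clean_inval_poc(kvm_pte_follow(ctx->old, mm_ops),
					       kvm_granule_size(ctx->level));

	/* Make the PTE invalid and drop its reference; elide the TLBI. */
	kvm_clear_pte(ctx->ptep);
	mm_ops->put_page(ctx->ptep);

	return 0;
}

static int stage2_unmap_deferred(struct kvm_pgtable *pgt, u64 addr, u64 size)
{
	struct kvm_pgtable_walker walker = {
		.cb	= stage2_unmap_walker_deferred,
		.arg	= pgt,
		.flags	= KVM_PGTABLE_WALK_LEAF,
	};
	int ret = kvm_pgtable_walk(pgt, addr, size, &walker);

	/* One barrier covers all the deferred CMOs and PTE updates... */
	dsb(ish);

	/*
	 * ...followed by a single stage-2 invalidation of [addr, addr + size).
	 * A range-flush helper of this shape is assumed (name and signature
	 * hypothetical); without FEAT_TLBIRANGE this would fall back to a
	 * full VMID invalidation via __kvm_tlb_flush_vmid().
	 */
	kvm_call_hyp(__kvm_tlb_flush_vmid_range, pgt->mmu, addr, size);

	return ret;
}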
> > > > Correct me if I'm wrong, but if we invalidate the TLB after the walk
> > > > is complete, don't you think there's a risk of race if the guest can
> > > > hit in the TLB even though the page was unmapped?
> > > Yeah, we'd need to do the CMOs _after_ making the translation invalid in
> > > the page tables and completing the TLB invalidation. Apologies.
> > >
> > > Otherwise, the only requirement we need to uphold w/ either the MMU
> > > notifiers or userspace is that the translation has been invalidated at
> > > the time of return.
> > >
> > Actually, my concern about the race was against the hardware. If we
> > follow the above approach, let's say we invalidated a certain set of
> > PTEs, but the TLBs aren't yet invalidated. During this point if
> > another vCPU accesses the range governed by the invalidated PTEs,
> > wouldn't it still hit in the TLB? Have I misunderstood you or am I
> > missing something?
>
> Yep, that's exactly what would happen. There is no way to eliminate the
> race you mention, there will always be a window of time where the page
> tables no longer contain a particular translation but the TLBs may still
> be holding a valid entry.
>
This is new to me :)

> This race is benign so long as we guarantee that all translations for
> the affected address (i.e. in the page tables, cached in a TLB) have
> been invalidated before returning to the caller. For example, MM cannot
> start swapping out guest memory until it is guaranteed that the guest is
> no longer writing to it.
>
Well, if you feel that the risk is acceptable, we can probably defer
the invalidations until the unmap walk is finished.

Thank you.
Raghavendra

> --
> Thanks,
> Oliver