Date: Tue, 4 Apr 2023 19:19:28 +0000
From: Oliver Upton
To: Raghavendra Rao Ananta, h@linux.dev
Cc: Oliver Upton, Marc Zyngier, Ricardo Koller, Reiji Watanabe,
	James Morse, Alexandru Elisei, Suzuki K Poulose, Will Deacon,
	Paolo Bonzini, Catalin Marinas, Jing Zhang, Colton Lewis,
	linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
	linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Subject: Re: [PATCH v2 7/7] KVM: arm64: Create a fast stage-2 unmap path
References: <20230206172340.2639971-1-rananta@google.com>
	<20230206172340.2639971-8-rananta@google.com>

On Tue, Apr 04, 2023 at 10:52:01AM -0700, Raghavendra Rao Ananta wrote:
> On Wed, Mar 29, 2023 at 5:42 PM Oliver Upton wrote:
> >
> > On Mon, Feb 06, 2023 at 05:23:40PM +0000, Raghavendra Rao Ananta wrote:
> > > The current implementation of the stage-2 unmap walker
> > > traverses the entire page-table to clear and flush the TLBs
> > > for each entry. This could be very expensive, especially if
> > > the VM is not backed by hugepages.
> > > The unmap operation could be
> > > made efficient by disconnecting the table at the very
> > > top (level at which the largest block mapping can be hosted)
> > > and do the rest of the unmapping using free_removed_table().
> > > If the system supports FEAT_TLBIRANGE, flush the entire range
> > > that has been disconnected from the rest of the page-table.
> > >
> > > Suggested-by: Ricardo Koller
> > > Signed-off-by: Raghavendra Rao Ananta
> > > ---
> > >  arch/arm64/kvm/hyp/pgtable.c | 44 ++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 44 insertions(+)
> > >
> > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > index 0858d1fa85d6b..af3729d0971f2 100644
> > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > @@ -1017,6 +1017,49 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > >  	return 0;
> > >  }
> > >
> > > +/*
> > > + * The fast walker executes only if the unmap size is exactly equal to the
> > > + * largest block mapping supported (i.e. at KVM_PGTABLE_MIN_BLOCK_LEVEL),
> > > + * such that the underneath hierarchy at KVM_PGTABLE_MIN_BLOCK_LEVEL can
> > > + * be disconnected from the rest of the page-table without the need to
> > > + * traverse all the PTEs, at all the levels, and unmap each and every one
> > > + * of them. The disconnected table is freed using free_removed_table().
> > > + */
> > > +static int fast_stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > > +				    enum kvm_pgtable_walk_flags visit)
> > > +{
> > > +	struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
> > > +	kvm_pte_t *childp = kvm_pte_follow(ctx->old, mm_ops);
> > > +	struct kvm_s2_mmu *mmu = ctx->arg;
> > > +
> > > +	if (!kvm_pte_valid(ctx->old) || ctx->level != KVM_PGTABLE_MIN_BLOCK_LEVEL)
> > > +		return 0;
> > > +
> > > +	if (!stage2_try_break_pte(ctx, mmu))
> > > +		return -EAGAIN;
> > > +
> > > +	/*
> > > +	 * Gain back a reference for stage2_unmap_walker() to free
> > > +	 * this table entry from KVM_PGTABLE_MIN_BLOCK_LEVEL - 1.
> > > +	 */
> > > +	mm_ops->get_page(ctx->ptep);
> >
> > Doesn't this run the risk of a potential UAF if the refcount was 1 before
> > calling stage2_try_break_pte()? IOW, stage2_try_break_pte() will drop
> > the refcount to 0 on the page before this ever gets called.
> >
> > Also, AFAICT this misses the CMOs that are required on systems w/o
> > FEAT_FWB. Without them it is possible that the host will read something
> > other than what was most recently written by the guest if it is using
> > noncacheable memory attributes at stage-1.
> >
> > I imagine the actual bottleneck is the DSB required after every
> > CMO/TLBI. Theoretically, the unmap path could be updated to:
> >
> >  - Perform the appropriate CMOs for every valid leaf entry *without*
> >    issuing a DSB.
> >
> >  - Elide TLBIs entirely that take place in the middle of the walk
> >
> >  - After the walk completes, dsb(ish) to guarantee that the CMOs have
> >    completed and the invalid PTEs are made visible to the hardware
> >    walkers. This should be done implicitly by the TLBI implementation
> >
> >  - Invalidate the [addr, addr + size) range of IPAs
> >
> > This would also avoid over-invalidating stage-1 since we blast the
> > entire stage-1 context for every stage-2 invalidation. Thoughts?
> >
> Correct me if I'm wrong, but if we invalidate the TLB after the walk
> is complete, don't you think there's a risk of race if the guest can
> hit in the TLB even though the page was unmapped?
Yeah, we'd need to do the CMOs _after_ making the translation invalid in
the page tables and completing the TLB invalidation. Apologies.

Otherwise, the only requirement we need to uphold w/ either the MMU
notifiers or userspace is that the translation has been invalidated at
the time of return.
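Something like the below is the shape of what I have in mind. Completely
untested, and stage2_unmap_walker_lazy(), stage2_unmap_range_lazy(),
kvm_tlb_flush_vmid_range(), and stage2_range_cmo() are made-up names for
illustration, not existing kvm_pgtable API:

/*
 * Untested sketch of the ordering above: the walker zaps PTEs without a
 * per-entry TLBI + DSB, then the caller pays for one barrier and one
 * ranged invalidation, and does the CMOs last.
 */
static int stage2_unmap_walker_lazy(const struct kvm_pgtable_visit_ctx *ctx,
				    enum kvm_pgtable_walk_flags visit)
{
	if (!kvm_pte_valid(ctx->old))
		return 0;

	/* Zap the PTE, eliding the usual TLBI + DSB per leaf entry. */
	kvm_clear_pte(ctx->ptep);
	return 0;
}

static void stage2_unmap_range_lazy(struct kvm_s2_mmu *mmu, u64 addr, u64 size)
{
	/* ... walk [addr, addr + size) with the walker above ... */

	/*
	 * A single barrier makes every zapped PTE visible to the hardware
	 * walkers before the invalidation below.
	 */
	dsb(ish);

	/* One ranged invalidation covering the whole span of IPAs. */
	kvm_tlb_flush_vmid_range(mmu, addr, size);

	/*
	 * CMOs go last (only needed w/o FEAT_FWB), once the guest can no
	 * longer write through stale TLB entries.
	 */
	stage2_range_cmo(mmu, addr, size);
}

--
Thanks,
Oliver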